Category Archives: technology

Measuring the memory footprint of a Linux/macOS application

 

If you’re selling an API or an application that gets deployed on production systems, one of the questions your customers might ask is what the memory footprint of your API/application is, so they can account for the increased memory requirements of adopting your product. After some research I think the best tool for measuring and debugging any increase/decrease of your memory footprint is valgrind --tool=massif, together with the ms_print reporting tool.

Massif is a heap memory profiler: it measures how much heap memory your code allocates, and when, and shows the code involved. Run:

valgrind --tool=massif ./your_program

This will execute the program and generate a massif.out.<pid> file that you can visualize with

ms_print massif.out.<pid>

Give it a try: the output is absolutely useful, and you get a histogram of how much memory is in use at every sampling moment.
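To see what massif reports, a toy allocation pattern helps. This is a hypothetical example of mine (function names grab/release are made up, not tied to any real product): allocate in one phase, release in another, and the ms_print graph shows the corresponding steps up and down.

```c
#include <stdlib.h>

/* Allocate n blocks of sz bytes each; in the massif report these
   allocations are attributed to this function. */
char **grab(size_t n, size_t sz)
{
    char **blocks = malloc(n * sizeof *blocks);
    if (blocks == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        blocks[i] = malloc(sz);
    return blocks;
}

/* Free everything grab() allocated; the massif graph drops back here. */
void release(char **blocks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        free(blocks[i]);
    free(blocks);
}
```

Calling grab(1000, 1024) from main and releasing it later, then running the binary under valgrind --tool=massif and feeding the output to ms_print, shows a roughly 1 MB step up and back down in the graph, attributed to grab().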

 

Design for simplicity : software options

xkcd.com

(I was highly inspired by Evan Martin’s post before trying to recap the topic here.)
Think twice before adding an option to a piece of software; options are:

  • bad user experience: good software just works, without asking the user too many things.
  • a maintenance cost: an option is a feature you have to maintain/test for the lifetime of the software.
  • a documentation cost: every option, even the most cryptic, should be documented, explained, and illustrated with examples of use. When you actually try to do this, the option often ends up left out of the documentation.
  • usually a way to delegate to the user a decision the designer failed to make: “shall we open the last opened document, or shall we add an option ‘always open a blank document’?”
  • remember the paradox of choice.
  • a source of bad GUIs: option settings are hard to present, group, or organize.

On the other hand, consider that any change you make will probably break someone’s workflow, no doubt about it, so the temptation to add an option that restores the previous behavior is high.

The old Windows search wizard was awesome; read this: https://www.joelonsoftware.com/2000/04/12/choices/

WTF/Minute : code quality measure unit

Some day your code will be reviewed, modified, or even just read to understand the underlying algorithm, by someone else who is not you, the person who wrote it…

My point here is that readability/maintainability is by far the biggest quality of a piece of software. Being able to read the code and understand it rapidly makes everything easier, whether bugfixing or modifying/enhancing the code. Consider that the next person reading the code might not be as experienced as you with that language, so:

  • Avoid unnecessarily complicated constructs that feed your coder ego but slow down maintenance. You are a good coder, no doubt, but if your code is not readable by the average coder who will have to work on it, you are not doing your colleagues a good service.
  • Make clear what your code is doing and, if necessary, use a comment to clarify what is not evident from the code (I’m not a huge fan of comments because most of them don’t say anything more than the code does). Use comments to refer to the requirement/user story being implemented.
  • Stick to the language style guidelines set for your team/company.
  • Tests are code too, and are often read/reviewed more than production code: don’t fuck up in test code just because “this is just a test, this code will not run in production”.
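To make the first point concrete, here is a contrived sketch of mine (both function names are made up): the two functions count the set bits of a word and return the same answer, but only one of them is obvious to a colleague who doesn’t already know the idiom.

```c
#include <stdint.h>

/* Clever: Kernighan's trick. n &= n - 1 clears the lowest set bit,
   so the loop runs once per set bit. Compact, but opaque. */
int popcount_clever(uint32_t n)
{
    int c = 0;
    for (; n; c++)
        n &= n - 1;
    return c;
}

/* Clear: test every bit explicitly. A few lines longer,
   instantly readable by the average maintainer. */
int popcount_clear(uint32_t n)
{
    int count = 0;
    for (int bit = 0; bit < 32; bit++)
        if (n & (1u << bit))
            count++;
    return count;
}
```

If the clever version really is needed (say, for a hot loop), the comment naming the idiom is what keeps it maintainable.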

The discussion on code quality is huge, of course, and these are just simple rules that impact the maintainability of your code (after all, a reviewer does the same activity as the next coder who will have to work on it: read it). If you want to dig deeper into code quality measurement, I suggest this really interesting work by Idan Amit and Dror G. Feitelson: https://arxiv.org/pdf/2007.10912.pdf


Milan skies after fires in the alps, 2018

Hash functions – efficiency and speed

A hash function is any function that can be used to map data of arbitrary size to data of a fixed size. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. Hash functions are often used in combination with a hash table, a common data structure used in computer software for rapid data lookup.

If you happen to need a non-cryptographic hash function (and you might, even if you are using languages with built-in hash tables like Java or C++), here are some references taken from various places:

An interesting comparison of different hash functions:

http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx

Some more hash functions here :

http://www.cse.yorku.ca/~oz/hash.html
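For a taste of how small these functions are, here is djb2 (one of the functions listed on that page, by Dan Bernstein) sketched in C:

```c
/* djb2 by Dan Bernstein: hash = hash * 33 + c, starting from 5381.
   Simple, fast, and a reasonable distribution for string keys. */
unsigned long djb2(const unsigned char *str)
{
    unsigned long hash = 5381;
    int c;
    while ((c = *str++) != 0)
        hash = ((hash << 5) + hash) + c;  /* hash * 33 + c */
    return hash;
}
```

The shift-and-add form of the multiply by 33 is the classic formulation; modern compilers generate the same code either way.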

Some really interesting benchmarks by Peter Kankowski: strchr.com/hash_functions.

And again, hash benchmarks: sanmayce.com/Fastest_Hash/.

Hash benchmarks in Rust: https://medium.com/@tprodanov/benchmarking-non-cryptographic-hash-functions-in-rust-2e6091077d11

And more benchmarks : https://medium.com/logos-network/benchmarking-hash-and-signature-algorithms-6079735ce05

Some (a minimal set of) benchmarks, run using random keys of random length, a 1.5 multiply factor on hash size, and next-power-of-2 hash sizes:

Tests are run on random keys
--- hash_function speed on random long (100-350) keys
Keys    HashFunc         HashSize AvgTime(ns) Collisions LongestChain DimFactor
1000000 djb_hash         4194304    191.03      110620    6            1.500000
1000000 murmur64a_hash   4194304    60.87       110830    7            1.500000
1000000 wyhash_hash      4194304    36.69       110170    5            1.500000
1000000 elf_hash         4194304    444.25      110488    5            1.500000
1000000 jen_hash         4194304    143.90      110471    5            1.500000
1000000 djb2_hash        4194304    189.63      111048    6            1.500000
1000000 sdbm_hash        4194304    247.92      110087    6            1.500000
1000000 fnv_hash         4194304    247.87      110170    6            1.500000
1000000 oat_hash         4194304    309.55      110538    6            1.500000

--- hash_function speed on short keys (10 to 45 char len)
Keys    HashFunc         HashSize AvgTime(ns) Collisions LongestChain DimFactor
1000000 djb_hash         4194304    31.14       110363    6            1.500000
1000000 murmur64a_hash   4194304    28.24       110264    6            1.500000
1000000 wyhash_hash      4194304    21.84       109798    5            1.500000
1000000 elf_hash         4194304    49.88       110845    5            1.500000
1000000 jen_hash         4194304    33.28       110541    5            1.500000
1000000 djb2_hash        4194304    31.21       110664    6            1.500000
1000000 sdbm_hash        4194304    35.60       109906    6            1.500000
1000000 fnv_hash         4194304    35.59       109940    6            1.500000
1000000 oat_hash         4194304    43.28       110123    6            1.500000

Tests were done with the hash table code you can find at github.com/paulborile/clibs.
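For reference, the Collisions and LongestChain columns in the tables above can be computed with a simple bucket-depth count. This is a minimal sketch of mine (not the actual clibs code), assuming the hashes are already computed and the table size is a power of 2 as in the benchmarks:

```c
#include <stdlib.h>

/* Count collisions and the longest chain that n precomputed hashes
   would produce in a chained table of `size` buckets.
   size must be a power of 2. */
void chain_stats(const unsigned long *hashes, size_t n, size_t size,
                 size_t *collisions, size_t *longest)
{
    size_t *depth = calloc(size, sizeof *depth);
    *collisions = 0;
    *longest = 0;
    for (size_t i = 0; i < n; i++) {
        size_t b = hashes[i] & (size - 1);  /* power-of-2 modulo */
        if (depth[b] > 0)
            (*collisions)++;                /* bucket already occupied */
        depth[b]++;
        if (depth[b] > *longest)
            *longest = depth[b];
    }
    free(depth);
}
```

With 1,000,000 keys in 4,194,304 buckets, an ideal uniform hash lands around 110,000 collisions, which is why all the functions above cluster near that figure; the AvgTime column is where they really differ.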

And last, a research study on SIMD-optimized hash functions: arxiv.org/pdf/1612.06257.pdf, with some theory and code at github.com/google/highwayhash/blob/master/c/highwayhash.c. They all use the vector extensions present in gcc.

gcc/clang -fsanitize is saving lives

There were times when you had to write C/C++ code and find all the bounds errors and memory leaks by hand, or with ancient tools (who remembers Purify?). Valgrind introduced a lot of features, but the sanitize features added to gcc/clang are just awesome (thanks to the Google folks for this). Just add -fsanitize=address to both the compilation and the link of your unit tests (we use the Google Test framework), execute the tests, and get valuable info on memory leaks and out-of-bounds access to data.
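Here is a minimal sketch of mine showing the kind of bug AddressSanitizer pinpoints. The offending lines are commented out so the program runs clean; re-enable either one, build with gcc -g -fsanitize=address, and the run aborts with an exact file:line report.

```c
#include <stdlib.h>
#include <string.h>

/* Exercises the allocation pattern ASan instruments: a heap buffer
   filled right up to its last byte, then freed. */
int touch_heap(void)
{
    char *buf = malloc(8);
    if (buf == NULL)
        return -1;
    strcpy(buf, "1234567");   /* 7 chars + NUL: exactly fits */
    /* buf[8] = '!'; */       /* off-by-one write: ASan reports
                                 heap-buffer-overflow here */
    int first = buf[0];
    free(buf);
    /* first = buf[0]; */     /* read after free: ASan reports
                                 heap-use-after-free here */
    return first;             /* '1' */
}
```

Note that both commented bugs run silently in a normal build; that silence is exactly what the sanitizer removes.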

Testing and adoption of the sanitizers is under way in our lab: more info on this in a future post. Work from this team led to the inclusion of the data race detector in Go.