Category Archives: coding

Hash functions – efficiency and speed

A hash function is any function that can be used to map data of arbitrary size to data of a fixed size. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. Hash functions are often used in combination with a hash table, a common data structure used in computer software for rapid data lookup.

If you happen to need a non-cryptographic hash function (and you might need one even if you are using languages with built-in hash tables, like Java or C++), here are some references collected from various sources:

An interesting comparison of different hash tables:

http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx

Some more hash functions here:

http://www.cse.yorku.ca/~oz/hash.html
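
If you just need something quick to paste in, this is the classic djb2 function (one of those listed on the page above), together with the power-of-two masking used for the benchmarks below; the table size and the sample key are placeholders of mine:

/* djb2 (Bernstein) hash; the table size, the masking step and the
 * sample key are only for illustration. */
#include <stdio.h>

unsigned long djb2_hash(const unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++) != 0)
        hash = ((hash << 5) + hash) + c;   /* hash * 33 + c */

    return hash;
}

int main(void)
{
    const unsigned long table_size = 8388608;   /* a power of 2 */
    unsigned long h = djb2_hash((const unsigned char *)"Mozilla/5.0");

    /* with power-of-2 table sizes the modulo reduces to a mask */
    printf("hash %lu -> bucket %lu\n", h, h & (table_size - 1));
    return 0;
}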

Some really interesting benchmarks here by Peter Kankowski.

And again hash benchmarks here.

Some benchmarks (a minimal set) run using User-Agent strings as keys, a 1.5 multiply factor on the hash size, and next-power-of-2 hash sizes:

Keys 3130591, Average fh_default_hash(8388608) time in nanosecs : 292.56
Total collisions 517939, longest chain 6, hash_dim_factor 1.500000
----------------------
Keys 3130591, Average djb_hash(8388608) time in nanosecs : 181.24
Total collisions 517445, longest chain 7, hash_dim_factor 1.500000
----------------------
Keys 3130591, Average sax_hash(8388608) time in nanosecs : 244.49
Total collisions 517431, longest chain 7, hash_dim_factor 1.500000
----------------------
Keys 3130591, Average jsw_hash(8388608) time in nanosecs : 141.39
Total collisions 522951, longest chain 14, hash_dim_factor 1.500000
----------------------
Keys 3130591, Average jen_hash(8388608) time in nanosecs : 148.54
Total collisions 518565, longest chain 7, hash_dim_factor 1.500000
----------------------
Keys 3130591, Average djb2_hash(8388608) time in nanosecs : 182.37
Total collisions 516405, longest chain 7, hash_dim_factor 1.500000
----------------------
Keys 3130591, Average sdbm_hash(8388608) time in nanosecs : 238.93
Total collisions 525513, longest chain 7, hash_dim_factor 1.500000
----------------------
Keys 3130591, Average fnv_hash(8388608) time in nanosecs : 237.01
Total collisions 517530, longest chain 7, hash_dim_factor 1.500000
----------------------
Keys 3130591, Average oat_hash(8388608) time in nanosecs : 294.78
Total collisions 517939, longest chain 6, hash_dim_factor 1.500000
----------------------


Tests were done with the hashtable code you can find here.
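
For context, the numbers above come from a separate-chaining table: roughly speaking, "total collisions" counts keys that land in an already occupied bucket and "longest chain" is the longest list of colliding entries a lookup may have to walk. Below is a generic sketch of that data structure (my own simplified version, not the linked code):

/* Generic separate-chaining hash table sketch (not the linked code):
 * colliding keys are linked together in the same bucket and a lookup
 * walks that chain. */
#include <stdlib.h>
#include <string.h>

struct entry {
    char *key;
    void *value;
    struct entry *next;          /* next entry in the same bucket */
};

struct table {
    struct entry **buckets;
    unsigned long size;          /* a power of 2, so % becomes a mask */
};

struct table *table_new(unsigned long size)
{
    struct table *t = malloc(sizeof *t);
    t->size = size;
    t->buckets = calloc(size, sizeof *t->buckets);
    return t;
}

void table_put(struct table *t, const char *key, void *value,
               unsigned long (*hash)(const char *))
{
    unsigned long b = hash(key) & (t->size - 1);
    struct entry *e = malloc(sizeof *e);
    e->key = strdup(key);
    e->value = value;
    e->next = t->buckets[b];     /* a collision just grows this chain */
    t->buckets[b] = e;
}

void *table_get(struct table *t, const char *key,
                unsigned long (*hash)(const char *))
{
    unsigned long b = hash(key) & (t->size - 1);
    for (struct entry *e = t->buckets[b]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e->value;
    return NULL;
}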

And last, a research study (which I'm still working on) on SIMD-optimized hash functions: some theory here and code here. They all use the vector extensions present in gcc.
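
The vector extension syntax itself is easy to play with; here is a toy sketch of mine (not the code from the study): a lane-parallel, djb2-style loop that hashes four equally long keys at once, with gcc lowering the element-wise arithmetic to SIMD instructions where the target supports them.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* gcc vector extension: four 32-bit lanes in one 16-byte vector */
typedef uint32_t v4u32 __attribute__((vector_size(16)));

/* Toy example: hash four equally long keys in parallel, one per lane,
 * using a djb2-style multiply-and-add step. */
void hash4(const uint8_t *keys[4], size_t len, uint32_t out[4])
{
    const v4u32 mul = {33, 33, 33, 33};
    v4u32 h = {5381, 5381, 5381, 5381};

    for (size_t i = 0; i < len; i++) {
        v4u32 c = {0, 0, 0, 0};
        c[0] = keys[0][i]; c[1] = keys[1][i];
        c[2] = keys[2][i]; c[3] = keys[3][i];
        h = h * mul + c;             /* all four lanes updated at once */
    }
    for (int lane = 0; lane < 4; lane++)
        out[lane] = h[lane];
}

int main(void)
{
    const uint8_t *keys[4] = { (const uint8_t *)"alpha", (const uint8_t *)"bravo",
                               (const uint8_t *)"chess", (const uint8_t *)"delta" };
    uint32_t out[4];

    hash4(keys, 5, out);
    for (int i = 0; i < 4; i++)
        printf("lane %d: %u\n", i, out[i]);
    return 0;
}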

 

Code benchmarks: how can I measure how fast my software is (and make it faster)?

You have probably heard a coder say "this way is faster" many times: nothing could be more wrong nowadays, unless you prove the statement by measuring the same piece of code running with algorithm A and with algorithm B (the allegedly faster one). There was a time when you could measure the execution time of a program 1, 2 or 10 times without noticeable variability: those times are gone. Measuring how fast a piece of software runs is about getting a set of results with the lowest possible variability.

Variability in computing time on modern computer architectures is just unavoidable; while we can guarantee the results of a computation, we cannot guarantee how fast that computation will be:

"Computers can reproduce answers, not performance": Bryce Adelstein Lelbach, https://youtu.be/zWxSZcpeS8Q?t=6m45s

Reasons for variance in computation time can be recapped as:

  • Hardware jitter: instruction pipelines, CPU frequency scaling and power management, shared caches and many other things
  • OS activities: a huge list of things the kernel can do to screw up your benchmark's performance
  • Observer effect: every time we instrument code to measure performance we introduce variance.

Also, warming up the CPU seems to have become necessary to get meaningful results. Running hot instead of cold on a single piece of code is well described here:

https://youtu.be/zWxSZcpeS8Q?t=18m51s

This said, the process for benchmarking a piece of code is:

  • Write some benchmark code: define a subset of execution (let's call it op1) in your code and measure how many op1 executions per second you get (op1/sec). Do the same for other parts (op2, op3), but make sure all parts can be run independently from each other
  • Ensure that the population of samples you get is "normally" distributed

In detail, I use C code to measure, and:

  • Use the GNU Scientific Library (gsl) for running stats: mean, median, variance, stdev, skew
  • Use clock_gettime() with CLOCK_MONOTONIC. An example code here (see also the sketch after this list).
  • Check whether the returned set of samples has a 'normal' distribution: find the algorithm you like here (I use |mean - median| / max(mean, median) < 1%, not really correct but OK for my purposes).
  • Narrow down as much as possible the size of the code you are measuring
  • Warm up the CPU before taking samples to avoid having too much variability. You do this by running the code you want to measure many times before actually taking samples. Ideally you should run it until you have normally distributed results. If you don't, then probably the piece of code you are measuring is too big and you are experiencing the effects of other perturbations (OS? I/O?).
  • Consider using a benchmark framework: Google Benchmark (C++) is great for this, but I'm afraid you will have to measure C++ code.
  • Lock the CPU clock (if you can) on the host you are using for benchmarks.
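
Putting the pieces together, here is a minimal sketch of that recipe, assuming GSL's running-statistics API (gsl_rstat); op1(), the warm-up count and the sample count are placeholders of mine, and this is not the exact harness used for the numbers above.

/* Minimal benchmark-harness sketch (placeholder op1 and counts).
 * Build with something like: gcc bench.c -lgsl -lgslcblas -lm */
#include <stdio.h>
#include <math.h>
#include <time.h>
#include <gsl/gsl_rstat.h>

static void op1(void)                    /* stand-in for the code under test */
{
    static volatile unsigned sink;
    for (int i = 0; i < 1000; i++)
        sink += i;
}

static double elapsed_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    struct timespec start, stop;
    gsl_rstat_workspace *rs = gsl_rstat_alloc();

    for (int i = 0; i < 100000; i++)     /* warm up the cpu, discard results */
        op1();

    for (int i = 0; i < 10000; i++) {    /* the real samples */
        clock_gettime(CLOCK_MONOTONIC, &start);
        op1();
        clock_gettime(CLOCK_MONOTONIC, &stop);
        gsl_rstat_add(elapsed_ns(start, stop), rs);
    }

    double mean = gsl_rstat_mean(rs), median = gsl_rstat_median(rs);
    double spread = fabs(mean - median) / fmax(mean, median);

    printf("mean %.2f ns, median %.2f ns, sd %.2f, skew %.3f\n",
           mean, median, gsl_rstat_sd(rs), gsl_rstat_skew(rs));
    printf("|mean - median| / max = %.4f (%s)\n", spread,
           spread < 0.01 ? "looks normal by my rough check" : "too skewed, investigate");

    gsl_rstat_free(rs);
    return 0;
}

If the spread jumps around between runs, the warm-up is probably not long enough or the chunk of code being measured is too big.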

Prepare to see strange results 🙂

gcc/clang -fsanitize is saving lives

There were times when you had to write C/C++ code and find all the bounds errors and memory leaks by hand, or with ancient tools (who remembers Purify?). Valgrind introduced a lot of features, but the sanitize features added to gcc/clang are just awesome (thanks, Google folks, for this). Just add -fsanitize=address to both the compilation and the link of your unit tests (we use the Google Test framework), execute the tests and get valuable info on memory leaks and out-of-bounds accesses to data.
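
As a minimal illustration (a made-up snippet, not code from our test suite), a heap out-of-bounds write like the one below is caught and reported at run time once the flag is on:

/* Hypothetical example: a heap out-of-bounds write.
 * Build with the flag on both compile and link steps, e.g.:
 *     gcc -g -fsanitize=address oob.c -o oob && ./oob
 * ASan aborts with a heap-buffer-overflow report pointing at the bad line. */
#include <stdlib.h>

int main(void)
{
    int *a = malloc(10 * sizeof *a);
    a[10] = 42;              /* one element past the end of the allocation */
    free(a);
    return 0;
}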

Testing and adoption of the sanitizers is under way in our lab: more info on this in a future post. Work from this team also led to the inclusion in Golang of the Data Race Detector.

 

Creating a logstash development environment for core or plugins

I had to develop a Ruby logstash plugin recently, and I'm going to recap all the necessary steps to create a clean JRuby environment for development:

  • I had to uninstall Ruby and install a clean JRuby-only stack. I'm not saying this is necessary, but in my case it only worked this way:
gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
curl -sSL https://get.rvm.io | bash -s stable --ruby=jruby-9.1.10.0
rvm alias create default jruby-9.1.10.0
source $HOME/.rvm/scripts/rvm
ruby -v
jruby 9.1.10.0 (2.3.3) 2017-05-25 b09c48a OpenJDK 64-Bit Server VM 25.151-b12 on 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 +jit [linux-x86_64]
  • Install and compile logstash just to see that everything works. Check out logstash and, from inside the directory:
jruby -S gem install rake bundler
rake bootstrap
rake plugin:install-default
bin/logstash -e 'input { stdin { } } output { stdout {} }'

  • Now build the plugin from inside the plugin folder:
bundle install
bundle exec rspec # test it
gem build logstash-filter-<yourplugin>.gemspec
  • Install the plugin gem and test it
bin/logstash-plugin install logstash-filter-<yourplugin>-1.0.0.gem

 

Google C++ style guide: something every C++ coder out there should read

Having over 100 million lines of C++ code, Google took the C++ problem seriously (a language with so many features, different ways of doing the same thing, and dangerous constructs; some pointers to the problems with C++ can be found on Wikipedia, and here for an authoritative opinion), and this is what came out.

Some interesting decisions they took:

“We do not use C++ exceptions.” from here

“Avoid defining macros, especially in headers; prefer inline functions, enums, and const variables. Name macros with a project-specific prefix. Do not use macros to define pieces of a C++ API.” from here.

“Avoid complicated template programming” from here

 

REST API server and Windows: ASP.NET Core Kestrel, OWIN, or plain old IIS + ASP.NET Framework?

If you are planning to build APIs and make them available as microservices, and your code assets are C#/.NET Framework/Windows, then you have some choices:

  1. ASP.NET Core comes with an internal web server called Kestrel: Kestrel claims to be a light and fast web server that runs ASP.NET Core applications without the need for IIS in front (you might decide that IIS is still useful for other things). Kestrel supports middleware modules (similar to IIS ISAPI filters) that add features to the basic server.
  2. Katana (Microsoft's OWIN implementation) is a collection of NuGet packages for building OWIN applications.
  3. Plain old IIS + ASP.NET REST API code

Some good information can be found here, from the uShip guys.
I'm basically writing this down for myself as a reminder about the new .NET Core Microsoft environment.

On the importance of having a second head to test code, and TDD (coders just don't want to break their code)

Good coders don't want to break their code, because their code is like a creature, and nobody wants to harm a creature. Take note, this is how it is. Good coders' tests just 'comb' the code they should be testing. But this creature has bugs, no matter how good the coder is, so good coders are bad at writing tests for their own code.

A good test tries to tear code apart, to abuse it by doing the nastiest things, at the maximum possible speed, in the worst possible way: this is why you should consider having a second set of tests written by another coder who did not write the code itself.

But with TDD this is difficult: I like the idea that a coder writes the software's use cases before writing the software itself, but I don't yet see how to include in this process a second head who develops some really nasty tests on that piece of code, and I'm not a great fan of pair programming (for many reasons).

So this is an open issue for me: how to include integration tests developed by a second head.

What's wrong with the word "simple"? (on over-engineering code)

Over-engineering code is a common plague that leads to a number of unwanted side effects in your software ranch. This post does not aim to be an exhaustive guide on over-engineering, but with the help of many quotes from different sources I'll try to summarize the impacts, the symptoms that give you evidence of over-engineered code, and the causes (so you can try to avoid them).

Impacts

  • unnecessary complexity => less agility in the software
  • more code, more bugs
  • more code, more unit tests, more development time
  • performance impacts
  • longer catch up time for newcomers
  • difficult maintenance, fixing bugs requires more time than it should
  • small changes require big efforts and are prone to having many bugs
  • unwanted features are present even though there is no reference to them in requirements/backlogs, so you have to maintain them even if they are not used
  • a false perception of complexity in the code that leads to over-inflated estimates for fixes and changes
  • extending an over-engineered project makes extensions more complex than necessary
  • dismantling over-engineered code takes way longer than it took to over-engineer it
  • the need, over time, for a complete rewrite to reduce unneeded complexity when maintenance time becomes higher than rewrite time (this is an interesting topic in itself and needs a separate post)

Concrete symptoms of over-engineered code

It is really difficult to point out easy and simple guidelines for recognizing over-engineered code. Often it is an individual perception, and this does not help. I'll try:

  • When a well-experienced coder, who has been working on that piece of code for at least a year, still takes way longer than necessary/affordable to figure out where/how a specific feature works (I'm still trying to find a better formulation for this concept)
  • In code: using a factory even when it only makes one type of object
  • In code: using an interface even when it is actually going to be implemented by just one class
  • From PrematureGeneralization:

“This is really going to be a clean framework. I’ll make an abstract class out of this part so that folks can subclass it later, and I’ll put in a bunch of well-commented overridable hooks in the concrete subclasses so that folks can use them as templates, and just in case somebody ever needs to build special debug subclasses, I’ll put in extra stubs over there (somebody will thank me for ’em one of these days). Yeah, this is really going to be neat.” Thus is bloated software produced when our artistic sense gets the better of us. — DaveSmith

Some other signs of over-engineering, taken from Stack Overflow:

One very strong warning sign of over engineering is when everything goes through so much indirection that it’s hard to find the piece of code that actually implements some concrete, domain-level piece of functionality. If you find that most of your functions do very little concrete work and just call other virtual functions, you may have a problem.

https://softwareengineering.stackexchange.com/questions/193929/too-object-oriented?newreg=f7722bef1ba7496e9d8b4e7d328cadf1

[..] Make factories where the factories make more then one type of object. Use dependency injection, where it immediately shows benefits. Make interfaces that are actually going to be implemented by more then one class [..] What I see too often in “true OO” is that advanced techniques are used to solve really simple problems in an overly complex way [..]

Causes of over-engineering:

Accidental vs essential complexity

Interesting article, some really good points :

Software should behave predictably and accomplish its goals without too many surprises (that is, outages in production). The number of surprises directly correlates with the amount of unnecessary complexity found in a project. It’s therefore crucial to think about accidental complexity and essential complexity:

Accidental complexity relates to problems which engineers create and can fix, [whereas] essential complexity is caused by the problem to be solved, and nothing can remove it
— Fred Brooks in his seminal “No Silver Bullet” essay

Another interesting point on when a coder starts to write unnecessary code:

Boredom is good precursor to over-engineered code. I’ll admit, when I got my first job, I felt so underutilized. I was just bored. And when I got bored, I wrote code. Not just any code — CATHEDRALS OF CODE.

No seriously, I had a mental picture of my code and abstractions as large towers with golden jutting spires, flying buttresses of glassy onyx, a wonderful vault supporting by arched domes topped with beautiful geometrical tracery, etc etc etc.

It was really fascinating to see the patterns working together for myself, but in retrospect, I am completely ashamed of the ungodly mess I left behind.

If you’re writing your own frameworks and DSLs code to while away the less stimulating hours at work, just stop. Time is better spent reading Wards Wiki, or writing an open source book, or you may just want to ask management for more work.

Some theory on why/when over-engineering happens

  • Second-system effect: you wrote a good, simple, performant first version of a product and it is getting traction, so you decide to write a 2.0 version of it which does everything.
  • Feature creep: you just keep adding features that are used by less than 1% of the user base/use cases.
  • Inner-platform effect: you are making your system so customizable that it becomes a replica of the features provided by the software components or development environment you are using.

Techniques to avoid over-engineering

Beware of complicated solutions (that someone was paid to find) @ntaleb

I'm still working on this point: here are some references to interesting concepts.

The KISS principle

YAGNI: You Aren't Gonna Need It

DRY : Don’t repeat yourself

Simplicity is a prerequisite for reliability

The “Fix it or rewrite it from scratch” dilemma

I'm sure you have found yourself in the situation of having to fix a piece of code that:

  • was not written by you
  • does not have a clear set of requirements, or the requirements are clear but the code is unnecessarily complex (at first glance it looks like you have encountered the dreaded "big ball of mud")

My first question here is where the break-even point between fixing and rewriting lies. I tend to prefer rewriting code if:

  • besides needing fixes there are also performance problems
  • code needs to be fixed/changed often
  • code is a company asset
  • code complexity is exposed to customers
  • coders turnover is high

Remember that :

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
— Kernighan and Plauger in “The Elements of Programming Style”

SmartOS is like having vagrant/docker/virtualbox built into your OS

I recently had a chance to try out SmartOS, after my last experience porting an application to it (in 2012), and I was impressed by the virtualization features made available directly by the OS. One set of commands allows downloading images and running either VM-like or namespace-like containers. Zones also allow you to run Debian/CentOS applications on SmartOS inside an lx zone with system-call translation… Awesome.

If you're interested in digging deeper, take a look at this post from Tim Boudreau.