Delphi Programming

and software in general.

Monday, August 11, 2008

Optimization - Understand your hardware

Paul (of snippet formatting fame) mentioned that he was comfortable with his coding style, but that he would like to learn more about optimizing his code.

The generally recognized and recommended approach to optimization is that you never optimize code until you really must. The reasoning behind this is that you need actual running code before you can pinpoint the real bottlenecks, and there is no point in spending time optimizing code that runs once in a blue moon, since you will hardly see any significant gain in performance.

The Delphi compiler is very good at optimizing code. Most of the time, you would be hard pressed to write better assembly code than what the compiler produces.

However, we are the ones who lay down the premises for how it will organize and access data. If we tell the compiler to work through the data by rows or by columns, the compiler generally has to do what we say. We decide data size, data organization, and data access. The best kind of optimization we can do is to think hard about how we organize and access our data.

Once upon a time, before computers became a household appliance, around the time when the computing jungle still reeked of digital dino-dung, I decided to pursue a career in designing microprocessors. This was before I realized that it was - relatively speaking - cheaper to make mistakes in software than in hardware.

I learned some useful lessons on accessing data the hardware way, though. Accessing memory is costly! Very costly! You should avoid it if you can! :)

The ridiculously simple example is that if you store your data in rows, it will potentially be very costly to access it column by column. If you need to process large amounts of data in memory (array / matrix / SSD type operations), the size and alignment of your data matter more than you think.
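To make the row versus column point concrete, here is a minimal sketch (not taken from the demo project). It assumes a flat dynamic array used as an N x N matrix in row-major order; the names `TMatrix`, `SumByRows`, `SumByColumns` and the constants `N` and `Passes` are made up for illustration. Timing uses `Now`, so the resolution is coarse, and the actual numbers will depend entirely on your machine.

```pascal
program RowVsColumn;

{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  N = 2048;    // hypothetical matrix dimension: N x N doubles = 32 MB
  Passes = 5;  // repeat each traversal to smooth out OS noise

type
  // Flat storage: element (Row, Col) lives at index Row * N + Col.
  TMatrix = array of Double;

function SumByRows(const M: TMatrix): Double;
var
  Row, Col: Integer;
begin
  Result := 0;
  for Row := 0 to N - 1 do
    for Col := 0 to N - 1 do
      Result := Result + M[Row * N + Col];  // walks consecutive addresses
end;

function SumByColumns(const M: TMatrix): Double;
var
  Row, Col: Integer;
begin
  Result := 0;
  for Col := 0 to N - 1 do
    for Row := 0 to N - 1 do
      Result := Result + M[Row * N + Col];  // jumps N * SizeOf(Double) bytes per step
end;

var
  M: TMatrix;
  I, Pass: Integer;
  Start: TDateTime;
  RowSum, ColSum: Double;
begin
  SetLength(M, N * N);
  for I := 0 to High(M) do
    M[I] := 1.0;

  RowSum := 0;
  ColSum := 0;

  Start := Now;
  for Pass := 1 to Passes do
    RowSum := SumByRows(M);
  Writeln('By rows:    ', (Now - Start) * MSecsPerDay:0:0, ' ms');

  Start := Now;
  for Pass := 1 to Passes do
    ColSum := SumByColumns(M);
  Writeln('By columns: ', (Now - Start) * MSecsPerDay:0:0, ' ms');

  // Both traversals visit exactly the same elements, so the sums must agree.
  Writeln('Sums equal: ', RowSum = ColSum);
end.
```

Both functions do the same arithmetic on the same data; only the order in which memory is touched differs, so any timing gap between them comes from the access pattern rather than from the calculation.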

Here is a small project that you can play around with to see the effects of data alignment. The demo allocates 10,000 items of 512 bytes each as one block, and adds an offset (0 .. 25) from the original block start to each item address. Each item gets a key value from a Random double; the Random seed is reset for each offset iteration. Each fill and sort is repeated 5 times to reduce random interference from the OS etc. The "read and write" of the data for sorting doesn't actually move all the data, just the key (Double) plus a simulated move of the block size into a local buffer. A full-blown sort with data moves would add a few more moves, which would add more boundary overhead.

This is the average fill+sort time (raw TDateTime value). Notice the difference at 8-byte boundaries.
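If you just want the core idea of the setup described above without the full project, the following is a simplified sketch and not the demo's actual code. It keeps the item count, item size, offset range and repeat count from the description, but replaces the sort with plain read passes over the keys. `KeyAt`, `Scans` and the seed value 42 are assumptions introduced here for illustration. Also note that the offsets are relative to whatever alignment `GetMem` happens to return, so offset 0 is only as well aligned as your memory manager makes it.

```pascal
program AlignmentSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  ItemCount = 10000;  // from the description above
  ItemSize  = 512;    // bytes per item, from the description above
  MaxOffset = 25;     // offsets 0 .. 25, from the description above
  Repeats   = 5;      // fill + scan repetitions per offset
  Scans     = 100;    // hypothetical: extra read passes so the time is measurable

// Address of the key (a Double) stored at the start of item Index.
// PAnsiChar is used because it supports byte-wise pointer arithmetic.
function KeyAt(Base: PAnsiChar; Offset, Index: Integer): PDouble;
begin
  Result := PDouble(Base + Offset + Index * ItemSize);
end;

var
  Block: Pointer;
  Base: PAnsiChar;
  Offset, Rep, Scan, I: Integer;
  Start: TDateTime;
  Total: Double;
begin
  // One block for all items, plus slack so the largest offset stays in bounds.
  GetMem(Block, ItemCount * ItemSize + MaxOffset);
  try
    Base := PAnsiChar(Block);
    for Offset := 0 to MaxOffset do
    begin
      Start := Now;
      Total := 0;
      for Rep := 1 to Repeats do
      begin
        RandSeed := 42;  // same key sequence for every offset
        for I := 0 to ItemCount - 1 do
          KeyAt(Base, Offset, I)^ := Random;
        for Scan := 1 to Scans do
          for I := 0 to ItemCount - 1 do
            Total := Total + KeyAt(Base, Offset, I)^;
      end;
      // The checksum is printed so the reads cannot be optimized away.
      Writeln('Offset ', Offset:2, ': ',
        (Now - Start) * MSecsPerDay / Repeats:8:2, ' ms  (checksum ',
        Total:0:3, ')');
    end;
  finally
    FreeMem(Block);
  end;
end.
```

Because the seed is reset before every fill, each offset works on the same key values and the checksum should be identical on every line; only the address of each key changes. How large the timing differences are, if any, depends on the CPU and its cache line handling.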

Well-known C++ guru Herb Sutter gave a presentation, "Machine Architecture: Things Your Programming Language Never Told You", for the North West C++ Users Group. He conveys this knowledge in a far more entertaining and far more effective way than I can. BTW, he also has several interesting articles on concurrency.

To quote the notes on the video: High-level languages insulate the programmer from the machine. That’s a wonderful thing -- except when it obscures the answers to the fundamental questions of “What does the program do?” and “How much does it cost?” The C++/C#/Java programmer is less insulated than most, and still we find that programmers are consistently surprised at what simple code actually does and how expensive it can be -- not because of any complexity of a language, but because of being unaware of the complexity of the machine on which the program actually runs. This talk examines the “real meanings” and “true costs” of the code we write and run especially on commodity and server systems, by delving into the performance effects of bandwidth vs. latency limitations, the ever-deepening memory hierarchy, the changing costs arising from the hardware concurrency explosion, memory model effects all the way from the compiler to the CPU to the chipset to the cache, and more -- and what you can do about them.

Herb Sutter is an excellent speaker, so find a quiet spot, pull out the popcorn, and set aside 1 hour and 56 minutes for a very entertaining and very enlightening session on the true cost of accessing memory.

Video link: