Oct 13, 2013

Python Dictionary Speed Hacks

The key idea behind PyExcelerate is that we want to go fast. We don't care if it's a micro-optimization or a macro-optimization; we want to squeeze every bit of performance out of Python without making the code an unmaintainable mess. One of the hottest sections of code is the alignment of Excel's styles. Each cell needs the ability to edit its own style, yet on compilation the styles need to be compressed so we don't end up with a million identical-looking cells each referencing a different style. To check whether a style already exists, we use a dict to map each style to its corresponding style id. That way, if we encounter the same style later, we can just add a reference to the existing style instead of creating a new one.
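As a rough sketch of that pattern (hypothetical names, not PyExcelerate's actual internals), the compression step boils down to a dict lookup:

styles = {}  # maps each style to its compressed style id

def get_style_id(style):
    # reuse the id of an equivalent style if we've seen one before,
    # otherwise register this style under the next free id
    if style not in styles:
        styles[style] = len(styles)
    return styles[style]

For this to work, styles have to be hashable, which is exactly where __hash__ and __eq__ come in.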

Turns out that this operation is pretty slow. Profiling the execution of PyExcelerate, we find that somewhere between 40-50% of the execution time is actually spent doing this compression. (We tried simply turning compression off; it turns out to be slower, since far more references need to be built.) So what can we do to optimize this?

         6266945 function calls (5983321 primitive calls) in 19.340 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.351    1.351   19.342   19.342 profile.py:1(<module>)
   199970    1.207    0.000    3.454    0.000 Style.py:78(__eq__)
480089/200067    1.052    0.000    3.071    0.000 {hash}
        2    0.949    0.474    8.281    4.140 Workbook.py:45(_align_styles)
   301027    0.876    0.000    1.399    0.000 Utility.py:26(lazy_get)
   179986    0.714    0.000    1.253    0.000 Font.py:62(__eq__)
   151000    0.713    0.000    2.021    0.000 Style.py:42(font)
   151017    0.561    0.000    0.561    0.000 Font.py:6(__init__)
   399940    0.554    0.000    0.554    0.000 Style.py:104(_to_tuple)

Looking at the profile, it seems like _align_styles spends a good chunk of its time hashing, so let's see if we can speed that up. Now, obviously we can't rewrite Python's hashing function to make it faster. In most cases, the built-in hashing of a tuple ends up being faster than any custom scheme we could cook up in __hash__, but there is one neat trick for Python dictionaries that we can exploit without misidentifying equivalent styles.

The way Python dictionaries work is that the dictionary first checks whether the key's hash matches an existing entry, and only if it does, checks equality to make sure it isn't a collision. In most cases, though, the hash function won't produce a collision, so equality is never checked. But hashing is slow, and we want to speed it up. What can we do? Hash less, and offload some of the work to the equality check! (A small instrumented demo below shows this lookup order in action.) Consider the Font class before:

class Font(object):
    def __init__(self, bold=False, italic=False, underline=False, strikethrough=False, family="Calibri", size=11, color=None):
        self.bold = bold
        self.italic = italic
        self.underline = underline
        self.strikethrough = strikethrough
        self.family = family
        self.size = size
        self._color = color

    def __hash__(self):
        # hash every attribute
        return hash(self._to_tuple())

    def __eq__(self, other):
        # compare every attribute
        return self._to_tuple() == other._to_tuple()

    def _to_tuple(self):
        return (self.bold, self.italic, self.underline, self.strikethrough, self.family, self.size, self._color)

For anyone about to point out that I could use self.__dict__.values() instead: we tried, it's too slow ;)
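To see the hash-then-equality lookup order in action, here's a small instrumented sketch (a toy example, not from PyExcelerate) that prints whenever a dict calls __hash__ or __eq__:

class Probe(object):
    def __init__(self, key):
        self.key = key

    def __hash__(self):
        print("__hash__ called")
        return hash(self.key)

    def __eq__(self, other):
        print("__eq__ called")
        return self.key == other.key

d = {Probe("a"): 1}
Probe("b") in d  # __hash__ called; the hashes differ, so __eq__ never runs
Probe("a") in d  # __hash__ called; the hashes match, so __eq__ runs too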

Right now, on each call of __hash__, all of the attributes are being considered to produce a hash, and __eq__ is rarely called because hash collisions are rare. Now, consider this optimized code:

class Font(object):
    def __init__(self, bold=False, italic=False, underline=False, strikethrough=False, family="Calibri", size=11, color=None):
        self.bold = bold
        self.italic = italic
        self.underline = underline
        self.strikethrough = strikethrough
        self.family = family
        self.size = size
        self._color = color

    def __eq__(self, other):
        # compare only the attributes that __hash__ skips; inside a dict,
        # equality is only checked once the hashes already match
        return (self.family, self.size, self._color) == (other.family, other.size, other._color)

    def __hash__(self):
        # hash only the four boolean attributes
        return hash((self.bold, self.italic, self.underline, self.strikethrough))

Now, __hash__ only hashes half as many attributes! As a result, we cut quite a bit of time out of the hashing overall. Let's compare the execution times:

[Figure: Execution Times plot comparing the original and optimized implementations]
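If you want to sanity-check the hashing cost in isolation, here's a stand-alone micro-benchmark sketch (illustrative only; absolute numbers will vary by machine and interpreter) comparing a tuple of all seven attributes against just the four booleans:

import timeit

setup = 'bold = italic = underline = strikethrough = False\n' \
        'family, size, color = "Calibri", 11, None'

# build and hash a tuple of all seven attributes, like the original __hash__
full = timeit.timeit(
    'hash((bold, italic, underline, strikethrough, family, size, color))',
    setup=setup, number=1000000)

# build and hash only the four booleans, like the optimized __hash__
half = timeit.timeit(
    'hash((bold, italic, underline, strikethrough))',
    setup=setup, number=1000000)

print('full: %.3fs  half: %.3fs' % (full, half))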

What's going on here is that we're only hashing half of the attributes, and most of the time that's good enough to determine whether two styles are equal. If two styles happen to have the same bold, italic, underline, and strikethrough values, we fall back to checking equality on the rest. But because hashing is now roughly twice as fast and we only lost some granularity, we end up with a very noticeable improvement: style compression now takes only about 20-30% of the execution time.

         5465301 function calls (5166618 primitive calls) in 17.860 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.506    1.506   17.863   17.863 profile.py:1(<module>)
492089/200067    1.150    0.000    3.316    0.000 {built-in method hash}
        2    0.966    0.483    5.479    2.739 Workbook.py:45(_align_styles)
   309027    0.950    0.000    1.497    0.000 Utility.py:26(lazy_get)
   150000    0.785    0.000    1.993    0.000 Style.py:42(font)
   100000    0.695    0.000    0.934    0.000 Range.py:190(coordinate_to_string)
   100000    0.510    0.000    1.436    0.000 Worksheet.py:97(set_cell_style)
   200015    0.504    0.000    3.820    0.000 Style.py:75(__hash__)

Optimal Ratio

In the above example, we implemented the trick by hashing half of the attributes. What if we tried different ratios? Let \(n\) be the number of attributes and let \(0 \leq r \leq 1\) be the fraction we hash. As a simplifying model, assume the probability of a hash collision is \(1 - r\), so hashing more attributes makes collisions proportionally less likely. With no collision we only pay for hashing, \(nr\) attribute checks; with a collision we pay for the hash plus the equality check, \(n\) checks in total. The expected number of attribute checks is:

$$E(calls) = E(calls \mid \text{no collision}) \times P(\text{no collision}) + E(calls \mid \text{collision}) \times P(\text{collision})$$
$$E(calls) = nr \times r + n \times (1 - r)$$

Minimizing this function:

$$\frac{d}{dr} \left( nr^2 + n(1-r) \right) = 2nr - n = 0$$
$$\Rightarrow r = \frac{1}{2}$$

So our ratio of half the attributes is optimal. Plugging this back into \(E(calls)\), we find \(E(calls|r=\frac{1}{2}) = \frac{3}{4}n = \frac{3}{4}E(calls|r=1)\), which appears to be fairly consistent with the experimental results in the execution time plot above. Hooray!
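As a quick numerical check of the model (plain Python, using \(n = 7\) attributes as in Font):

# evaluate E(calls) = n*r**2 + n*(1 - r) at a few ratios
n = 7.0
for r in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(r, n * r ** 2 + n * (1 - r))
# the minimum falls at r = 0.5, where E(calls) = 0.75 * n = 5.25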

__eq__ Violation

By the above construction, the definition of __eq__ is actually violated: two Font objects that differ only in an attribute covered by __hash__ (say, bold) will compare equal even though the styles are not the same. This is unfortunate, because enforcing the definition (comparing all \(n\) attributes in __eq__) and recalculating the optimal ratio gives \(r = 1\). Therefore, it's better to keep this optimization to internal classes, or, as in PyExcelerate, to apply it only where the performance gain can be expected.
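To make the violation concrete, using the optimized Font class from above:

a = Font(bold=True)
b = Font(bold=False)

# the optimized __eq__ only compares family, size, and color, so these
# two fonts compare equal even though the styles differ
print(a == b)              # True

# their hashes differ, though, so a dict still keeps them apart: the
# hashes never match, and __eq__ is never consulted
print(hash(a) == hash(b))  # False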