Dec 20, 2013

Write Only Fields in Django REST Framework

For some odd reason, Django REST framework doesn't include any support for write-only fields. There's support for read-only fields, but not write-only. An example use case is an API call that changes the user's password: you want to accept the current password for verification, but you never want to echo it back in a response. Searching online turns up a lot of half-baked solutions built around some hacky "delete it here, add it there" kind of patching.

Unfortunately, the side effects include these fields no longer appearing on the HTML REST interface, among other rather silly issues. Here's my solution:

from django.contrib.auth import authenticate
from rest_framework import serializers

class ProfileSerializer(serializers.Serializer):
    class Meta:
        write_only_fields = ('current_password', 'password')

    email = serializers.CharField(required=False)
    password = serializers.CharField(required=False)
    current_password = serializers.CharField()

    def to_native(self, obj):
        ret = self._dict_class()
        ret.fields = self._dict_class()
        for field_name, field in self.fields.items():
            if field.read_only and obj is None:
                continue
            elif field_name in getattr(self.Meta, 'write_only_fields', ()):
                # Write-only: never serialize the value off the object. Echo
                # back the submitted value (if any) so the field still renders
                # in the browsable API form.
                key = self.get_field_key(field_name)
                value = self.init_data.get(key, None) if self.init_data else None
                if value:
                    ret[key] = value
                ret.fields[key] = self.augment_field(field, field_name, key, value)
            else:
                # Everything else follows the stock DRF 2.x serialization path.
                field.initialize(parent=self, field_name=field_name)
                key = self.get_field_key(field_name)
                value = field.field_to_native(obj, field_name)
                method = getattr(self, 'transform_%s' % field_name, None)
                if callable(method):
                    value = method(obj, value)
                ret[key] = value
                ret.fields[key] = self.augment_field(field, field_name, key, value)
        return ret

    def restore_object(self, attrs, instance=None):
        # Strip the write-only fields before they reach the model instance.
        attrs = dict((k, v) for (k, v) in attrs.items()
                     if k not in getattr(self.Meta, 'write_only_fields', ()))
        return super(ProfileSerializer, self).restore_object(attrs, instance)

    def validate_current_password(self, attrs, source):
        if self.object is None:
            return attrs
        # Verify the submitted current password against the existing user.
        u = authenticate(username=self.object.email, password=attrs[source])
        if u is not None:
            return attrs
        raise serializers.ValidationError('OBJECTION!')

The addition of the above to_native and restore_object methods (which you can copy/paste) lets you add a write_only_fields property to the Meta class listing the fields you want short-circuited: they're accepted on input, but never serialized back out. If you want to use this in multiple classes, you can extend it into a general serializer as follows:

class ReceptiveSerializerOptions(serializers.SerializerOptions):
    def __init__(self, meta):
        super(ReceptiveSerializerOptions, self).__init__(meta)
        self.write_only_fields = getattr(meta, 'write_only_fields', ())

class ReceptiveSerializer(serializers.Serializer):
    _options_class = ReceptiveSerializerOptions

    def to_native(self, obj):
        ret = self._dict_class()
        ret.fields = self._dict_class()
        for field_name, field in self.fields.items():
            if field.read_only and obj is None:
                continue
            elif field_name in self.opts.write_only_fields:
                # Write-only: never serialize the value off the object. Echo
                # back the submitted value (if any) so the field still renders
                # in the browsable API form.
                key = self.get_field_key(field_name)
                value = self.init_data.get(key, None) if self.init_data else None
                if value:
                    ret[key] = value
                ret.fields[key] = self.augment_field(field, field_name, key, value)
            else:
                # Everything else follows the stock DRF 2.x serialization path.
                field.initialize(parent=self, field_name=field_name)
                key = self.get_field_key(field_name)
                value = field.field_to_native(obj, field_name)
                method = getattr(self, 'transform_%s' % field_name, None)
                if callable(method):
                    value = method(obj, value)
                ret[key] = value
                ret.fields[key] = self.augment_field(field, field_name, key, value)
        return ret

    def restore_object(self, attrs, instance=None):
        # Strip the write-only fields before they reach the model instance.
        attrs = dict((k, v) for (k, v) in attrs.items()
                     if k not in self.opts.write_only_fields)
        return super(ReceptiveSerializer, self).restore_object(attrs, instance)

Have it extend serializers.Serializer or serializers.ModelSerializer, whichever floats your boat. I've made a pull request for this update too. :D
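
With this in place, the password-change serializer from the top of the post collapses down to its fields and its validator:

class ProfileSerializer(ReceptiveSerializer):
    class Meta:
        # Accepted on input, never echoed back in responses.
        write_only_fields = ('current_password', 'password')

    email = serializers.CharField(required=False)
    password = serializers.CharField(required=False)
    current_password = serializers.CharField()

    def validate_current_password(self, attrs, source):
        if self.object is None:
            return attrs
        u = authenticate(username=self.object.email, password=attrs[source])
        if u is not None:
            return attrs
        raise serializers.ValidationError('OBJECTION!')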

Nov 16, 2013

The Presto Hotdogger

Tired of cooking your hot dogs the plain old boring way? Fear not! You can electrocute them! The most interesting thing about that article, at least to me, was its multiple mentions of the Presto Hotdogger.

It was difficult to find information on this, but I did manage to buy one off eBay (you can get one too!). The company that made these, Presto, is still in business, and they still make a wide variety of kitchen cookware and appliances, even though the Hotdogger has been discontinued.

Born in 1905 in Eau Claire, Wisconsin, Presto actually started out as the Northwestern Steel & Iron Works company. At the time, they manufactured cement mixers, marine engines, farm engines, and a number of other products, and among these was a steam pressure cooker developed in 1908 for the canning industry. In 1910, the USDA reported that using these cookers was a good way to prevent botulism, and as a result, they became quite the hot commodity. The appliance portion of the company forked into the National Pressure Cooker Company in 1917, and in 1939, it became Presto. If you'd like to read more about Presto's history, they have a wonderful history page on their website.

Fast forward to 1960: Presto develops the Presto Hotdogger, which basically takes electricity directly from your wall outlet and pumps it into a hot dog! The appliance itself can cook up to six hot dogs simultaneously, and it actually does not have a power switch; instead, to turn it off, you just unplug it. On the bright side, it cooks your hot dogs in 60 seconds, and it does indeed cook them quite well. Do they taste good? Well, that's a different question. Over the next ten years, these Hotdoggers continue selling, but at some point, Presto stops producing them. I couldn't find any exact evidence for why they stopped, but I'd guess it had something to do with the consumer countertop microwave oven, which hit the market in 1967.

Of course, there could be a variety of other reasons the Hotdogger stopped selling and no future iterations were produced, but the microwave seems to be a much more versatile option for cooking food, including hot dogs, and it produces a cooked product that tastes just as mediocre. In fact, conceptually, the Hotdogger isn't that far from a microwave. It uses the hot dog as a resistor between the two electrodes, so it's effectively heating the water inside the hot dog; a microwave heats that same water, just by different means. But even though a microwave produces the same product, there's something cool about electrocuting hot dogs that I can't quite pinpoint.
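
For a rough sense of scale, here's a back-of-the-envelope estimate of the power involved (the resistance figure is an assumed ballpark, not something measured):

# Back-of-the-envelope: power dissipated in one hot dog on US mains.
# The 300 ohm resistance is an assumed ballpark, not a measured value.
volts = 120.0              # US wall outlet, RMS
ohms = 300.0               # assumed resistance of a wet hot dog
watts = volts ** 2 / ohms  # P = V^2 / R
print(watts)               # ~48 W per dog under these assumptions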

Oct 13, 2013

Python Dictionary Speed Hacks

The key idea behind PyExcelerate is that we want to go fast. We don't care if it's a micro-optimization or a macro-optimization; we want to squeeze every bit of performance out of Python that we can without making the code an unmaintainable mess. One of the most heavily used sections of code is the alignment of Excel's styles. Each cell needs its own editable style, yet on compilation the styles need to be compressed so we don't end up with a million cells that look the same but use different styles. To check whether a style already exists, we use a dict that maps each style to its corresponding style id. That way, if we encounter the same style later, we can just add a reference to the existing style instead of creating a new one.
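
The deduplication itself is the classic dict-as-intern-table pattern; roughly (a sketch of the idea, not PyExcelerate's actual code):

# Sketch of the style-compression idea (not PyExcelerate's actual code).
# Styles must be hashable; equal styles end up sharing one id.
_style_ids = {}

def style_id(style):
    if style not in _style_ids:          # hashes the style, may call __eq__
        _style_ids[style] = len(_style_ids)
    return _style_ids[style]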

It turns out that this operation is pretty slow. Profiling the execution of PyExcelerate, we find that somewhere between 40% and 50% of the execution time is actually spent doing this compression (and yes, we tried just turning compression off; that turns out to be slower, since a lot more references need to be built). So what can we do to optimize this?

         6266945 function calls (5983321 primitive calls) in 19.340 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.351    1.351   19.342   19.342 profile.py:1(<module>)
   199970    1.207    0.000    3.454    0.000 Style.py:78(__eq__)
480089/200067    1.052    0.000    3.071    0.000 {hash}
        2    0.949    0.474    8.281    4.140 Workbook.py:45(_align_styles)
   301027    0.876    0.000    1.399    0.000 Utility.py:26(lazy_get)
   179986    0.714    0.000    1.253    0.000 Font.py:62(__eq__)
   151000    0.713    0.000    2.021    0.000 Style.py:42(font)
   151017    0.561    0.000    0.561    0.000 Font.py:6(__init__)
   399940    0.554    0.000    0.554    0.000 Style.py:104(_to_tuple)

Looking at the profile, it seems like _align_styles spends a good chunk of its time hashing, so let's see if we can speed that up. Now, obviously we can't rewrite Python's hashing function to make it faster; in most cases, the built-in hashing ends up being faster than whatever we can come up with for __hash__. But there is one neat trick for Python dictionaries that we can exploit without misidentifying equivalent styles.

The way Python dictionaries work is that on lookup, the dictionary first checks whether the key's hash matches the hash of a stored entry, and only if it does, checks for equality to make sure it isn't a collision. In most cases, though, the hash function won't produce a collision, so equality is never checked. But hashing is slow, and we want to speed it up. What can we do? Hash less, and offload some of the work to the equality check! Consider the Font class before:

class Font(object):
    def __init__(self, bold=False, italic=False, underline=False, strikethrough=False, family="Calibri", size=11, color=None):
        self.bold = bold
        self.italic = italic
        self.underline = underline
        self.strikethrough = strikethrough
        self.family = family
        self.size = size
        self._color = color

    def __hash__(self):
        return hash(self._to_tuple())

    def __eq__(self, other):
        return self._to_tuple() == other._to_tuple()

    def _to_tuple(self):
        return (self.bold, self.italic, self.underline, self.strikethrough, self.family, self.size, self._color)

For anyone about to point out that I can use self.__dict__.keys() instead: we tried, it's too slow ;)
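
If you want to watch the lookup order yourself, here's a tiny instrumented sketch (a toy class, not part of PyExcelerate):

# A toy key class that counts __hash__ and __eq__ calls during dict lookups.
class Key(object):
    calls = {'hash': 0, 'eq': 0}

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        Key.calls['hash'] += 1
        return hash(self.value)

    def __eq__(self, other):
        Key.calls['eq'] += 1
        return self.value == other.value

d = {Key('a'): 1}   # __hash__ runs on insert
d[Key('a')]         # __hash__ runs; the hashes match, so __eq__ runs once
d.get(Key('b'))     # __hash__ runs; no stored hash matches, __eq__ is skipped
print(Key.calls)    # {'hash': 3, 'eq': 1}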

Right now, every call to __hash__ considers all of the attributes to produce the hash, and __eq__ is rarely called because hash collisions are rare. Now, consider this optimized code:

class Font(object):
    def __init__(self, bold=False, italic=False, underline=False, strikethrough=False, family="Calibri", size=11, color=None):
        self.bold = bold
        self.italic = italic
        self.underline = underline
        self.strikethrough = strikethrough
        self.family = family
        self.size = size
        self._color = color

    def __eq__(self, other):
        # Compare only the attributes NOT covered by __hash__; inside a dict,
        # this only runs after the hashes have already matched.
        return (self.family, self.size, self._color) == (other.family, other.size, other._color)

    def __hash__(self):
        # Hash only half of the attributes; cheaper than hashing all seven.
        return hash((self.bold, self.italic, self.underline, self.strikethrough))

Now, __hash__ only hashes half of the attributes! As a result, we end up cutting quite a bit of time out of hashing overall. Let's compare the execution times:

[Plot: execution times before and after the hashing change]

What's going on here is that we're only hashing half of the attributes, and most of the time, that's good enough to determine whether two styles are equal. If two styles happen to have the same bold, italic, underline, and strikethrough values and only differ on something else, then we fall back to checking equality. But because hashing is now roughly twice as fast and we only lost some granularity, we end up with a very noticeable improvement, with style compression now only taking about 20-30% of the execution time.

         5465301 function calls (5166618 primitive calls) in 17.860 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.506    1.506   17.863   17.863 profile.py:1(<module>)
492089/200067    1.150    0.000    3.316    0.000 {built-in method hash}
        2    0.966    0.483    5.479    2.739 Workbook.py:45(_align_styles)
   309027    0.950    0.000    1.497    0.000 Utility.py:26(lazy_get)
   150000    0.785    0.000    1.993    0.000 Style.py:42(font)
   100000    0.695    0.000    0.934    0.000 Range.py:190(coordinate_to_string)
   100000    0.510    0.000    1.436    0.000 Worksheet.py:97(set_cell_style)
   200015    0.504    0.000    3.820    0.000 Style.py:75(__hash__)
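
Here's the equality fallback from above in action, using the optimized Font class:

# Two fonts whose hashed attributes match but whose families differ:
a = Font(bold=True)
b = Font(bold=True, family='Arial')
assert hash(a) == hash(b)   # the hashed half matches...
assert not (a == b)         # ...so a dict falls back to __eq__, which differs
styles = {a: 0, b: 1}       # both keys coexist; no styles get merged by mistake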

Optimal Ratio

In the above example, we implemented the trick by hashing half of the attributes. What if we tried different ratios? Let \(n\) be the number of attributes, and let \(0 \leq r \leq 1\) be the fraction of attributes we hash, leaving the rest to the equality check. Assume the chance that the hashed attributes alone settle the comparison scales linearly with \(r\). The expected number of attribute checks is then:

$$E(calls) = E(calls|no\:collision) \times P(no\:hash\:collision) + E(calls|collision) \times P(hash\:collision)$$
$$E(calls) = nr \times r + n \times (1-r)$$

Minimizing this function:

$$\frac{d}{dr} (nr^2 + n(1-r)) = 0$$
$$\Rightarrow r = \frac{1}{2}$$

So our ratio of half the attributes is optimal. Applying this to \(E(calls)\), we find that \(E(calls|r=\frac{1}{2})=\frac{3}{4}E(calls|r=1)\), which appears to be fairly consistent with the experimental results in the execution time plot above. Hooray!
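
As a quick numeric sanity check of \(E(calls) = nr^2 + n(1-r)\):

# Evaluate E(calls) = n*r^2 + n*(1 - r) for the seven-attribute Font class.
n = 7.0
for r in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(r, n * r ** 2 + n * (1 - r))
# The minimum lands at r = 0.5, where E = 0.75 * n, matching the 3/4 ratio above.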

__eq__ Violation

By the above construction, we find that the definition of __eq__ is actually violated: two Font objects that differ only in, say, boldness will compare equal, because __eq__ never looks at the hashed attributes. This is unfortunate, because by enforcing the full definition and recalculating the optimal ratio, we get \(r = 1\). Therefore, it's better to reserve this optimization for internal classes, or, in the case of PyExcelerate, for spots where the performance gain can actually be expected.