source link: https://www.tuicool.com/articles/hit/FZfi6va
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
The product I was working on has a concept of an
. On a given
there may be one to many
to which users belong. There was a RESTful API that allowed querying the various settings/configuration of
objects of the form
would return all features of team 21. There was also a form of the endpoint that would return all
teams and their respective settings:
In a nutshell, on some accounts with many teams, querying for
was slow. Querying for
was exceptionally slow —
for an account with 500ish teams, it would easily take ten to fifteen seconds to return a response.
My initial gut was: someone probably stuck a database query in a loop or something silly like that, and I’d fix it with a simple change to make a single database call. To confirm this hypothesis, I did a quick profile.
This code was in python, so
was a quick way to get some timings against a production account. A quick
cProfile.run('features = _inner_fetch_all_features("team-*", account_id)')
gave me some results. And the results were very, very unexpected.
The first profile returned in 11.935 seconds. The first thing that leapt out was this code was spending an extraordinary amount of time in some class called
ncalls tottime percall cumtime percall filename:lineno(function)
11825 0.049 0.000 0.060 0.000 attrdict.py:26(__getattr__) 384912/432 1.762 0.000 7.012 0.016 attrdict.py:39(deepcopy) 454032/64800 0.650 0.000 6.943 0.000 attrdict.py:49(<genexpr>) 23650 0.023 0.000 0.023 0.000 attrdict.py:58(__setstate__)
… roughly 59% of the entire time spent
. Whoa. Did not expect. The other item that was fishy involved a library called
46859 0.178 0.000 0.347 0.000 pickle.py:1033(load_dict) 3 0.000 0.000 0.000 0.000 pickle.py:1093(load_global) 3 0.000 0.000 0.000 0.000 pickle.py:1128(find_class) 11825 0.032 0.000 0.088 0.000 pickle.py:1135(load_reduce) 225226 0.497 0.000 0.797 0.000 pickle.py:1156(load_get) 106928 0.180 0.000 0.258 0.000 pickle.py:1170(load_put) 175038 0.393 0.000 0.626 0.000 pickle.py:1199(load_setitem) 11825 0.091 0.000 0.271 0.000 pickle.py:1217(load_build) 58684 0.081 0.000 0.117 0.000 pickle.py:1256(load_mark) 1 0.000 0.000 0.000 0.000 pickle.py:1260(load_stop) 1 0.011 0.011 4.788 4.788 pickle.py:1386(loads) 1 0.000 0.000 0.000 0.000 pickle.py:83(__init__) 1 0.000 0.000 0.000 0.000 pickle.py:835(__init__) 1 1.183 1.183 4.777 4.777 pickle.py:851(load) 58684 0.094 0.000 0.132 0.000 pickle.py:876(marker) 88873 0.230 0.000 0.348 0.000 pickle.py:917(load_int) 2 0.000 0.000 0.000 0.000 pickle.py:967(load_string) 36414 0.116 0.000 0.168 0.000 pickle.py:985(load_unicode) 11825 0.027 0.000 0.056 0.000 pickle.py:999(load_tuple)
The remaining 40% was spent in
, which was some library for serializing/deserializing python objects.
I was a bit stunned at this point; not a single slow spot involved database code, which meant this was purely some algorithmic problem in python. And for some reason, it was slow when there were more teams .
Those Still Waters Run Deep
I decided the first thing I’d go after was
, since that was the slow spot. This also turned out to be a poor assumption, because I didn’t fully understand what this code was doing and why, but we’ll get to that shortly.
looked like so (I’ve removed fluff):
Dict that allows attribute access
def __getattr__(self, name):
return super(AttrDict, self).__getattribute__(name)
def __setattr__(self, name, value):
self[name] = value
def deepcopy(cls, other):
if isinstance(other, basestring):
elif isinstance(other, collections.Sequence):
assert isinstance(other, list)
return [AttrDict.deepcopy(item) for item in other]
elif isinstance(other, collections.Mapping):
assert isinstance(other, (AttrDict, dict))
for key, value in other.iteritems()
was basically a python
, you could write
. This code was likely ripped off from stack overflow somewhere, since whoever wrote this wasn’t the first person to desire this functionality
The second part was
, the offending code. At first glance, it took some object called
and would recursively create a deep copy of that. One interesting side-effect of this was supplied
objects would be turned into
The other interesting part was: this code was using
, and abusively.
is pricy, because it needs to check an object’s inheritance hierarchy to see if the supplied object — or some child class — is of the supplied type.
There was also some pointless asserting (footnote: our asserts run in production. None of these had fired as far back as I had log history for, so they are worthless), and a nonsensical if/else structure.
So win. I can rewrite this to improve performance and that may help. I changed to:
@classmethod def deepcopy(cls, other): other_type = type(other) if other_type in [dict, AttrDict]: return AttrDict( (key, AttrDict.deepcopy(value)) for key, value in list(other.items()) ) elif other_type is list: return [AttrDict.deepcopy(item) for item in other] else: return other
…there was no need to check for string types because they followed the same path as the else path:
And rather than use
each time, use
once and check for specific types I knew were supported. This change was not without risk: it meant I could no longer properly handle inherited objects like the code previously did. To mitigate this, I confirmed this updated code delivered the same results in production.
After configuring prod to re-run this code so I could get a clean profile, the bulk of
now consumed roughly two seconds. Not great
, but acceptable. Given this was less than the other offending chunk of code, I decided to move on to
This Must Be The Pickle
First, I found where we were using pickle. There was a chunk of code that would fetch account features, that cached them to avoid regenerating:
def _all_features(account_id): cache_key = _get_cache_key(account_id) account_features = cache.get(cache_key) if account_features: try: account_features = AttrDict(pickle.loads(zlib.decompress(account_features))) except (...): ... else: return account_features account_features = _split_features(_query_features(account_id)) cache.set(cache_key, zlib.compress(pickle.dumps(account_features))) return account_features
A basic overview:
- Try to fetch features out of a cache.
- If they exist, decompress with zlib, hydrate them with pickle, and return those.
- If they don’t exist, re-generate them, dehydrate with pickle, compress with zlib, save to the cache and return them.
After reading this code, I had more questions than answers. A few thoughts/ideas that went through my head:
The whole reason we were using
picklewas definitely because of
AttrDict. At some point, someone had decided we needed to cache this stuff. But because we were using
AttrDict, we needed a way to serialize/deserialize python objects.
- The use of zlib was odd. after consulting a coworker, I learned this was because a cache object had a 1 MB limit, and for accounts with many teams, we had exceeded that limit. So the zlib was added as a quick stop-gap measure.
The second part kinda floored me; what the heck was pickle persisting that would bump into a 1 MB limit? Also, if all we were ultimately storing was dict/list objects, what did it matter if we used
? We could serialize with json.
I did some quick terminal hacking to see how JSON would compare:
In : json_compressed = zlib.compress(json.dumps(account_features))
In : pickle_compressed = zlib.compress(pickle.dumps(account_features))
In : len(pickle_compressed) Out: 642794
In : len(json_compressed) Out: 50585
…good lord. The compressed payload with
is twelve times the size
. But it gets worse:
In : start = time.clock()
...: json_compressed = zlib.compress(json.dumps(account_features))
...: print time.clock() - start
In : start = time.clock()
...: pickle_compressed = zlib.compress(pickle.dumps(account_features))
...: print time.clock() - start
…working with the json payload was exceptionally faster — roughly 35x for hydrating/dehydrating the payload. Swapping from
alone would solve this problem.
But one major snag: when taking an
, dumping to json, compressing, decompressing and then loading with json, we would lose
entirely — it’d come back as
. And updating vast swaths of our codebase that used
was out of the question. I’d never make it through QA in a reasonable amount of time, nor would I be able to have confidence in such a massive change.
And then I remembered
. I could do this:
account_features = AttrDict.deepcopy(json.loads(zlib.decompress(account_features)))
…when rehydrating, and that would automatically convert
. Problem solved. And adding one additional
operation was irrelevant, even though that was the previous offending code. We were already doing hundreds of them (see profile above) and this codepath was called only once.
Edit: a redditor
pointed out rather than using
, it would be preferable to use
json.loads( zlib.decompress(account_features), object_hook=AttrDict)
Except Not Solved :/
At this point, it was lightning fast for accounts with small team counts, and 2–3 seconds for accounts with 500ish teams. So I created an account with a thousand teams to get a sense of performance.
I was right back at ten seconds for
It was this moment that I put it together; my problem with this endpoint wasn’t bad python code, although that was a contributing factor. It was that I had a big-O problem
. If 500 took me 2.5 seconds, and a 1000 took me 10 seconds, that mapped almost perfectly to
. The underlying algorithmic problem still existed, and it grew with the addition of teams.
Unfortunately, at this point I had to acknowledge one other frustrating reality of engineering. My improvements were already good enough to get this story through QA. We didn’t have any accounts with more than 500 teams. I asked our QA people to create a new defect for 1000 teams to track the issue moving forward, and moved on.
As much as I wanted to figure it out and fix it, perfect is the enemy of good enough . Adios, amigo.
Git History, A Few Take-Aways, A Few Random Musings
I did, however, dig through the git history to figure out what the heck the deal was with
It was written five years prior by a junior developer. The intent was definitely just convenience — the ability to do
some_attr_dict['blah']. The bad if/else logic and nonsensical asserting also made more sense given a junior developer. At this time, however, the codebase was small and there probably was little consequence to
A few days later, a more senior developer added the ability to
picklethis object so we could start caching it. The caching was added to address performance issues. It’s unclear what those performance issues were; there was no clear history on “why” this was done.
- At some point, when the cache exploded as we gained accounts with more and more teams, some other developer added the compression to the caching. I can’t fault this decision; it was an emergency fix when stuff started exploding in production.
My takeaways from all of this:
I think adding a wrapper around basic python types like
listshould be avoided. That decision had serious consequences down the line.
- It is unfortunate the senior developer didn’t push the junior developer towards simplicity for the sake of simplicity. Although, that brings up:
It is unfortunate our code review process did not exist at the time
AttrDictwas written. The original code is the sort of stuff I would have asked be rewritten or justified.
It is unfortunate the performance issues that warranted
pickleweren’t clearly communicated in git history, or some linked defect in an issue tracker. It makes me understand why having clear guidelines around commit messages is so vitally important for large, long-term codebases.
- This is an interesting study in the long-term consequences of a programming choice. It is clear to me this is one of those places where “clever” code had serious consequences and the business would have been better off without it.
- It is clear to me as a senior developer, it is important to mentor junior developers towards the simplest thing that could possibly work .
And some other random, final musings:
It turns out there are heaps of language SNOOTS
out there who have some pretty involved discussions
isinstance; Bjarne Stroustrup even argued against including such a feature in C++, which is kinda mind-blowing since that language can’t seem to find a bell or whistle that shouldn’t be included.
pepguidelines implicitly push people towards
isinstance, and that is arguably not a great thing.
isinstanceisn’t particularly performant, and it absolutely is not the same as
type. In digging through git history, I found a merge request where a senior developer went through our codebase mindlessly replacing
type, and string handling suck due to python 2 -> python 3 conversion issues. For a while I was banging my head on
deepcopystring handling until I realized I didn’t need to do it at all.
Aggregate valuable and interesting links.
Joyk means Joy of geeK