Sebastian Vettel, 2012 Bahrain Grand Prix


When the database is fast enough

A while back I gave a general talk about how we decided to pull the plug and move Iconfinder to Django. The part of the presentation that got the most attention was how we reduced the page generation time for an average search from 91 to 20 milliseconds without fiddling with our caching layer, database or search engine. By popular demand, I thought it was worth walking you through exactly what we did.

The database fallacy

The world of web development is governed by a host of fallacies. One of the most prominent is the claim that the database is always the bottleneck. In other words, the application itself is never slow enough to actually matter, and whatever slowness is there will be fixed in a couple of years by faster CPUs. The solution to database slowness is equally well known: just cache everything.

Even though 91 milliseconds is pretty damn fast to begin with (the Ferrari FXX spends a whole 9 milliseconds more than that on a gear change), we generally strive to keep heavily loaded parts of the site at or below 50 milliseconds on average, which is Formula 1 car territory. You might argue that 91 milliseconds is “fast enough”, but when you consider that the Django application handles more than 3 million requests every day, and that a majority of those are searches, keeping the response time down is real money. Furthermore, if you want an almost seamless experience, load times including network latency have to be less than half a second. The faster we are at generating a page, the more latency we can tolerate without the site feeling sluggish, no matter where you are in the world. As a user, that’s kind of awesome.

The problem we ran into was that 91 milliseconds was as fast as things could get. By that, I mean that everything was cached and the time spent getting data from the cache was less than 2 milliseconds. In other words, we were spending 89 milliseconds doing real work. Not so much of a bottleneck now, huh, Mr. Database?

Collecting evidence

We use uWSGI to host our Django application. The uWSGI log files have the little nicety that they contain a line per request, including the time the application itself took to respond, which is where the 91 millisecond figure originates. This is probably as accurate as it’s going to get without touching the actual code, but if we are to figure out what is actually going on, there’s no way around doing a bit of work. We need to profile the code.

Now, profilers come in a lot of shapes and forms, ranging from instrumenting or deterministic profilers, which track the time between logical units of execution, to sampling profilers, which interrupt the code execution at a set interval and figure out where we are in order to build a statistical profile. Even though the former is more precise and the latter is often faster, common to all profilers is that they come with a performance penalty; picking a suitable profiler is therefore, like almost any other programming problem, a trade-off. Luckily, as with virtually any other general problem you might face as a Python developer, the Python standard library contains a module to deal with profiling: the aptly named profile module, which, especially through its C extension alternative cProfile, provides a sufficiently fast deterministic profiler that won’t hurt you too much to use. If the slowdown is too big, or you don’t need a profile of every single invocation, you can choose to only profile a sample batch of invocations rather than all of them.
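
To get a feel for it, here is a minimal sketch of what deterministic profiling with cProfile looks like for a single function call; the function and output file name are made up for illustration:

    import cProfile

    def expensive_search(query):
        # Stand-in for the real work; imagine database and search engine calls here.
        return [icon for icon in range(100000) if icon % 7 == 0]

    # Profile one call deterministically and dump the raw stats to a file
    # we can inspect later.
    cProfile.runctx('expensive_search("arrow")', globals(), locals(),
                    filename='search.pstats')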

profile is by far the most commonly used profiling library for Python, and as a consequence you can find a variety of integration modules for different frameworks. Common to these, however, is that they are often injected somewhere down the pipeline of execution within the framework itself, whereas, to get the full picture of what’s going on, it would be kind of nice to profile everything the framework does, just to make sure we’re not missing anything. Thanks to WSGI, this is super easy, as we can just wrap the WSGI application in a function that calls profile.runctx and saves the profiling result in a file. As with anything else, make sure you do these kinds of things in production so you get a realistic picture of what is actually going on; your development setup is not necessarily representative of the real world.
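
A minimal sketch of such a wrapper is shown below. The module path, output directory and sampling rate are all assumptions, and it uses cProfile.Profile.runcall (a close relative of runctx) so the response can still be handed back to the WSGI server:

    import cProfile
    import random
    import time
    import uuid

    from myproject.wsgi import application as django_application  # hypothetical path

    PROFILE_DIR = '/tmp/profiles'  # assumed output directory
    SAMPLE_RATE = 0.01             # profile roughly 1 % of requests

    def application(environ, start_response):
        # Pass most requests straight through to keep the overhead down.
        if random.random() > SAMPLE_RATE:
            return django_application(environ, start_response)

        profiler = cProfile.Profile()
        try:
            # runcall profiles the wrapped call and returns its result, so the
            # server still gets the response iterable back.
            return profiler.runcall(django_application, environ, start_response)
        finally:
            profiler.dump_stats('%s/%d-%s.pstats' % (
                PROFILE_DIR, int(time.time()), uuid.uuid4().hex))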

Digging through the dirt

With our raw profiling data in hand, it’s time to start digging through what’s going on. The pstats module, which accompanies profile, can easily give us a list of all the functions executed in the profile and how much time was spent in them, both in terms of internal time (the time spent actually doing things in the function itself) and cumulative time (the total time spent running the function and any subroutines). We can sort the profile by either to get an idea of what we’re spending our time doing:
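
Something along these lines will produce such a listing from a dumped profile; the file name is just an example:

    import pstats

    # Load the raw profile written by the WSGI wrapper.
    stats = pstats.Stats('/tmp/profiles/example.pstats')
    stats.strip_dirs()

    # The ten most expensive functions by internal time...
    stats.sort_stats('time').print_stats(10)

    # ...and by cumulative time.
    stats.sort_stats('cumulative').print_stats(10)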

What you’ll probably notice is that it’s pretty hard to tell from these truncated results what’s going on, and it doesn’t get much better when you have the list of all 709 functions. The basic problem here is that there is a lot going on, and lists just don’t do it justice; we need to see what causes the cumulative times we’re seeing. Luckily, though, there are a bunch of awesome tools out there to make our data much more understandable. One of the most versatile, and my personal favourite, is Gprof2dot, which can take a profiler result and create a Graphviz dot graph that we can then render to, for example, a PNG image:
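
With gprof2dot and Graphviz installed, the invocation looks roughly like this, assuming the profile was saved as profile.pstats:

    gprof2dot -f pstats profile.pstats | dot -Tpng -o profile.png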

The result is a graph of all functions color coded by the cumulative time spent in the function:

Profiling graph for an average search request

While it may be overwhelming at first, tracing around the graph will lead you to one simple conclusion: 89.46 % of time is spent rendering the page!

Profiling graph node for rendering

In search of a better templating engine

89.46 % is a lot, but it’s not at all surprising to anyone who’s worked with Django for some time. It’s a well-known fact that the templating engine included in Django is dog slow. Once again we’re in luck, though, as a syntactically very similar but even more awesome alternative exists: Jinja2. By compiling templates to Python code just in time, Jinja2 is capable of rendering templates crazy fast.
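
Outside of Django, the Jinja2 API boils down to something like the following sketch; the paths and template names here are made up:

    from jinja2 import Environment, FileSystemLoader

    # The environment compiles each template to Python code the first time it
    # is loaded and caches the result, which is where the speed comes from.
    env = Environment(loader=FileSystemLoader('templates'))

    template = env.get_template('search/results.html')  # hypothetical template
    html = template.render(query='arrow', icons=[])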

There are a few modules out there for integrating Jinja2 into Django, but we decided to go with coffin. An hour later, the results were clear: we had dropped the page generation time from 91 to 29 milliseconds, a 68 % drop. But, while we’re at it, why not have a look at what’s going on now and see if we can possibly find more time? The graph below is our 29-millisecond response:

Profiling graph for an average search request with Jinja2 templating

While everything is looking a lot better and a lot more even now, one thing still stands out:

Profiling graph node for escaping

We’re spending 29.16 % of our time escaping strings. Considering that GitHub recently wrote a whole C library, Houdini, just to do blisteringly fast string escaping, this wasn’t a total surprise either. However, it turns out that coffin tries to emulate the default Django behaviour by escaping everything, which we did not need at all. After switching off auto-escaping and spending a little time checking all our output statements, we saw yet another significant drop in page generation time: an average search now takes only 20 milliseconds.
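
In plain Jinja2 terms, that change amounts to something like the sketch below: auto-escaping is switched off on the environment, and the few values that actually need it are escaped explicitly with the e filter (the template snippet is purely illustrative):

    from jinja2 import Environment

    # With autoescape=False nothing is escaped implicitly, so user-supplied
    # values must be escaped explicitly where needed.
    env = Environment(autoescape=False)

    template = env.from_string('<h1>Results for {{ query|e }}</h1>')
    print(template.render(query='<b>arrow</b>'))
    # -> <h1>Results for &lt;b&gt;arrow&lt;/b&gt;</h1>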

Looking at the final execution graph, we could probably do a little more to optimise the template, but we’re getting close to the edge of what quick fixes can do:

Profiling graph for an average search request with Jinja2 templating and no auto-escaping

Fast code is fast

By any standards, a 77 % reduction in page generation time is a significant achievement. I hope it’s obvious from the above that if you go about things methodically, you don’t even need to restructure any major data components or rewrite the logic of your application to achieve these kinds of improvements.

However, realistically, we haven’t disproven the old database fallacy. If you do not invest time in thinking about data structures, caching and so on, and you wind up doing 30 queries on every page load instead, adding up to a couple of seconds, that 71 millisecond gain is much closer to a 1 % improvement than a 77 % one. The point of it all is that you should optimise in the right places at the right time, but don’t think it stops at the database.

Photo credit: http://commons.wikimedia.org/wiki/File:Sebastian_Vettel_2012_Bahrain.jpg
