Not a Python Programmer

If someone asks you if you’re a Python (or Java, or JavaScript, or language X) programmer, how do you respond? This year, my answer will be “No. I’m not a Python programmer, I’m a programmer.”

Last year, I wrote production code in:

  • Python
  • Ruby
  • JavaScript
  • Java
  • Objective-C
  • C
  • PowerPoint (only joking!)

I narrowly avoided writing code in Erlang, and considered writing some getting-to-know-you Golang, but alas writing all the other code ate up the available time.

As one of my Comp Sci professors said many years ago, learning any language takes a day or two provided you’re familiar with its style (functional, imperative, OO and so on). What takes the time is learning the libraries and the idiom. However, I’ve noticed that this takes way less time than it did back in the day, thanks to Google, GitHub and Stack Overflow.

Now, if you have a “there must be a better way” feeling when working with a new language, it’s the work of moments to find it and learn the relevant idiom. As an aside: one of the great disappointments of last year was realising how shockingly bad the idioms for writing network code on Android are.

So I’m a programmer first and foremost, one who uses whichever tool makes sense while trying to cultivate a constant sense of laziness and curiosity: laziness so I look for the easy way, and curiosity so I find it.

Let’s see which languages make the list in 2014…

OS X Mavericks: Fixing broken Android dev environments

OS X Mavericks has also broken my Android development environment by removing ant.

The fix is simple: download and install ant from http://ant.apache.org/bindownload.cgi

While I was at it, I also updated to the latest Android SDK from developer.android.com. Not necessary, but why not while fixing things.

OS X Mavericks: Fixing broken Python development environments

Upgrading to OS X Mavericks (10.9) broke my Python development environment by deleting the contents of /Library/Python/2.7/site-packages and removing X11. Why an OS upgrade should delete things I put there is another question, but here’s how to get back up and running.

My Mac Python setup is reasonably standard, I think: a system-wide install of some basic tools (virtualenv, pip, easy_install), and then virtualenvs for all development.

After upgrading, existing virtualenvs were fine – the problem started when I wanted to create a fresh one. Suddenly virtualenv wasn’t there, and once virtualenv had been restored, pip found it impossible to build several of the libraries I depend on (including Pillow, lxml and gevent – basically anything with a C library dependency).

Step 1: Get virtualenv back

cd /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/command
python easy_install.py virtualenv

Step 2: Create a virtualenv and try installing the modules you need

virtualenv <path>
pip install -r <your requirements file>

If everything installs fine, then you're done! If, like me, you get errors, here are some of the things that are probably missing:

Step 3: Reinstall X11 (required for Pillow / PIL)

Download from XQuartz.org

Step 4: Reinstall Xcode command line tools

Install the latest Xcode (currently 5.0.1), then install the command line tools:

xcode-select --install

Step 5: Repair Homebrew

If you're using Homebrew for access to various open source packages - and I strongly recommend you do - then you may need to repair things.

brew update
brew upgrade
brew doctor

and follow its recommendations.

After all this, you should find you're back where you were before Mavericks, and able to build and install Python modules. Happy hacking!

Why Tellybug moved from Cassandra to Amazon DynamoDB

Tellybug has recently moved our NoSQL database from Cassandra to Amazon Dynamo. In this post, I’ll explain why, how and what the results have been.

In the beginning was Cassandra

At Tellybug, we’d been using Cassandra since early 2011, back in the 0.7 days. Our usage of Cassandra was somewhat atypical: we didn’t need its ability to store vast quantities of data, but rather its ability to write fast. The Tellybug platform powers the apps for big shows such as X Factor and Britain’s Got Talent, so when a show is on, the system is under incredible load. We have tens of thousands of writes per second to perform, and Cassandra offered the ability to scale to meet that need.

The other unusual aspect of the Tellybug platform is the swings in load between when a show is on and when there is no show being broadcast. Load spikes of over 100x are not uncommon, which means that we have to scale up and down, because running a system at 100x the capacity you need is rarely economic.

Cassandra worked well for us initially, but over time we ran into various problems:

  • Reliability
  • Counters
  • Operations

In the end it was reliability that triggered our move, but we were already feeling the pain from counters and operations.

We never managed to make Cassandra counters perform well. We needed a large number of increments to a few counters, and to be able to read the current count. Scaling the throughput on an individual counter is hard, and there’s a direct write/read tradeoff (if you need > 1 node to handle the count, then reads get slow as they have to ask all the nodes involved). No matter what we tried, counters behaved unpredictably, lost count (when an error is returned, the client doesn’t know if the increment has succeeded or not, making for a world of lose) and frequently caused long latency or load spikes across the cluster.

Operationally, the ability to scale Cassandra up improved over the years, until with the latest virtual nodes code scale-up was relatively easy. However, scale-down remained a slow, manual and error-prone operation. Cassandra’s streaming of data to/from nodes joining or leaving the ring has (or certainly had) a bunch of failure modes that can leave things in a mess (see here for some of them), and we experienced both repairable failures and data loss during decommission operations, requiring restores from backups. When the cluster was 4-8 nodes, that was annoying but manageable. When we were trying to go from 32 to 4 to 32 nodes, it wasn’t.

Where things go very wrong

The final straw was a reliability issue. For reasons we were never adequately able to diagnose, our production cluster got into a state where any attempt to read data caused cluster-wide death (100% CPU load on all nodes). Note that this was on all nodes – exactly the sort of thing a distributed system is supposed not to suffer from. Worse, this was during the run of a major TV show, so we had only 5 days to figure out what was going wrong and fix it. Neither we nor Acunu (whose distribution we were using) were able to figure out what had happened. Since a write-only DB isn’t very useful, this triggered a crash move to an alternative.

(I’m not criticising Acunu here – they worked very hard to try to figure out what had gone wrong, but we didn’t have the time to get to the bottom of it.)

Why Dynamo?

We’d been looking at Dynamo since the beta trials stage, and always liked the idea of fast scalability both up and down, with zero operational burden on us. However, the early versions had been missing some features we felt were essential, so we hadn’t moved.

Luckily, by the time our Cassandra system failed, Dynamo had everything we needed. We considered installing a different NoSQL system, but for us the zero operational overhead of Dynamo, coupled with its rapid scalability, was a killer feature. Now all we had to do was move there – in 5 days.

Moving to Dynamo

The first challenge in moving to Dynamo was getting a working development environment and set of tools. Unlike Cassandra, there was no way to run Dynamo locally for development. However, there are various mock implementations of Dynamo – we settled on ddbmock, and AWS have just announced a local version of DynamoDB. To access Dynamo or ddbmock, we used the excellent boto library, which provides a Python API to Dynamo.

With boto and ddbmock, we could start porting the code over from Cassandra to Dynamo. This proved remarkably simple, since both are key-value stores and support a very similar set of operations.
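
As a rough illustration of what that looks like, here is a minimal sketch using boto’s original (v1) DynamoDB layer, written from memory rather than taken from our codebase; the table and attribute names are made up for the example:

import boto

# Credentials come from the environment or the boto config file.
conn = boto.connect_dynamodb()

# Hypothetical table with a hash key of 'user_id'.
table = conn.get_table('users')

# Writing an item is roughly equivalent to writing a Cassandra row.
item = table.new_item(hash_key='user-123',
                      attrs={'name': 'Alice', 'score': 10})
item.put()

# Reading it back is a simple key lookup.
fetched = table.get_item(hash_key='user-123')
print(fetched['name'])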

We migrated one Cassandra column family at a time, updating all the code that talked to that CF and also writing migrations to move the data from Cassandra to Dynamo. The first priority was to move counters out of Cassandra. It took us almost exactly 5 days from the first commit to replacing all our counters with a Dynamo implementation, restoring the system’s ability to function for the weekend’s show. 
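
For the curious, an increment against Dynamo looks roughly like the sketch below (again boto’s v1 DynamoDB layer from memory, with made-up table and attribute names). The atomic add happens server-side via UpdateItem, which is what makes the counters dependable:

import boto

conn = boto.connect_dynamodb()
counters = conn.get_table('counters')  # hypothetical table, hash key 'counter_id'

# Queue an ADD update and save it - Dynamo applies the increment
# atomically, creating the attribute (or item) if it doesn't exist yet.
item = counters.new_item(hash_key='poll-42-votes', attrs={})
item.add_attribute('count', 1)
item.save()

# Reading the current value is a plain key lookup.
print(counters.get_item(hash_key='poll-42-votes')['count'])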

More column families followed each week, and within a month everything had migrated to Dynamo. On the way, we’d also been able to clean up some schemas and kill off some old CFs that were no longer in use.

One of the hardest parts was migrating existing data from Cassandra while the system was live, as we couldn’t afford downtime to do it. The worst area for this was user data, where a user can turn up at any point and their records have to be available.

The pattern we adopted was to change the application code to check Dynamo first for the data and, if it wasn’t found, fall back to Cassandra. Anything found in Cassandra was then re-saved to Dynamo. This online migration was supplemented by a batch process that chewed through the Cassandra tables, copying across any data that was still missing.
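
In outline, the read path looked something like this. The helper functions are hypothetical stand-ins for our data-access layer rather than real APIs, but the shape is the point: check Dynamo, fall back to Cassandra, and write back to Dynamo on a hit so that record never needs migrating again.

def get_user(user_id):
    # Hypothetical helpers wrapping our Dynamo and Cassandra clients.
    record = dynamo_get_user(user_id)
    if record is not None:
        return record

    record = cassandra_get_user(user_id)
    if record is None:
        return None  # genuinely unknown user

    # Lazily migrate: next time, Dynamo answers directly.
    dynamo_put_user(user_id, record)
    return record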

Critical to this was making the batch process restartable and tolerant of failures when reading from or writing to either system. For example, when migrating counts, doing the migration twice results in double counting, so the migration script had to record whether each counter had been migrated or not.
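
Sketched with the same kind of hypothetical helpers, the counter migration looked roughly like this: check a per-counter “migrated” marker before copying, so that re-running the batch after a failure doesn’t add the old count a second time.

def migrate_counter(counter_id):
    # Skip counters already copied - this is what makes re-runs safe.
    if dynamo_counter_is_migrated(counter_id):
        return

    old_value = cassandra_read_counter(counter_id)

    # Add the historical count on top of any increments that have already
    # landed in Dynamo, then record that this counter has been migrated.
    # (A crash between these two calls is the awkward window; a conditional
    # write on the marker is one way to close it.)
    dynamo_increment_counter(counter_id, old_value)
    dynamo_mark_counter_migrated(counter_id)

def migrate_all_counters(counter_ids):
    for counter_id in counter_ids:
        try:
            migrate_counter(counter_id)
        except Exception as exc:
            # Tolerate per-counter failures; the whole batch can be re-run.
            print('failed to migrate %s: %s' % (counter_id, exc))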

Once the migration had completed, a further update to the application code removed the fallback path to Cassandra, and at that point the original CF could be deleted.

Wins

The new Dynamo system has several wins over the old Cassandra one:

  • Counters that actually count. As mentioned above, Cassandra counters were a never-ending source of pain for us, whereas Dynamo counters are reliably low-latency and support as many increments as we have ever thrown at them.
  • Predictable, low latency. In use under loads as high as 30K operations/s, Dynamo’s latency is low (typically sub 10ms) and stable.
  • Graceful handling of overload. With Cassandra, if the system ever tried to execute too many operations/s, the cluster performance would degrade with long latency spikes and eventual client timeouts. With Dynamo, if we exceed the provisioned throughput we get a predictable, quick ThroughputExceeded error – and no other requests are affected. This is critical for a heavily loaded site – with thousands of requests/s, even brief latency spikes cause queues to build rapidly.
  • Scale up and down is super simple. Once we’d written a script to get round Dynamo’s restriction of only doubling throughput in a single request (there’s a sketch of the approach after this list), scale-up became an effortless single command. What was previously a multi-step, multi-hour process to scale up a Cassandra cluster is now a one-line command and a wait of 10 minutes or so. Scale-down is even better – essentially instantaneous.
    Oh, and while the provisioned throughput is changing, Dynamo remains responsive and low latency – unlike Cassandra.
  • Set operations. The ability to add or remove an item from a set proved useful for a few of our schemas, replacing more complex code.
  • Zero operational cost. Once we’d set up some weekly backup jobs, we have had to do exactly nothing to manage our database. No monitoring disk space, no checking CPU/memory usage, no replacing failed nodes, no running nodetool repair.
  • Cost savings. Running Dynamo with our load patterns represents a substantial saving over running an equivalent Cassandra cluster on EC2.
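
Here’s a sketch of the scale-up script mentioned above. At the time, a single UpdateTable call could at most double a table’s provisioned throughput, so the script simply loops, doubling until it reaches the target. It’s written against boto’s v1 DynamoDB layer from memory (the exact refresh/status handling may need adjusting), and the table name and capacities are just examples:

import time
import boto

def scale_table_up(table_name, target_read, target_write):
    conn = boto.connect_dynamodb()
    table = conn.get_table(table_name)

    while table.read_units < target_read or table.write_units < target_write:
        # A single UpdateTable call may at most double the current capacity.
        new_read = max(table.read_units, min(target_read, table.read_units * 2))
        new_write = max(table.write_units, min(target_write, table.write_units * 2))
        table.update_throughput(new_read, new_write)

        # Wait for the update to finish before asking for more.
        table.refresh()
        while table.status != 'ACTIVE':
            time.sleep(10)
            table.refresh()

scale_table_up('votes', 4000, 10000)  # example table and capacities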

Losses

Everything has its downsides, and there are a couple of things from Cassandra that I miss:

  • Ability to run a local install of our NoSQL store. This has delayed our adoption of some newer Dynamo features, because ddbmock does not yet support the v2 Dynamo API. If we really needed those features we’d add the support to ddbmock ourselves – but the need has yet to be strong enough. Now that Amazon have released a local version of Dynamo, this limitation should be gone.
  • Unlimited data size for a specific key. Cassandra can store an essentially unlimited amount of data per key (OK, the limit is the disk space on a node), whereas Dynamo limits an item to 64KB. That makes some schemas a bit more tricky, or requires some thought about how to handle overflow. But we haven’t yet hit a case where we couldn’t work around the limit – and in most places we’re nowhere near it.

Final score

We’re very happy with the results of our move to Dynamo. We’ve gone from a situation of spending considerable time and energy managing and worrying about our NoSQL DB to barely having to think about it. Which means more time for developing the features that our users actually care about, and less time spent wrangling clusters.

Dynamo has been rock solid for us since we migrated, and the support from the AWS team is first class. We’ve scaled up and down hundreds of times without issue, and have never been unable to get the capacity we want, when we want it.

There are still features I’d love to see in Dynamo:

  • Scale-up by arbitrary amounts via the web UI, rather than the current maximum of 2x per request. We’ve scripted around this, but it feels like something that could be provided by Dynamo directly.
  • Timed scaling. We know when a show’s going to be on, so I’d love to be able to say “at 7pm on Saturday, scale this table to 10K write operations”. Integrating this with OpsWorks would be even better.
  • Load-based scaling. Scaling throughput on a table based on consumed usage, with graceful scale-up if throughput is exceeded and scale-down if we’re paying for too much capacity. Again, integrating this with OpsWorks would be lovely – so that Dynamo resources can be managed like EC2 ones.
  • More useful metric graphs in the AWS console. The current ones aren’t granular enough to show the spikes that cause ThroughputExceeded errors – so you can have a graph that looks fine while the system is seeing errors. Plus the latency graphs are frequently completely empty.

Luckily it’s Christmas soon, so maybe the AWS Santa will deliver…

Making Django runserver more awesome

I’ve recently found Dave Cramer’s awesome devserver and it already feels like an essential part of my Django toolkit. It’s a drop-in replacement for the default Django development server

./manage.py runserver

But with a whole host of seriously useful additions if you care about what's going on behind the scenes:

  • Log of SQL queries issued
  • Cache hits/misses
  • Werkzeug debugger
  • Memory usage

And loads more. A picture's worth a thousand words, so here goes:


[sql] SELECT ...
 FROM "polls_numericpoll"
 INNER JOIN "polls_event" ON ("polls_numericpoll"."event_id" = "polls_event"."id")
 WHERE ("polls_event"."slug" = doi
 AND "polls_numericpoll"."enabled" = TRUE
 AND "polls_event"."enabled" = TRUE)
 ORDER BY "polls_event"."start_time" DESC, "polls_numericpoll"."order" DESC LIMIT 1
[sql] SELECT ...
 FROM "polls_numericpoll"
 INNER JOIN "polls_event" ON ("polls_numericpoll"."event_id" = "polls_event"."id")
 WHERE "polls_numericpoll"."id" = 2011343691
[sql] 2 queries with 0 duplicates
[cache] (3ms) 32 calls made with a 83% hit percentage (5 misses)
[05/Mar/2012 17:08:27] "GET /api/1.0/show/doi/state HTTP/1.1" 200 1179 (time: 0.15s; sql: 0ms (2q))

[cache] (1ms) 7 calls made with a 83% hit percentage (1 misses)
[05/Mar/2012 17:09:21] "GET /api/1.0/show/doi/state HTTP/1.1" 200 1179 (time: 0.01s; sql: 0ms (0q))

The [sql] lines show how the DB was hit on the first pass, and the second request shows that the caching is working perfectly – everything is then served from cache. Having tracked down several scaling issues where something was secretly making database calls, this is just great for quickly identifying what's going on. The best bit? Installation is trivially simple:

  1. Install via pip
    pip install git+git://github.com/dcramer/django-devserver#egg=django-devserver
  2. Add to INSTALLED_APPS
    INSTALLED_APPS = ('devserver', ...)

After that, there's a bunch of useful extensions to make things even better - head over to the project page for all the details.
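
For example, the output above comes from devserver's pluggable modules, which you choose via a DEVSERVER_MODULES setting in settings.py. The module paths below are from my memory of the README rather than checked against the current code, so treat them as an assumption and confirm the exact names on the project page:

DEVSERVER_MODULES = (
    'devserver.modules.sql.SQLRealTimeModule',       # log each SQL query as it runs
    'devserver.modules.sql.SQLSummaryModule',        # per-request query counts
    'devserver.modules.profile.ProfileSummaryModule',
    'devserver.modules.cache.CacheSummaryModule',    # the [cache] hit/miss lines
)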

Sniffing iOS and Android HTTP traffic

Sometimes when you’re debugging a problem with a remote server, nothing short of seeing the actual bits on the wire will do. Recently I needed to do this to confirm whether it was my app or the server that was causing a bug.

In the past I’ve used HTTPScoop and Wireshark to debug this sort of problem, but recently I’ve discovered a much better option for anything that’s passing over HTTP – which these days is most things.

My new go-to tool is Charles. This is a great cross-platform (Mac, Windows, Linux) logging HTTP proxy with all sorts of nice features – including SSL proxying, so you can see what’s going to/from Facebook or other web services over https!

Here’s a step-by-step guide to getting Charles working with your iOS or Android phone, including the SSL proxying.

Raspberry Pi – but not till May

When the Raspberry Pi was announced, I went and registered with both Farnell & RS for one. Alas I wasn’t one of the first 10,000, so it was waiting list time.

On Thursday I got the mail saying Farnell would now take my order. Yay!

Order placed, I eagerly awaited the delivery notification. Would it be a week? Maybe two? Try nearer eight – it’s due to arrive on 30th April 2012. I hope it’s worth the wait.