Tellybug has recently moved our NoSQL database from Cassandra to Amazon Dynamo. In this post, I’ll explain why, how and what the results have been.
In the beginning was Cassandra
At Tellybug, we’d been using Cassandra since early 2011, back in the 0.7 days. Our usage of Cassandra was somewhat atypical, as we didn’t need its ability to store vast quantities of data, but its ability to write fast. The Tellybug platform powers the apps for big shows such as X Factor and Britain’s Got Talent, so when the show is on the system is under incredible loads. We have tens of thousands of writes per second to perform, and Cassandra offered the ability to scale to meet that need.
The other unusual aspect of the Tellybug platform is the swings in load between when a show is on and when there is no show being broadcast. Load spikes of over 100x are not uncommon, which means that we have to scale up and down, because running a system at 100x the capacity you need is rarely economic.
Cassandra worked well for us initially, but over time we ran into various problems:
In the end it was reliability that triggered our move, but we were already feeling the pain from counters and operations.
We never managed to make Cassandra counters perform well. We needed a large number of increments to a few counters, and to be able to read the current count. Scaling the throughput on an individual counter is hard, and there’s a direct write/read tradeoff (if you need > 1 node to handle the count, then reads get slow as they have to ask all the nodes involved). No matter what we tried, counters behaved unpredictably, lost count (when an error is returned, the client doesn’t know if the increment has succeeded or not, making for a world of lose) and frequently caused long latency or load spikes across the cluster.
Operationally, the ability to scale Cassandra up improved over the years, until with the latest virtual nodes code scale up was relatively easy. However, scale down remained a slow, manual and error prone operation. Cassandra’s streaming of data to/from nodes joining or leaving the ring has (or certainly had) a bunch of failure modes that can leave things in a mess (see here for some of them), and we experienced both repairable failures and data loss during decommission operations, requiring restores from backups. When the cluster was 4-8 nodes that was annoying but manageable. When we were trying to go from 32 to 4 to 32 nodes, it wasn’t.
Where things go very wrong
The final straw was a reliability issue. For reasons that we were never adequately able to diagnose, our production cluster got into a state where any attempt to read data caused a cluster death (100% CPU load on all nodes). Note this was on all nodes – exactly the sort of thing that a distributed system is supposed to not suffer from. Worse, this was during run of a major TV show, so we had only 5 days to figure out what was going wrong and fix. Neither we nor Acunu (whose distribution we were using) were able to figure out what had happened. Since a write only DB isn’t very useful, this triggered a crash move to an alternative.
(I’m not criticising Acunu here – they worked very hard to try to figure out what had gone wrong, but we didn’t have the time to get to the bottom of it)
We’d been looking at Dynamo since the beta trials stage, and always liked the idea of fast scalability both up and down, with zero operational burden on us. However, the early versions had been missing some features we felt were essential, so we hadn’t moved.
Luckily by the time our Cassandra system failed, Dynamo had everything we needed. We considered installing a different NoSQL system, but for us the zero operational overhead with Dynamo, coupled with the rapid scalability was a killer feature. Now all we had to do was move there – in 5 days.
Moving to Dynamo
The first challenge with moving to Dynamo was getting a working development environment and set of tools. Unlike Cassandra, there was no way to run Dynamo locally for development. However, there are various mock implementations of Dynamo – we settled on ddbmock , and AWS have just announced a local version of DynamoDB. To access Dynamo or ddbmock, the excellent boto library provided a Python API to Dynamo.
With boto and ddbmock, we could start porting over the code from Cassandra to Dynamo. This proved to be remarkably simple, since both are Key-Value stores and support a very similar set of operations.
We migrated one Cassandra column family at a time, updating all the code that talked to that CF and also writing migrations to move the data from Cassandra to Dynamo. The first priority was to move counters out of Cassandra. It took us almost exactly 5 days from the first commit to replacing all our counters with a Dynamo implementation, restoring the system’s ability to function for the weekend’s show.
More column families followed each week, and within a month everything had migrated to Dynamo. On the way, we’d also been able to clean up some schemas and kill off some old CFs that were no longer in use.
One of the hardest parts was migrating existing data from Cassandra while the system was live, as we couldn’t afford downtime to migrate the data. The worst area for this is user data, where a user can turn up at any point and their records have to be available.
A pattern we adopted was to change the application code to check Dynamo first for the data, and if not found fall back to Cassandra. If it finds it in Cassandra, then it is re-saved to Dynamo. This online migration was supplemented by a batch process that chewed through the Cassandra tables, copying any data that was missing.
Critical to this is making the batch process restartable and tolerant to failures to read or write to either system. For example, when migrating counts, doing the migration twice results in double counting, so the migration script has to record whether each counter had been migrated or not.
Once the migration had completed, a further update to the application code removes the fallback path to Cassandra, and at that point the original CF can be deleted.
The new Dynamo system has several wins over the old Cassandra one:
- Counters that actually count. As mentioned above, Cassandra counters were a never ending source of pain for us, whereas Dynamo counters are reliably low-latency and support as many increments as we have ever tried to throw at them
- Predictable, low latency. In use under loads as high as 30K operations/s, Dynamo’s latency is low (typically sub 10ms) and stable.
- Graceful handling of overload. With Cassandra, if the system ever tried to execute too many operations/s, the cluster performance would degrade with long latency spikes and eventual client timeouts. With Dynamo, if we exceed the provisioned throughput we get a predictable, quick ThroughputExceeded error – and no other requests are affected. This is critical for a heavily loaded site – with thousands of requests/s, even brief latency spikes cause queues to build rapidly.
- Scale up and down is super simple. Once we’d written a script to get round Dynamo’s restriction on only doubling throughput in a single request, scale up became an effortless single command. What was previously a multi-step, multi-hour process to scale up a Cass cluster is now a 1 line command and wait for 10 minutes or so. Scale down is even better – essentially instantaneous.
Oh, and while the provisioned throughput changes Dynamo remains responsive and low latency – unlike Cassandra.
- Set operations. The ability to add / remove an item from a set proved useful for a few schemas we had, replacing more complex code
- Zero operational cost. Once we’d setup some weekly backup jobs, we have had to do exactly nothing to manage our database. No monitoring disk space, no checking for CPU/memory usage, no replacing failed nodes, no running nodetool repair.
- Cost savings. Running Dynamo with our load patterns represents a substantial saving over running an equivalent Cassandra cluster on EC2.
Everything has its downsides, and there are a couple of things from Cassandra that I miss:
- Ability to run a local install of our NoSQL store. This has delayed our adoption of some newer Dynamo features because ddbmock does not yet support the v2 Dynamo API. However, if we really needed the features we’d add the support to ddbmock – but the need has yet to be strong enough. Now that Amazon have released a local version of Dynamo, this limitation should be gone.
- Unlimited data size for a specific key. Cassandra can support essentially unlimited amount of data (OK, the limit is the disk space on a node) per key, whereas Dynamo limits it to 64KB. That makes some schemas a bit more tricky, or requires some thought about how to handle overflow. But we haven’t run into anywhere where we couldn’t work around the limit – and most places we’re never even close.
We’re very happy with the results of our move to Dynamo. We’ve gone from a situation of spending considerable time and energy managing and worrying about our NoSQL DB to barely having to think about it. Which means more time for developing the features that our users actually care about, and less time spent wrangling clusters.
Dynamo has been rock solid for us since we migrated, and the support from the AWS team is first class. We’ve scaled up and down hundreds of times without issue, and never not been able to get the capacity we want, when we want it.
There’s still features I’d love to see in Dynamo:
- Scale-up by arbitrary amounts via the web UI, rather than current maximum of 2x. We’ve scripted round this, but it feels like something that could be provided by Dynamo directly
- Timed scaling. We know when a show’s going to be on, so I’d love to be able to say “at 7pm on Saturday, scale this table to 10K write operations”. Integrating this with Opsworks would be even better.
- Load based scaling. Scaling throughput on a table based on consumed usage, with graceful scale up if throughput is exceeded, and scale down if we’re paying for too much capacity. Again, integrating this with Opsworks would be lovely – so that Dynamo resource can be managed like EC2 ones.
- More useful metric graphs in the AWS console. The current ones aren’t granular enough to show up spikes causing ThroughputExceeded errors – so you can have a graph that looks fine, but the system is seeing errors. Plus the latency graphs are frequently completely empty
Luckily it’s Christmas soon, so maybe the AWS Santa will deliver…