Tag Archives: cassandra

Why Tellybug moved from Cassandra to Amazon DynamoDB

Tellybug has recently moved our NoSQL database from Cassandra to Amazon Dynamo. In this post, I’ll explain why, how and what the results have been.

In the beginning was Cassandra

At Tellybug, we’d been using Cassandra since early 2011, back in the 0.7 days. Our usage of Cassandra was somewhat atypical, as we didn’t need its ability to store vast quantities of data, but its ability to write fast. The Tellybug platform powers the apps for big shows such as X Factor and Britain’s Got Talent, so when the show is on the system is under incredible loads. We have tens of thousands of writes per second to perform, and Cassandra offered the ability to scale to meet that need.

The other unusual aspect of the Tellybug platform is the swings in load between when a show is on and when there is no show being broadcast. Load spikes of over 100x are not uncommon, which means that we have to scale up and down, because running a system at 100x the capacity you need is rarely economic.

Cassandra worked well for us initially, but over time we ran into various problems:

  • Reliability
  • Counters
  • Operations

In the end it was reliability that triggered our move, but we were already feeling the pain from counters and operations.

We never managed to make Cassandra counters perform well. We needed a large number of increments to a few counters, and to be able to read the current count. Scaling the throughput on an individual counter is hard, and there’s a direct write/read tradeoff (if you need > 1 node to handle the count, then reads get slow as they have to ask all the nodes involved). No matter what we tried, counters behaved unpredictably, lost count (when an error is returned, the client doesn’t know if the increment has succeeded or not, making for a world of lose) and frequently caused long latency or load spikes across the cluster.

Operationally, the ability to scale Cassandra up improved over the years, until with the latest virtual nodes code scale up was relatively easy. However, scale down remained a slow, manual and error prone operation.  Cassandra’s streaming of data to/from nodes joining or leaving the ring has (or certainly had) a bunch of failure modes that can leave things in a mess (see here for some of them), and we experienced both repairable failures and data loss during decommission operations, requiring restores from backups.  When the cluster was 4-8 nodes that was annoying but manageable. When we were trying to go from 32 to 4 to 32 nodes, it wasn’t.

Where things go very wrong

The final straw was a reliability issue. For reasons that we were never adequately able to diagnose, our production cluster got into a state where any attempt to read data caused a cluster death (100% CPU load on all nodes).  Note this was on all nodes – exactly the sort of thing that a distributed system is supposed to not suffer from. Worse, this was during run of a major TV show, so we had only 5 days to figure out what was going wrong and fix. Neither we nor Acunu (whose distribution we were using) were able to figure out what had happened. Since a write only DB isn’t very useful, this triggered a crash move to an alternative.

(I’m not criticising Acunu here – they worked very hard to try to figure out what had gone wrong, but we didn’t have the time to get to the bottom of it)

Why Dynamo?

We’d been looking at Dynamo since the beta trials stage, and always liked the idea of fast scalability both up and down, with zero operational burden on us. However, the early versions had been missing some features we felt were essential, so we hadn’t moved.

Luckily by the time our Cassandra system failed, Dynamo had everything we needed. We considered installing a different NoSQL system, but for us the zero operational overhead with Dynamo, coupled with the rapid scalability was a killer feature. Now all we had to do was move there – in 5 days.

Moving to Dynamo

The first challenge with moving to Dynamo was getting a working development environment and set of tools. Unlike Cassandra, there was no way to run Dynamo locally for development. However, there are various mock implementations of Dynamo – we settled on ddbmock , and AWS have just announced a local version of DynamoDB. To access Dynamo or ddbmock, the excellent boto library provided a Python API to Dynamo.

With boto and ddbmock, we could start porting over the code from Cassandra to Dynamo. This proved to be remarkably simple, since both are Key-Value stores and support a very similar set of operations.

We migrated one Cassandra column family at a time, updating all the code that talked to that CF and also writing migrations to move the data from Cassandra to Dynamo. The first priority was to move counters out of Cassandra. It took us almost exactly 5 days from the first commit to replacing all our counters with a Dynamo implementation, restoring the system’s ability to function for the weekend’s show. 

More column families followed each week, and within a month everything had migrated to Dynamo. On the way, we’d also been able to clean up some schemas and kill off some old CFs that were no longer in use.

One of the hardest parts was migrating existing data from Cassandra while the system was live, as we couldn’t afford downtime to migrate the data. The worst area for this is user data, where a user can turn up at any point and their records have to be available.

A pattern we adopted was to change the application code to check Dynamo first for the data, and if not found fall back to Cassandra. If it finds it in Cassandra, then it is re-saved to Dynamo. This online migration was supplemented by a batch process that chewed through the Cassandra tables, copying any data that was missing.

Critical to this is making the batch process restartable and tolerant to failures to read or write to either system. For example, when migrating counts, doing the migration twice results in double counting, so the migration script has to record whether each counter had been migrated or not.

Once the migration had completed, a further update to the application code removes the fallback path to Cassandra, and at that point the original CF can be deleted.


The new Dynamo system has several wins over the old Cassandra one:

  • Counters that actually count. As mentioned above, Cassandra counters were a never ending source of pain for us, whereas Dynamo counters are reliably low-latency and support as many increments as we have ever tried to throw at them
  • Predictable, low latency. In use under loads as high as 30K operations/s, Dynamo’s latency is low (typically sub 10ms) and stable.
  • Graceful handling of overload. With Cassandra, if the system ever tried to execute too many operations/s, the cluster performance would degrade with long latency spikes and eventual client timeouts. With Dynamo, if we exceed the provisioned throughput we get a predictable, quick ThroughputExceeded error – and no other requests are affected. This is critical for a heavily loaded site – with thousands of requests/s, even brief latency spikes cause queues to build rapidly.
  • Scale up and down is super simple. Once we’d written a script to get round Dynamo’s restriction on only doubling throughput in a single request, scale up became an effortless single command. What was previously a multi-step, multi-hour process to scale up a Cass cluster is now a 1 line command and wait for 10 minutes or so. Scale down is even better – essentially instantaneous.
    Oh, and while the provisioned throughput changes Dynamo remains responsive and low latency – unlike Cassandra.
  • Set operations. The ability to add / remove an item from a set proved useful for a few schemas we had, replacing more complex code
  • Zero operational cost. Once we’d setup some weekly backup jobs, we have had to do exactly nothing to manage our database. No monitoring disk space, no checking for CPU/memory usage, no replacing failed nodes, no running nodetool repair.
  • Cost savings. Running Dynamo with our load patterns represents a substantial saving over running an equivalent Cassandra cluster on EC2.


Everything has its downsides, and there are a couple of things from Cassandra that I miss:

  • Ability to run a local install of our NoSQL store. This has delayed our adoption of some newer Dynamo features because ddbmock does not yet support the v2 Dynamo API. However, if we really needed the features we’d add the support to ddbmock – but the need has yet to be strong enough. Now that Amazon have released a local version of Dynamo, this limitation should be gone.
  • Unlimited data size for a specific key. Cassandra can support essentially unlimited amount of data (OK, the limit is the disk space on a node) per key, whereas Dynamo limits it to 64KB. That makes some schemas a bit more tricky, or requires some thought about how to handle overflow. But we haven’t run into anywhere where we couldn’t work around the limit – and most places we’re never even close.

Final score

We’re very happy with the results of our move to Dynamo. We’ve gone from a situation of spending considerable time and energy managing and worrying about our NoSQL DB to barely having to think about it. Which means more time for developing the features that our users actually care about, and less time spent wrangling clusters.

Dynamo has been rock solid for us since we migrated, and the support from the AWS team is first class. We’ve scaled up and down hundreds of times without issue, and never not been able to get the capacity we want, when we want it.

There’s still features I’d love to see in Dynamo:

  • Scale-up by arbitrary amounts via the web UI, rather than current maximum of 2x. We’ve scripted round this, but it feels like something that could be provided by Dynamo directly
  • Timed scaling. We know when a show’s going to be on, so I’d love to be able to say “at 7pm on Saturday, scale this table to 10K write operations”. Integrating this with Opsworks would be even better.
  • Load based scaling. Scaling throughput on a table based on consumed usage, with graceful scale up if throughput is exceeded, and scale down if we’re paying for too much capacity. Again, integrating this with Opsworks would be lovely – so that Dynamo resource can be managed like EC2 ones.
  • More useful metric graphs in the AWS console. The current ones aren’t granular enough to show up spikes causing ThroughputExceeded errors – so you can have a graph that looks fine, but the system is seeing errors. Plus the latency graphs are frequently completely empty

Luckily it’s Christmas soon, so maybe the AWS Santa will deliver…

Speaking at Cassandra Europe

I’ll be speaking at Cassandra Europe on March 28th 2012. I’m in the “Case Studies” track, sharing our experiences using Cassandra to power high-load applications like X Factor and Britain’s Got Talent.

It looks like this is going to be a great conference for anyone using Cassandra in Europe, and I look forward to hearing more about what people are doing, and meeting some others using Cassandra in weird and wonderful ways.

Behind the scenes: Using Cassandra & Acunu to power Britain’s Got Talent

In some previous posts, I’ve talked about how we scaled Django up to cope with the loads for Britain’s Got Talent. One area I haven’t talked about yet is the database.

For BGT, we were planning for peak voting loads of 10,000 votes/second. Our main database runs on MySQL using the Amazon Relational Database Service. Early testing showed there was no way we could hit that level using RDS – we were maxing out at around 300 votes/s on an m1.large database instance. Even though there are larger instances, they’re not 30x bigger, so we knew we needed to do something different.

We knew that various NoSQL databases would be able to handle the write load, but the team had no experience in operating NoSQL clusters at scale. We had less than 2 weeks before first broadcast, and all the options available were both uncertain and high risk.

Then a mutual friend introduced us to Acunu. They not only know all about NoSQL, but have a production-grade Cassandra stack using their unique storage engine that works on EC2. Tom and the team at Acunu quickly did some benchmarking on EC2 to show that the write volume we were expecting would be easily handleable, as well as testing out the Python bindings for Cassandra. That gave us good confidence that this could easily scale to the loads we were expecting, with plenty of headroom if things went mental.

We wired Cassandra into our stack, and started load testing against a 2-node Cassandra cluster. While we’d originally expected to need more nodes, we found that the cluster was easily able to absorb the load we were testing with, thanks to the optimisations in the Acunu stack.

So how did it all go? Things were tense as the first show was broadcast and we saw the load starting to ramp up, but the Acunu cluster worked flawlessly. As we came towards the start of the live shows, we were totally comfortable that it was all working well.

Then AWS told us that the server hosting one of the Cassandra instances was degraded and might die at any point. Just before the first live finals. We weren’t too worried as adding a new node to a cluster is a simple operation. We duly fired up a new EC2 instance and added it to the cluster.

Then things went wrong. For some reason, the new node didn’t integrate properly into the cluster and now we had a degraded cluster that couldn’t be brought back online. And only a few hours until showtime. I love live TV!

The team at Acunu were fantastic in supporting us (including from a campsite in France!) both to set up a new cluster and to diagnose the problem with the degraded cluster. For the show, we switched over to the new cluster as we still hadn’t been able to figure out what was wrong with the old one (it turned out to be a rare bug in Cassandra).

Thankfully the shows went off without a hitch and no-one saw the interesting juggling act going on to keep the service running.

So a big thank you to the team at Acunu for their help “behind the scenes” at BGT – we couldn’t have done it without them.