In some previous posts, I’ve talked about how we scaled Django up to cope with the loads for Britain’s Got Talent. One area I haven’t talked about yet is the database.
For BGT, we were planning for peak voting loads of 10,000 votes/second. Our main database runs on MySQL using the Amazon Relational Database Service. Early testing showed there was no way we could hit that level using RDS – we were maxing out at around 300 votes/s on an m1.large database instance. Even though there are larger instances, they’re not 30x bigger, so we knew we needed to do something different.
We knew that various NoSQL databases would be able to handle the write load, but the team had no experience in operating NoSQL clusters at scale. We had less than 2 weeks before first broadcast, and all the options available were both uncertain and high risk.
Then a mutual friend introduced us to Acunu. They not only know all about NoSQL, but have a production-grade Cassandra stack using their unique storage engine that works on EC2. Tom and the team at Acunu quickly did some benchmarking on EC2 to show that the write volume we were expecting would be easily handleable, as well as testing out the Python bindings for Cassandra. That gave us good confidence that this could easily scale to the loads we were expecting, with plenty of headroom if things went mental.
We wired Cassandra into our stack, and started load testing against a 2-node Cassandra cluster. While we’d originally expected to need more nodes, we found that the cluster was easily able to absorb the load we were testing with, thanks to the optimisations in the Acunu stack.
So how did it all go? Things were tense as the first show was broadcast and we saw the load starting to ramp up, but the Acunu cluster worked flawlessly. As we came towards the start of the live shows, we were totally comfortable that it was all working well.
Then AWS told us that the server hosting one of the Cassandra instances was degraded and might die at any point. Just before the first live finals. We weren’t too worried as adding a new node to a cluster is a simple operation. We duly fired up a new EC2 instance and added it to the cluster.
Then things went wrong. For some reason, the new node didn’t integrate properly into the cluster and now we had a degraded cluster that couldn’t be brought back online. And only a few hours until showtime. I love live TV!
The team at Acunu were fantastic in supporting us (including from a campsite in France!) both to set up a new cluster and to diagnose the problem with the degraded cluster. For the show, we switched over to the new cluster as we still hadn’t been able to figure out what was wrong with the old one (it turned out to be a rare bug in Cassandra).
Thankfully the shows went off without a hitch and no-one saw the interesting juggling act going on to keep the service running.
So a big thank you to the team at Acunu for their help “behind the scenes” at BGT – we couldn’t have done it without them.