Scaling to 30K: Tsung

The first problem when building something to scale is testing it. Apache Bench (‘ab’) and JMeter are well-known tools that can simulate load, but both run into issues when simulating large numbers of users.

Tsung is the load testing tool we used, and it scales effortlessly. Running across a few EC2 nodes, Tsung was easily able to generate tens of thousands of requests per second – and provide nice graphs to show what was going on.

Some notes from my experience with it:

  • Tsung is largely CPU-bound on EC2, and due to the way Erlang SMP works it doesn’t seem worth running it on multi-core machines. Luckily m1.small nodes are cheap and single-core, so we simply used more of them
  • If you’re seeing error_connect_emfile or other Erlang errors, you’re probably hitting Linux per-user resource limits. Editing /etc/security/limits.conf to increase the nofile limit to 50000 or so will solve the problem (see the snippet after this list)
  • We used Chef to automate creation of Tsung nodes, which makes scaling up the load generation trivially easy
  • Keep a close eye on the Tsung nodes to make sure they haven’t run out of steam. Often what looks like your system hitting a scaling limit is actually the Tsung nodes hitting theirs. Tell-tale signs are CPU hitting 100% and the user generation rate fluctuating
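For reference, a minimal sketch of the limits.conf change mentioned above – the ‘*’ domain applies it to every user; in practice you may want to name only the account that runs Tsung:

    # /etc/security/limits.conf – raise the open-file limit for the Tsung user
    *    soft    nofile    50000
    *    hard    nofile    50000

The new limits only apply to fresh login sessions, so log out and back in (or restart the node) before re-running the test.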

Overall Tsung made generating a load of 30K requests/s a relatively simple process (at least compared with making the system cope with that load!). I don’t know why it’s not more widely used – I can only guess that being written in Erlang makes it seem a bit off-putting. But on a modern Linux system it compiles and installs without issue, and the Erlang underpinnings mean it scales horizontally with no fuss.
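For anyone who hasn’t seen it, here’s a rough sketch of the sort of tsung.xml that drives a distributed HTTP test – the hostnames, URL, arrival rate and DTD path are all placeholders to adapt to your own setup, not the exact configuration we used:

    <?xml version="1.0"?>
    <!DOCTYPE tsung SYSTEM "/usr/share/tsung/tsung-1.0.dtd">
    <tsung loglevel="notice" version="1.0">
      <!-- one <client> entry per load-generating node, e.g. each m1.small -->
      <clients>
        <client host="loadgen1" maxusers="30000"/>
        <client host="loadgen2" maxusers="30000"/>
      </clients>
      <!-- the system under test -->
      <servers>
        <server host="target.example.com" port="80" type="tcp"/>
      </servers>
      <!-- how quickly new simulated users arrive -->
      <load>
        <arrivalphase phase="1" duration="10" unit="minute">
          <users arrival_rate="500" unit="second"/>
        </arrivalphase>
      </load>
      <!-- what each simulated user does -->
      <sessions>
        <session name="simple" probability="100" type="ts_http">
          <request> <http url="/" method="GET" version="1.1"/> </request>
        </session>
      </sessions>
    </tsung>

Run it with ‘tsung -f tsung.xml start’ on the controller node; Tsung starts the remote nodes over SSH, and running tsung_stats.pl in the resulting log directory produces the HTML report and graphs.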

3 responses to “Scaling to 30K: Tsung”

  1. Another option is to use bees with machine guns: https://github.com/newsapps/beeswithmachineguns

    • Yes, I’ve tried bees with machine guns. It works, but because it uses ApacheBench underneath it has only a limited ability to generate different sorts of traffic, and the reporting/graphing of the results just isn’t there. At tens of thousands of requests per second, being able to track exactly when the various error conditions start to happen is important, as is spotting errors caused by the test harness itself. ab isn’t great at this.

  2. How many small EC2 instances did you use to reach the 30k req/s? I’m hitting a limit of around 250 req/s per small instance
