Fall signifies the start of the school year, and with that a renewed interest in the job search. At Handshake, it’s our busiest time of the year. This past summer, we decided to verify that our application could handle forecasted student traffic.
The goal was to make sure that the influx of new students joining and using our platform wouldn’t negatively impact our site. To accomplish that, I set out to learn about the systems behind the scenes, to understand how students use our app, and to apply those learnings to simulate student traffic to test our application. Now that Fall is over, I’ll explain what I did and what I learned along the way.
Getting started: monitoring our app
Before I could simulate student traffic, I needed to understand what that traffic looked like at the moment. In some ways, this was straightforward to do. Handshake is a Ruby on Rails application that runs on Heroku and AWS, and uses several different monitoring tools to understand real user activity.
Here are a few our team uses.
Where our requests spend all their time.
NewRelic automatically extracts a ton of information about our application at runtime and can build a decent picture of what’s going on. The two tabs I frequently toggle between are Transactions (the time-consuming areas) and External services (how much time is spent communicating with other services).
Our students love search.
200 vs non-200 HTTP response codes.
We use Librato to track time and count specific activity within our application. NewRelic paints an outline of our application, but we use Librato to measure specific areas and see the finer details.
I spent most of my time looking at NewRelic and our Librato dashboards, asking questions and double-checking assumptions with colleagues. Up to this point, I hadn’t spent much time using the application as a student, but I would end up doing so right before writing the performance test scripts.
In addition to these tools, we also have error monitoring and a centralized logging tool in place, which together give us a complete view of our application’s landscape.
The performance test environment
Keeping the site operational is one of our top priorities, so load testing against our production environment was not an option. We decided to provision an entirely new environment, representative of our production infrastructure, where we could run our tests without the risk of any customer-facing side effects.
Since I had researched and discussed findings with other engineers on our team, I knew which services to include — anything related to search. There were a few services we skipped: the event service API and document conversion service API. I examined both and concluded they were simple to operate and reason about, in terms of performance.
If you’re hoping we had Chef/Puppet/Ansible/Terraform or other automated tools that would, with a single command, provision an entire production-like environment complete with seeded data, I am here to share with you that we don’t, and that most of my time was spent spinning up internal services (new PostgreSQL production leaders & followers, new Elasticsearch cluster, new Memcached instances, etc.) and wiring them together, manually. I also enabled monitoring tools for this new environment so we could collect and analyze results.
Then, we cloned production data.
Reproducing test data for performance testing is challenging. Algorithmically created data often fails to be diverse enough to discover performance problems. We ended up using a copy of real production data to get a reasonably accurate mix of students, schools, employers, and job postings. As students do a lot of search within our application, we also reindexed all the necessary content in Elasticsearch.
Finally, we disabled customer-facing features (geocoding, feature flagging, transactional mail, etc.) that were unnecessary for performance testing.
Writing your first performance test
For writing and running performance tests, JMeter can’t be beat. Putting together a good test plan in JMeter requires a deep understanding of customer behavior, in this case, a student using a browser. For example, here is what’s happening when a student logs in.
- Opens their browser and visits https://app.joinhandshake.com/.
- Browser redirects to https://app.joinhandshake.com/login.
- Student types email address and presses enter.
- Student types password and presses enter.
- Browser issues a POST request with form parameters and CSRF token to our backend.
- The student is redirected to their dashboard.
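The last two steps are the part a test script has to reproduce by hand: read the CSRF token out of the login page, then send it back with the credentials. Here is a minimal sketch of that idea in Python. It assumes the page carries the token in a Rails-style `<meta name="csrf-token">` tag; the form field names and the sample values are placeholders for illustration, not our real parameters.

```python
import re
from urllib.parse import urlencode

# Assumption: the token is exposed the way Rails does by default,
# in a <meta name="csrf-token" content="..."> tag.
CSRF_META = re.compile(r'<meta\s+name="csrf-token"\s+content="([^"]+)"')


def extract_csrf_token(html: str) -> str:
    """Return the CSRF token embedded in the login page HTML."""
    match = CSRF_META.search(html)
    if match is None:
        raise ValueError("no CSRF token found in page")
    return match.group(1)


def build_login_form(html: str, email: str, password: str) -> str:
    """Build the URL-encoded body for the login POST.

    The field names are hypothetical placeholders.
    """
    return urlencode({
        "authenticity_token": extract_csrf_token(html),
        "email_address": email,
        "password": password,
    })


# Made-up page and credentials, only to show the round trip.
sample_page = '<html><head><meta name="csrf-token" content="abc123"></head></html>'
print(build_login_form(sample_page, "student@example.edu", "secret"))
```

In a real run, the HTML would come from the GET on the login page, and the cookies set by that response would have to be sent along with the POST.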
Eventually, we ended up with several scenarios. Here’s an example of one:
- Login
  - Visit app and set cookies
  - Get CSRF token from HTML with a regular expression
  - Login with email_address, password, and CSRF token
- View job postings
  - Visit jobs page
  - Search for a job with the /postings.json endpoint and query params
Which I translated into the following JMeter test plan:
I added a random delay timer that simulates user activity by introducing a gap between each step.
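For readers who have not used a random timer before, the idea is a fixed pause plus a random extra, so that simulated students do not all click in lockstep. A rough Python sketch, where the 1 second offset and 2 second maximum are made-up values rather than the ones from my test plan:

```python
import random


def think_time(constant_offset_s: float, random_max_s: float,
               rng: random.Random) -> float:
    """Return a pause of constant_offset_s plus a uniform random extra."""
    return constant_offset_s + rng.uniform(0.0, random_max_s)


rng = random.Random(42)  # seeded so a local run is repeatable
delays = [think_time(1.0, 2.0, rng) for _ in range(5)]
print([round(d, 2) for d in delays])  # each value falls between 1.0 and 3.0
```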
I’m also using a Throughput Shaping Timer to define specific throughput goals.
In this test plan, I’m starting out with 6 requests per second (RPS), ramping up to 24 RPS over 10 minutes, and holding that throughput for another 8 minutes.
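To get a feel for what that profile means in total load, it can be written down as a small function. This is only a sketch, and it assumes the ramp is linear, which is my simplification here and not something the timer requires.

```python
def target_rps(elapsed_s: float,
               start_rps: float = 6.0,
               peak_rps: float = 24.0,
               ramp_s: float = 600.0,
               hold_s: float = 480.0) -> float:
    """Target throughput at a given moment of the test."""
    if elapsed_s < 0 or elapsed_s > ramp_s + hold_s:
        return 0.0  # outside the test window
    if elapsed_s < ramp_s:
        # Assumed linear ramp from start_rps to peak_rps.
        return start_rps + (peak_rps - start_rps) * elapsed_s / ramp_s
    return peak_rps  # hold phase


def total_requests(start_rps: float = 6.0,
                   peak_rps: float = 24.0,
                   ramp_s: float = 600.0,
                   hold_s: float = 480.0) -> float:
    """Area under the profile: a trapezoid for the ramp plus a rectangle."""
    ramp = (start_rps + peak_rps) / 2 * ramp_s
    hold = peak_rps * hold_s
    return ramp + hold


print(target_rps(300))    # halfway through the ramp
print(total_requests())   # requests issued over the full 18 minutes
```

Under the linear assumption, the 18-minute plan issues roughly 20,520 requests: 9,000 during the ramp and 11,520 during the hold.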
Running your first performance test
My process is to run test plans locally for a couple of minutes, as that tends to uncover configuration issues. Once I’m comfortable with the results, I’ll upload the plan to a cloud-based performance testing service and run it for 30, 60, and 90 minutes. I don’t usually run 24-hour tests because most issues are uncovered at the 30- and 90-minute marks, and also because it’s expensive. Our monitoring tools collect a lot of data after each run so we can understand how our application behaved.
Following each test, I attempt to answer the following questions:
- Did request queue times increase?
- What were the 95th- and 99th-percentile response times?
- Was there a spike in 4xx or 5xx HTTP response codes?
- What did overall CPU utilization, network traffic, and PostgreSQL look like?
- Were there any anomalies in our logs or error monitoring tools?
- Did we DDoS anything?
- Given Little’s Law, how should we tune the number of processes and threads on the next test run?
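Two of these questions come down to simple arithmetic. Percentiles can be read off the sorted response times, and Little’s Law (L = λW) says the average number of requests in flight equals the arrival rate times the average response time. The sketch below shows both; the 250 ms response time and the 1.5x headroom factor are hypothetical numbers chosen for illustration, not measurements from our tests.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def requests_in_flight(arrival_rps: float, avg_response_s: float) -> float:
    """Little's Law, L = lambda * W."""
    return arrival_rps * avg_response_s


def workers_needed(arrival_rps: float, avg_response_s: float,
                   headroom: float = 1.5) -> int:
    """Round the in-flight estimate up, with a hypothetical safety margin."""
    return math.ceil(requests_in_flight(arrival_rps, avg_response_s) * headroom)


response_times_ms = list(range(1, 101))       # stand-in data: 1 ms .. 100 ms
print(percentile(response_times_ms, 95))      # 95
print(requests_in_flight(24, 0.25))           # 6.0
print(workers_needed(24, 0.25))               # 9
```

At 24 RPS and a quarter-second average response, about six requests are in flight at any moment, so the total of processes times threads needs to sit comfortably above that.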
When tests produced strange results, I’d spend time discovering the root cause, fix it, and re-run the test. When tests went well, I’d change one of the many variables available and re-run the test. I continued this run-evaluate-fix-repeat loop until we were confident our application could handle Fall-level traffic.
What’s in the future
There are a number of areas I’d love to improve next time around.
In the near term, I’m looking forward to spending more time optimizing our frontend experience so that search is truly snappy.
Our performance test results were scattered across multiple tools. Having everything seamlessly available in fewer tools, or in a single log, would be valuable, even for our daily site reliability responsibilities.
Finally, over time, I think it would be interesting to truly think about performance as a feature, and to have a production environment on which we can run and replay requests.
Verifying that our application could handle new student traffic took a significant amount of work, but was one of the reasons we were able to enter Fall confidently.
I spent a lot of time poring over our monitoring tools, talking to colleagues, and, of course, using our application.
Overall, it was a great challenge and a fun learning experience too. Did I forget to mention I started this in my second week on the job?
How can I help you
Okay, let’s shift gears a bit. Here’s a list of resources for getting started with your own performance test.
- Find and learn a load testing tool — it doesn’t have to be JMeter — here are some command line tools
- Various tools I recommend:
- A few papers and books on the subject:
- Finally, a handful of tips for improving your application’s performance profile.
- If you’re making a network call, double check that the underlying transport has set timeout values for REQUEST_TIMEOUT and OPEN_TIMEOUT.
- Performance degradations occur when we overuse our resources. Put a time limit, a memory limit, or a constraint around that work.
- Launching threads manually with new Thread()? Maybe use a fixed-size thread pool.
- Using an in-memory cache? Consider an LRU cache.
- Reading large CSV files? Check out the low-level IO methods for iterating through the file.
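The last three tips share one theme: put a bound on the resource. Here is a small Python illustration of what that can look like with the standard library. The function names, pool size, cache size, and sample data are all invented for the example, and the same ideas carry over to other languages.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=2)  # LRU cache: at most 2 results are kept in memory
def square(n: int) -> int:
    return n * n


def sum_column(csv_file, column: str) -> int:
    """Stream a CSV row by row instead of loading the whole file."""
    total = 0
    for row in csv.DictReader(csv_file):
        total += int(row[column])
    return total


# Fixed-size pool: never more than 4 worker threads, however many tasks queue up.
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(square, range(10)))

print(squares)
print(square.cache_info().currsize)  # never exceeds maxsize
print(sum_column(io.StringIO("id,count\n1,10\n2,32\n"), "count"))
```

In real use, `sum_column` would be handed an open file object, so only one row is held in memory at a time.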
I hope you find these useful.
Thanks to Chris Schmitz for reviewing this post before its publication.