At Handshake we rely heavily on automated testing to protect against regressions and to give us confidence and speed when writing and deploying code. Some of the technologies we use to test our code include RSpec, Capybara, FactoryGirl, and occasionally Selenium. As our application grew and we moved towards a component-based architecture, we saw an increase in flaky tests. Flaky tests are harmful because they erode the trust you have in your test suite: once that trust is gone, you might shrug off a failure as just being flaky when in fact there is a real problem. Since we rely so heavily on automated testing and didn’t want to lose this trust, we began implementing various tools and practices to keep tests from being flaky. These are the steps that we’ve taken to reduce flakiness in our test suite. Your results may vary.
The first step that we took revolved around preventing flaky tests from
happening in the first place. This included spreading knowledge about how to
write great tests and providing helper functions that avoid common pitfalls.
Spreading knowledge involved everyone documenting best practices so that we
could share the lessons we’ve learned. One simple technique we use is overriding
the default wait time of have_content. By default Capybara waits two seconds
before timing out, but for some long-running actions increasing this to four
seconds, for example, gives the action enough time to finish. Increasing the
wait time is much better than adding something like sleep 4 to your tests:
sleep 4 will always take four seconds, whereas overriding the wait time
returns as soon as the content appears and takes at most four seconds.
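Conceptually, a Capybara waiting matcher is a retry loop with a deadline. Here is a minimal plain-Ruby sketch of that idea (wait_for is a hypothetical helper for illustration, not Capybara’s actual implementation):

```ruby
# Minimal sketch of a retry-with-deadline loop, the idea behind
# Capybara's waiting matchers (simplified; real Capybara also
# re-synchronizes with the browser between attempts).
def wait_for(timeout:, interval: 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield            # succeed as soon as the check passes
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval                  # brief pause before retrying
  end
end

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
# This condition becomes true after roughly 0.2 seconds, so wait_for
# returns well before the 4-second deadline; raising the timeout only
# costs time when the condition is genuinely slow to become true.
result = wait_for(timeout: 4) do
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started > 0.2
end
```

This is why the override beats sleep 4: the loop exits the moment the check passes, and only a genuinely slow (or failing) action pays the full timeout.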
To avoid common pitfalls we’ve built numerous helper functions that make writing
great tests easier. A common pitfall in a component-based architecture is not
waiting properly for an AJAX request to finish. With Capybara you can write
expect(page).to have_content('foo'), which keeps checking the page
for the content “foo” for up to two seconds by default. This provides a
loosely-coupled way of waiting for AJAX events to finish, which is sufficient
in most cases, but in some scenarios checking for content changes
alone won’t guarantee an AJAX call has finished.
An example of this in Handshake is a dropdown option we have to approve an
employer. When you click approve we send an AJAX request to our servers and mark
the employer as approved in the UI. If the approval isn’t successful, we show an
error message to the user and revert the UI back to showing the employer as
pending. In this scenario, checking for content changes alone
wouldn’t guarantee that the action completed successfully. All AJAX requests in
Handshake are wrapped in a custom class (so we can show a loading bar for all
actions), so we’ve written the following helper method to ensure an action has finished:
    def wait_for_handshake_ajax_to_finish(wait_time = Capybara.default_max_wait_time)
      starting_workers_created = page.evaluate_script('Handshake.operations.workers_created()')
      yield
      Timeout.timeout(wait_time) do
        loop do
          workers_created = page.evaluate_script('Handshake.operations.workers_created()')
          active_workers = page.evaluate_script('Handshake.operations.workers()')
          break if workers_created > starting_workers_created && active_workers.zero?
        end
      end
    end
This helper records the starting count of workers (i.e. actions), yields to a block where the action is performed, and then loops until the worker count has changed (meaning the action was started) and all running workers have finished (meaning the action completed). At first we implemented a way to wait for AJAX events to complete (similar to the method described here), but we were running into race conditions where the action would complete before we started waiting, which would result in a timeout error. Recording the starting worker count and waiting for it to change helped us avoid this race condition.
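The race-avoidance can be simulated without a browser. In the sketch below, a fake tracker stands in for page.evaluate_script (FakeOperations and wait_for_ajax are hypothetical names for illustration): even when the action starts and finishes before the waiting loop begins, the recorded starting count has still changed, so the wait succeeds instead of timing out.

```ruby
require "timeout"

# Fake stand-in for Handshake.operations, tracked in-process instead of
# via page.evaluate_script.
class FakeOperations
  attr_reader :workers_created, :workers

  def initialize
    @workers_created = 0  # total actions ever started
    @workers = 0          # actions currently running
  end

  def start
    @workers_created += 1
    @workers += 1
  end

  def finish
    @workers -= 1
  end
end

# Same shape as the helper above: record the starting count, run the
# action, then wait until the count has risen and all workers are done.
def wait_for_ajax(ops, wait_time = 2)
  starting = ops.workers_created
  yield
  Timeout.timeout(wait_time) do
    loop do
      break if ops.workers_created > starting && ops.workers.zero?
    end
  end
end

ops = FakeOperations.new
# The "AJAX call" completes synchronously, before the loop even starts,
# yet no timeout occurs: workers_created already exceeds the recorded
# starting value, which is exactly the race the real helper avoids.
wait_for_ajax(ops) { ops.start; ops.finish }
```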
The next step that we took to help avoid flaky tests was to randomize some of
the fields that we use in our tests, as well as randomizing the order in which
tests run. The two main fields that we randomize are the id field and the
time zone field (on records that have a time zone). Randomizing the ids helps
us catch any areas where we might be looking up records by id using ids from
the wrong table.
Randomizing time zones helps us ensure that our site behaves properly no matter
what time zone a user might be in. This especially comes in handy for testing our
appointment scheduling system, ensuring that a student studying abroad in
Beijing can schedule an appointment with an advisor back in the US, for example.
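A minimal sketch of both randomizations, with all names hypothetical (Handshake’s real version presumably hooks into its factories): each table gets its own random id offset, and one time zone is sampled per run from a seeded generator so a failing run can be reproduced.

```ruby
# Hypothetical sketch of per-run randomization for test fixtures.
ZONES = [
  "America/New_York",
  "America/Los_Angeles",
  "Asia/Shanghai",       # e.g. a student studying abroad in Beijing
  "Pacific/Auckland"
].freeze

# Seeding makes a failing run reproducible: print the seed alongside the
# failure and re-run with the same value (TEST_SEED is a made-up variable).
seed = ENV.fetch("TEST_SEED", Random.new_seed).to_i
rng  = Random.new(seed)

# One random zone for the whole run; in a Rails suite you might assign
# this to Time.zone in a before hook.
zone = ZONES.sample(random: rng)

# Each table starts its ids at a different random offset, so a lookup
# that accidentally uses an id from the wrong table is unlikely to find
# a row that happens to exist.
id_offsets = Hash.new { |h, table| h[table] = rng.rand(10_000..1_000_000) }

def next_id(offsets, table)
  offsets[table] += 1
end

employer_id = next_id(id_offsets, :employers)
student_id  = next_id(id_offsets, :students)
```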
Randomizing the order in which tests run helps ensure that tests aren’t inadvertently relying on one another. At first we saw a rise in build failures caused by tests affecting each other, but little by little we fixed these flakes, which improved the overall quality of our tests. In particular, randomizing the order surfaces issues where objects aren’t cleaned up properly after a test finishes, Elasticsearch indexes aren’t reset, or caches aren’t invalidated. In general, our test suite is structured so that it performs this cleanup automatically and each test starts with a clean slate, but a few special cases rely on the developer to do it themselves.
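With RSpec, order randomization is a one-line configuration (this matches the settings RSpec itself suggests in its generated spec_helper; the seed printed with each run lets you reproduce an order-dependent failure):

```ruby
RSpec.configure do |config|
  # Run specs in a random order, printing the seed so a failing order
  # can be reproduced with `rspec --seed <seed>`.
  config.order = :random

  # Seed Ruby's own RNG from the same value, so in-test randomness
  # (like randomized ids and time zones) is reproducible too.
  Kernel.srand config.seed
end
```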
The last step that we took was building an internal tool to monitor our test suite. We run our test suite on CircleCI, so this internal tool processes the data CircleCI sends us after every build finishes, and we do a few different things with it. Since CircleCI splits our test suite up and runs it in parallel, we get code coverage percentages per build container. The first thing we do, then, is merge the code coverage data from each container, which gives us our overall code coverage percentage and allows us to set goals and track our progress over time. We also track test failures over time, which lets us monitor how our test quality is trending and quickly determine whether a test has a history of failing. When the tool detects that a test is flaky, it creates an internal incident, assigns it to the author of the commit, and alerts them via Slack. This ensures that flaky tests are fixed as soon as possible, so we can continue to have confidence in our test suite.
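Merging per-container coverage is mechanical once each container reports line-by-line hit counts. A hedged sketch, assuming the common format where each file maps to an array of per-line counts with nil for non-executable lines (the shape Ruby’s built-in Coverage module produces; the merge functions and sample data are illustrative, not our tool’s actual code):

```ruby
# Merge two coverage maps: file => [per-line hit counts, nil for lines
# that aren't executable]. Counts add; nil stays nil.
def merge_coverage(a, b)
  (a.keys | b.keys).each_with_object({}) do |file, merged|
    lines_a = a[file]
    lines_b = b[file]
    merged[file] =
      if lines_a && lines_b
        # nil && anything is falsy, so non-executable lines stay nil.
        lines_a.zip(lines_b).map { |x, y| x && y ? x + y : (x || y) }
      else
        lines_a || lines_b  # file only ran in one container
      end
  end
end

# Overall percentage: covered executable lines / all executable lines.
def percent_covered(coverage)
  lines = coverage.values.flatten.compact
  (100.0 * lines.count(&:positive?) / lines.size).round(2)
end

container_1 = { "app/models/employer.rb" => [nil, 2, 0, 1] }
container_2 = { "app/models/employer.rb" => [nil, 0, 0, 0],
                "app/models/student.rb"  => [nil, 1] }
merged = merge_coverage(container_1, container_2)
```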
By randomizing our tests, building a tool to monitor and report flaky tests, and adding helper functions that make flakiness easy to avoid, we’ve seen an overall rise in the quality and reliability of our test suite. In fact, our build success rate increased by 38% over the past six months. We see far fewer flaky tests now, and when they do come up, they’re handled quickly. This allows our development team to keep trusting our test suite so we can iterate quickly and confidently while avoiding regressions along the way.