Handshake ships new code continuously, multiple times a day. This fast deployment pace lets us build and maintain features quickly and safely, but it also pushes the build process to its limits. One of these limits is the test suite: when doing continuous delivery, a fast test suite becomes paramount. As our CI continued to slow with new features and CI checks, we started to look into alternatives to our hosted CI solution. After an extensive search, we decided to switch to Buildkite for our CI pipelines.
Slow Test Suites
Handshake has grown quite a bit over the past few years. We’ve seen our build time go from 5 minutes on one container, to 15 minutes, back to 5 minutes on many containers, to upwards of 25 minutes on 6 containers at its longest. Although 25 minutes may not seem like a lot, it can feel like an eternity to an engineer in the flow. It also adds up quickly: a 25-minute build means nearly an hour of time-to-production per change, with 25 minutes for the feature branch and 25 more after merging into master. If we wanted to continue to move quickly, we knew we needed to address the problem and cut our test suite time drastically.
Why so slow?
When evaluating our test suite time there were a few obvious bottlenecks.
Because we use some custom libraries not provided by our previous hosted CI, we needed to install new libraries on each build. Oftentimes these could be cached, but not always, and as a result our fixed up-front cost for any build was 7 minutes of installing libraries. These custom libraries ranged from new Ruby versions that were not yet supported by the provider to an upgraded Qt for capybara-webkit that fixed other issues such as flaky tests. The rest of the build process took around 18 minutes, with most of that time spent running the actual tests.
This meant that there were two ways to improve our build time. First, we wanted to cut the upfront cost as much as possible. We hoped to have an image with exactly the libraries relevant to us that we could control at a more granular level and re-use for every build. We also knew that by using more containers (parallelization) we could continue to cut down the build time; more containers = faster tests.
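To make those two levers concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: it assumes test work splits perfectly evenly across containers, which real suites never quite manage, and the function names are made up for illustration. The 7 and 18 minute figures and the 6 containers come from the numbers above; the what-if values are hypothetical.

```python
def build_minutes(setup_minutes: float, test_work_minutes: float, containers: int) -> float:
    """Idealized wall-clock build time: fixed setup, then test work split evenly."""
    if containers < 1:
        raise ValueError("containers must be at least 1")
    return setup_minutes + test_work_minutes / containers


def time_to_production_minutes(build: float) -> float:
    """Each change is built twice: once on the feature branch, once on master."""
    return 2 * build


# Old setup: 7 minutes of installs, then roughly 18 minutes of tests on 6 containers.
OLD_SETUP = 7.0
OLD_CONTAINERS = 6
# Total test work in container-minutes (assumes the old split was even).
TOTAL_TEST_WORK = 18.0 * OLD_CONTAINERS

baseline = build_minutes(OLD_SETUP, TOTAL_TEST_WORK, OLD_CONTAINERS)
print(f"old build:          {baseline:.1f} min")
print(f"time to production: {time_to_production_minutes(baseline):.1f} min")

# Hypothetical what-ifs, one per lever, then both together.
print(f"setup cut to 1 min: {build_minutes(1.0, TOTAL_TEST_WORK, OLD_CONTAINERS):.1f} min")
print(f"12 containers:      {build_minutes(OLD_SETUP, TOTAL_TEST_WORK, 12):.1f} min")
print(f"both:               {build_minutes(1.0, TOTAL_TEST_WORK, 12):.1f} min")
```

The model also shows why neither lever is enough on its own: adding containers only shrinks the second term, so the fixed setup cost puts a floor under the build time no matter how much compute is added.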
When we initially found Buildkite, it was unclear whether they provided enough to be a viable solution. Buildkite’s approach to CI is “bring your own compute”. This means that they provide the UI for viewing your tests, seeing tests in progress, editing your pipelines, and managing account settings, while you provide the compute and infrastructure to actually run the tests. At first this feels like a steep task, but with the great open source tools provided by Buildkite, such as the elastic-ci-stack, it’s not nearly as difficult as it seems.
We quickly started experimenting with Buildkite. Smaller services could be configured and passing within a few hours, and we had complete control over our CI environment. As we moved more apps to Buildkite, we quickly began to discover many of the benefits.
Fast Compute, Please
Exact Docker Images
Next, we took advantage of the docker-compose plugin supported by the beta build of Buildkite. This plugin allows us to build the exact Docker image we need for our tests and re-use that image later when running the actual tests. By using dedicated instances for building the Docker images, we can take advantage of Docker’s caching functionality.
This means that for a normal build, the upfront cost for our Docker images is merely the time it takes to upload and download the changed layers to and from Docker Hub, plus a few service initializations such as the postgres schema load. In most cases, our test containers start running tests in about 1.5 minutes, including Docker image upload/download, starting services like postgres, schema load, and precompiling all assets. The majority of that time (roughly one minute) is spent uploading changed layers to Docker Hub, a time we hope to cut down even further.
All the Containers!
Suddenly we were living in a new type of world. Instead of each container spending 7 minutes before it can run any tests, our dedicated test-running containers spend merely 30 seconds before they are productive. This means that containers cycle in and out much faster, and even with roughly the same number of containers as on our old provider, we can parallelize our Buildkite builds much more without the queue backing up. Previously we were running only 6 containers per build, and the queue would back up on a regular basis. With Buildkite we are running 16 containers in parallel for our RSpec tests, plus 3 other containers for some additional CI, and our queue rarely backs up because tests run quickly and move on!
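This post doesn't go into how the specs are divided among the parallel containers, so treat the following as one simple possibility rather than a description of our setup. It is a minimal Python sketch that deals spec files out round-robin, using the BUILDKITE_PARALLEL_JOB and BUILDKITE_PARALLEL_JOB_COUNT environment variables that Buildkite sets on parallel steps. The function name and the spec file names are hypothetical.

```python
import os


def specs_for_job(spec_files, job_index, job_count):
    """Return the slice of spec files that one parallel job should run."""
    if not 0 <= job_index < job_count:
        raise ValueError("job_index must be in the range [0, job_count)")
    # Sort first so every job sees the same ordering, then deal files round-robin.
    return sorted(spec_files)[job_index::job_count]


if __name__ == "__main__":
    # Buildkite sets these on steps that use parallelism; fall back to a single job locally.
    index = int(os.environ.get("BUILDKITE_PARALLEL_JOB", "0"))
    count = int(os.environ.get("BUILDKITE_PARALLEL_JOB_COUNT", "1"))

    # Hypothetical file names, for illustration only.
    all_specs = [f"spec/models/example_{n}_spec.rb" for n in range(40)]

    mine = specs_for_job(all_specs, index, count)
    print(f"job {index + 1} of {count} runs {len(mine)} spec files")
```

Round-robin splitting balances the number of files per container, not their runtime, so one slow spec file can still make a container finish late. Splitting by recorded timings would balance better at the cost of more bookkeeping.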
Keeping Costs Down
Although cost savings were not a primary goal for our transition, we were able to keep costs from rising above what we paid our previous provider. In fact, our first month's costs were almost exactly the same as with the previous provider. This was particularly nice (and somewhat of a surprise to us) given our considerably faster build time and larger number of instances per build. Sixteen containers is quite a bit more than six!
We were able to accomplish this through a few means. Firstly, auto scaling of our test infrastructure means that we aren’t spending money on instances when we don’t need them (at night and during weekends). Secondly, we use spot instances, which are often considerably cheaper than normal on-demand instances. Lastly, the increased efficiency of our builds (in particular, re-use of Docker images) means that we get more test time out of every instance-hour we pay for.
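The first two of those savings can be sketched as simple arithmetic in Python. Everything numeric here is a placeholder: the prices and active hours are invented purely to show the shape of the calculation, not our real bill or any actual EC2 pricing, and the function name is hypothetical.

```python
HOURS_PER_WEEK = 24 * 7


def weekly_cost(instances, hourly_price, active_hours=HOURS_PER_WEEK):
    """Cost of running `instances` machines for `active_hours` of the week."""
    if not 0 <= active_hours <= HOURS_PER_WEEK:
        raise ValueError("active_hours must fit within one week")
    return instances * hourly_price * active_hours


# Placeholder inputs, NOT real prices or real usage.
ON_DEMAND_PRICE = 1.00   # price units per instance-hour (made up)
SPOT_PRICE = 0.40        # price units per instance-hour (made up)
BUSY_HOURS = 12 * 5      # e.g. 12 hours a day on weekdays (made up)

always_on = weekly_cost(16, ON_DEMAND_PRICE)
scaled = weekly_cost(16, ON_DEMAND_PRICE, BUSY_HOURS)
scaled_spot = weekly_cost(16, SPOT_PRICE, BUSY_HOURS)

print(f"always on, on-demand:   {always_on:.2f}")
print(f"auto scaled, on-demand: {scaled:.2f}")
print(f"auto scaled, spot:      {scaled_spot:.2f}")
```

The two effects multiply: auto scaling shrinks the hours paid for, and spot pricing shrinks the price of each hour, which is how more instances per build can still cost about the same overall.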
What does all of this mean for our build times? Our builds now finish in around 9 minutes! This has had drastic positive benefits:
- Engineers no longer have to context switch while waiting for CI.
- Hotfixes can get out considerably faster, usually in around 20 minutes when running both feature branch CI and master CI.
We’ve also seen benefits through using Buildkite’s more expressive and customizable pipeline structure. We can get feedback on other parts of CI, such as brakeman and bundle-audit, in less than three minutes.
Tips for Migration
Are you thinking about giving Buildkite a try? Here are a few things that worked well for us.
- Use elastic-ci-stack: It gives you a highly scalable, easy to manage CI infrastructure in minutes.
- Use the docker-compose plugin provided by the beta build of Buildkite.
- Make sure you use EC2 instances with strong specs. Not only do they speed up your builds, they also make them more reliable.
- Make sure that the whole team is on board. Switching CIs is no small task, so ensure that your new CI is clearly better than your previous one before you ask everyone to move. If coworkers are still using the old CI while you’re moving towards the new one, there’s probably a reason.
- Use spot instances for cheaper EC2 instances, but make sure to bid high enough so builds don’t lose their instances.
Switching to Buildkite has been an exciting and productive process. Although running your own build infrastructure comes with a maintenance cost, the benefits across the rest of your team can be huge. Being able to keep cutting build time by simply adding more compute is powerful. Looking forward, we plan to continue to cut down the fixed cost by reducing Docker image size (if possible) and optimizing our Docker layer structure.
Interested in learning more about our development environment? We are hiring! http://joinhandshake.com/careers/