When I first started at Handshake, our daily active user count was in the hundreds. Over the course of a few months that number has grown by about 10x. Thankfully, Heroku and its ecosystem have made it easy to scale with minimal investment in infrastructure. Of course, any web application with a quickly growing user base begins to experience bottlenecks, and below are some of the culprits we have run into.
Mind your PGs and Q’s
One of the most common bottlenecks to watch for is the database. One of the first things you can do to improve read performance is to add better indices. We had one particularly complex query that took ~600ms to run on average; after adding a single index, the query time dropped to ~30ms (a 20x improvement). Throwing indices at everything isn’t always the best solution, since every index adds overhead on writes, but in many cases the performance gains are worth it.
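To make the effect of an index concrete, here is a small sketch. It uses Python's built-in SQLite rather than Postgres so that it runs anywhere, and the `applications` table and its columns are made up for illustration. The idea carries over: before the index the planner has to scan the whole table, and afterwards it can search the index.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)

# Hypothetical table, populated with throwaway rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE applications (id INTEGER PRIMARY KEY, student_id INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO applications (student_id, status) VALUES (?, ?)",
    [(i % 1000, "open") for i in range(10_000)],
)

sql = "SELECT * FROM applications WHERE student_id = ?"
plan_before = query_plan(conn, sql, (42,))  # no index yet: full table scan

conn.execute("CREATE INDEX idx_applications_student_id ON applications (student_id)")
plan_after = query_plan(conn, sql, (42,))   # the planner can now use the index

print(plan_before)
print(plan_after)
```

On Postgres the equivalent check is to run the slow query under EXPLAIN ANALYZE before and after creating the index.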
Fortunately, Heroku has a great tool available to use within its Postgres offerings. The “Metrics” tab can help you quickly identify slow queries or queries with high throughput and optimize them.
Another issue to watch for is a low cache hit rate. Cache hit rates should be 99%+ in high-performing databases. As we deployed to more and more schools and the overall size of our database grew, we simply didn’t have enough RAM to support it. After vertically scaling our database to beefier specs, the hit rate returned to normal and performance increased significantly. To learn more about your database’s cache hit rate, try the pg:diagnose command as well as viewing the Metrics tab.
heroku pg:diagnose -a appname
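If you want to compute the hit rate yourself, it is just the share of block requests served from memory. The sketch below assumes you have already pulled the two counters from Postgres (for tables, the sums of `heap_blks_hit` and `heap_blks_read` in `pg_statio_user_tables`); the function name and the sample numbers are made up for illustration.

```python
def cache_hit_rate(blocks_hit, blocks_read):
    """Fraction of block requests served from cache rather than disk.

    blocks_hit:  blocks found in the buffer cache (e.g. sum of heap_blks_hit)
    blocks_read: blocks that had to be read from disk (e.g. sum of heap_blks_read)
    Returns None when there has been no activity yet.
    """
    total = blocks_hit + blocks_read
    if total == 0:
        return None
    return blocks_hit / total

# Made-up counters: 9,900 cache hits and 100 disk reads.
rate = cache_hit_rate(9_900, 100)
print(f"{rate:.2%}")  # at or above the 99% target
```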
Monitoring your Database
Your database is one of the most important things to monitor. We have set up several monitoring systems at Handshake, along with alerts that notify us of any issues. Some examples of the database metrics we monitor are:
- Overall database load
- Database average query time
- Number of active connections
- SQL query counts, by query type
- Cache hit rate
- Index hit rate
- Blocked query counts
- Database pool percent used
In addition to monitoring, we alert on the more critical items (blocked queries, high query time, and high database pool usage) so we can fix technical issues before they become user issues. Outside of alerting, one great side effect of collecting this information is the ability to go back and view graphs over time. You can very easily see how deploys affect query counts, hit rates, and so on.
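The alerting side boils down to comparing a handful of metrics against thresholds. Here is a rough sketch of that idea, not the setup described above: the metric names and limits are hypothetical, and a real system would feed the result into whatever paging or chat tool you use.

```python
# Hypothetical alert limits for the three critical metrics mentioned above.
ALERT_LIMITS = {
    "blocked_queries": 0,    # any blocked query is worth a look
    "avg_query_ms": 200,     # average query time, in milliseconds
    "pool_used_pct": 80,     # percent of the database connection pool in use
}

def breached_alerts(metrics, limits=ALERT_LIMITS):
    """Return the names of metrics whose current value exceeds its limit."""
    return sorted(
        name for name, limit in limits.items()
        if metrics.get(name, 0) > limit
    )

# Made-up snapshot of current metrics.
snapshot = {"blocked_queries": 2, "avg_query_ms": 45, "pool_used_pct": 91}
print(breached_alerts(snapshot))
```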
Web Dyno Memory
PX Dynos are Dyno-mite
Back in the days of a few hundred users a day, the 2X dynos suited our needs very well. They were cheap, we could scale them up and down quickly, and they had sufficient memory to handle the load we were seeing. As time went on we handled the increase in load by simply increasing the number of dynos, and all was well and good.
Eventually, though, we ran into some issues with memory. With the 2X dynos you are allotted 1 GB of memory, and we were running 2 web processes per dyno, for a total of about 500 MB per process. However, once you factor in your application’s slug size, you are left with very little memory for object allocations. Once you reach the soft memory limit, Heroku starts using swap memory.
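The arithmetic is worth spelling out. In the sketch below, the 1024 MB figure is the 2X dyno's memory, while the 350 MB baseline (what a process occupies before it serves any requests) is a made-up number for illustration; measure your own.

```python
def memory_headroom_mb(dyno_mb, processes, baseline_mb):
    """Memory left per process for request-time allocations, in MB.

    dyno_mb:     total memory on the dyno
    processes:   number of web processes sharing the dyno
    baseline_mb: memory each process uses once booted (hypothetical here)
    A result near or below zero means the dyno will spill into swap.
    """
    return dyno_mb / processes - baseline_mb

print(memory_headroom_mb(1024, 2, 350))  # 162.0 MB to spare per process
print(memory_headroom_mb(1024, 3, 350))  # negative: a third process will not fit
```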
Avoid Swap Memory
Swap memory is dreadfully slow and can have a significant impact on your app’s performance. At one point, when we were running 8-12 2X dynos consistently, we began to explore using 1 or 2 of the new PX dynos rather than a large number of smaller dynos.
We instantly saw performance improvements.
With the PX dynos you get 14GB of memory, which allows us to run up to 30 web processes per dyno. This meant we could consolidate 14+ 2X dynos to just 1 PX dyno. Additionally, PX dynos are single tenant and have better CPU specs. The larger memory pool made it easier for us to scale up more dynos and also prevented us from quickly moving into using swap memory, which had a tremendous impact on response time.
Although the upgrade drastically reduced how often we hit swap memory, it does still happen on occasion. To avoid the slowdown that comes with swap, we set up Neptune.io to automatically restart dynos based on the swap memory output in our logs. This was easy to set up using the Papertrail webhook API, and it prevents dynos from using swap for more than a few seconds, keeping our response time snappy.
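The detection half of that setup is simple enough to sketch. The code below is not Neptune.io's or Papertrail's API; it only shows the idea of scanning log lines for swap usage, and it assumes lines shaped like the output of Heroku's log-runtime-metrics feature (`source=web.1 ... sample#memory_swap=12.50MB`). The function name and the threshold are hypothetical.

```python
import re

# Pulls the dyno name and swap usage out of a runtime-metrics style log line.
SWAP_RE = re.compile(
    r"source=(?P<dyno>\S+).*?sample#memory_swap=(?P<mb>\d+(?:\.\d+)?)MB"
)

def dynos_using_swap(log_lines, max_swap_mb=0.0):
    """Return the sorted names of dynos reporting more swap than max_swap_mb."""
    flagged = set()
    for line in log_lines:
        match = SWAP_RE.search(line)
        if match and float(match.group("mb")) > max_swap_mb:
            flagged.add(match.group("dyno"))
    return sorted(flagged)

# Made-up sample lines in the assumed format.
sample_logs = [
    "source=web.1 dyno=heroku.1.abc sample#memory_total=480.10MB sample#memory_swap=0.00MB",
    "source=web.2 dyno=heroku.1.def sample#memory_total=1040.75MB sample#memory_swap=16.75MB",
    "at=info method=GET path=/ status=200",
]
print(dynos_using_swap(sample_logs))
```

Each flagged dyno could then be restarted individually, for example with `heroku ps:restart web.2 -a appname`.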
Scale Dynos Automatically
As I mentioned above, one big advantage of being deployed on Heroku is the ease of scaling dynos up and down at will. Not only is it extremely simple to manually scale the number of dynos you’re using, but using Heroku gives you access to many addons that can do this for you automatically based on certain factors, making it even easier.
At Handshake we currently use an addon called Adept Scale. Adept Scale gives you the ability to scale dynos up and down automatically based on the average response time, traffic, and other patterns. You can independently configure how quickly dynos scale up and how quickly they scale down.
Configuring Auto Scaling
Despite the incredible ease of use of this addon, there were still some lessons to be learned. You can choose the rate at which you scale the number of dynos up (for example, if response time is above 200 ms for 1 minute, increase the dyno count) and the rate at which you scale them back down. When we first installed the addon we had these numbers a little off, and we noticed a very erratic number-of-dynos graph. We were scaling up and down far too quickly: even the slightest increase in response time would cause us to spin up extra dynos, and we were quick to kill them as soon as that small increase went away.
This may not seem like a huge issue, since more dynos can only be a good thing, but it had a negative effect on both our wallet and our users. Dynos were not scaled down gracefully, which often resulted in more timeouts for the users whose requests were being served by the dynos being killed.
To fix this, we took a number of steps, including decreasing the sensitivity to small response time changes and increasing the window over which the average is sampled. As a result, we now scale dynos up only when we really need the extra web servers, and we keep them around long enough to ensure the increased load has actually subsided.
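To illustrate why a wider sampling window and a slower scale-down calm things down, here is a toy autoscaler. It is not how Adept Scale works internally; the class, its parameters, and the default thresholds are all assumptions made for the sake of the example.

```python
from collections import deque

class SmoothedScaler:
    """Toy autoscaler: acts on a windowed average, with separate cooldowns."""

    def __init__(self, window=5, scale_up_ms=200.0, scale_down_ms=100.0,
                 up_cooldown=1, down_cooldown=10, min_dynos=1, max_dynos=10):
        self.samples = deque(maxlen=window)  # most recent response times
        self.scale_up_ms = scale_up_ms
        self.scale_down_ms = scale_down_ms
        self.up_cooldown = up_cooldown       # ticks to wait before adding a dyno
        self.down_cooldown = down_cooldown   # ticks to wait before removing one
        self.min_dynos = min_dynos
        self.max_dynos = max_dynos
        self.dynos = min_dynos
        # Start "cooled down" so the first decision is not delayed.
        self.ticks_since_change = max(up_cooldown, down_cooldown)

    def observe(self, response_ms):
        """Record one response-time sample and return the new dyno count."""
        self.samples.append(response_ms)
        self.ticks_since_change += 1
        if len(self.samples) < self.samples.maxlen:
            return self.dynos  # not enough data to judge yet
        average = sum(self.samples) / len(self.samples)
        if (average > self.scale_up_ms and self.dynos < self.max_dynos
                and self.ticks_since_change >= self.up_cooldown):
            self.dynos += 1
            self.ticks_since_change = 0
        elif (average < self.scale_down_ms and self.dynos > self.min_dynos
                and self.ticks_since_change >= self.down_cooldown):
            self.dynos -= 1
            self.ticks_since_change = 0
        return self.dynos

# A single slow request in otherwise calm traffic does not add a dyno.
scaler = SmoothedScaler()
for ms in [50, 50, 50, 50, 500]:
    scaler.observe(ms)
print(scaler.dynos)
```

Keeping `down_cooldown` much larger than `up_cooldown` is the design choice that matters here: extra capacity arrives quickly but is only released once the load has stayed low for a while.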
Adept Scale is just one of the many addons available on Heroku. Because addons are so readily available and easy to install, you can explore and use a number of services with little pain. Of course, this ease of use comes with some tradeoffs in terms of control. For example, the Heroku Postgres addon allows you to quickly spin up new databases, upgrade them when needed, and manage backups of those databases, but you lose out on some features, such as the ability to create your own database user roles. While this tradeoff might be small, and is probably worth it for the considerable ease of use, it is something to keep in mind if you need more complex setups of the services these addons provide.
One gotcha with being deployed on Heroku is that they reserve the right to kill dynos at their discretion. This isn’t a problem in most scenarios: the dynos simply restart and everyone goes on their merry way. We did, however, run into issues with some long-running background jobs being killed.
Two lessons to be learned here:
- First, try to avoid long running jobs. Break the tasks into smaller pieces to help prevent them from being killed and also to help prevent long running tasks from backing up your worker queue (if they can be run in parallel, even better!).
- Second, make sure your workers are idempotent. Idempotent workers will have the same effect no matter how many times they are run. This is probably the more important of the two, since even a short-running task has a chance of being killed, and performing work twice can have embarrassing consequences.
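As a minimal sketch of the second lesson, the worker below records which jobs it has finished and skips any it has already done. Everything here is hypothetical: the class name, the welcome-email job, and the in-memory set, which stands in for a durable store such as a database table with a unique key.

```python
class WelcomeEmailWorker:
    """Toy idempotent worker: running a job twice has the same effect as once."""

    def __init__(self):
        self.completed = set()  # stand-in for a durable "already done" record
        self.sent = []          # stand-in for the real side effect (sending mail)

    def perform(self, user_id):
        """Send the welcome email once per user; return True if work was done."""
        key = ("welcome_email", user_id)
        if key in self.completed:
            return False        # a retry after a kill or restart: nothing to do
        self.sent.append(user_id)
        self.completed.add(key)
        return True

worker = WelcomeEmailWorker()
worker.perform(7)
worker.perform(7)   # simulated retry of the same job
print(worker.sent)  # [7]
```

In a real worker, a job can still be killed between performing the side effect and recording it, so the two should be made atomic where possible (for example, in one database transaction).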
Overall, Heroku is an excellent platform for quickly getting a web application up and running. It takes care of infrastructure challenges and decisions for you and lets you focus on building your application. It gives you the ability to monitor and improve performance in important areas with valuable metrics. Through the ‘dyno’ concept, it allows for simple horizontal scaling of your application. Lastly, there are a large number of addons available in its ecosystem to take advantage of.
This is just a small subset of how we scale on Heroku. Watch for more information in the future!
Posted on Medium