I’m not really sure what this post will be about yet, other than how effective and interesting I’m finding monitoring services outside what one might think of as the core tools. I’ll mostly be talking about SaaS, but without a doubt it applies to other kinds of software too. You may be thinking that’s obvious, of course you should be monitoring your software, but I’m talking more about monitoring that is unique to your app.
To start, here are what I’d consider the ‘core tools’, all of which we currently use: raw logging (Papertrail), performance monitoring (New Relic, Skylight), downtime alerts (New Relic Synthetics + PagerDuty), and bug tracking (we love Bugsnag).
But there’s much more than just bugs, uptime and logs. Your application is unique, and it has unique measurements that are important to keep a close eye on.
For this sort of stuff we’ve started using Librato. It’s awesome. I’m sure there are other tools out there like it, but we’ve been very happy with what Librato has to offer so far.
Maybe not-so-unique measurements
Here are a few of the measurements we’ve been keeping track of that aren’t so unique to our app, and that other startups should probably be measuring too:
Looks like we have a few activities with very large fan-outs and the rest tend to be low fan-out.
Our enqueued jobs spike during some heavy data syncs, but we’re making progress.
It’s important that our critical jobs run quickly.
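As a rough illustration of how numbers like these might be collected, here is a minimal Python sketch. It is not our actual code and it does not use Librato’s client: `MetricsBuffer`, `timed`, the metric names, and the sample values are all hypothetical, and the buffer simply stands in for whatever client ships measurements to your metrics service.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class MetricsBuffer:
    """Hypothetical in-memory stand-in for a real metrics client."""

    def __init__(self):
        self.measurements = defaultdict(list)

    def gauge(self, name, value):
        # Record one measurement; a real client would batch and send these.
        self.measurements[name].append(value)


@contextmanager
def timed(metrics, name):
    """Record how long the wrapped block took, in milliseconds."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        metrics.gauge(name, elapsed_ms)


metrics = MetricsBuffer()

# Queue depth: sample the number of enqueued jobs (made-up value here).
metrics.gauge("jobs.enqueued", 42)

# Critical job duration: wrap the job body in the timer.
with timed(metrics, "jobs.critical.duration_ms"):
    sum(range(1000))  # placeholder for the real work
```

The `try`/`finally` means a duration is recorded even when the job raises, which is usually when you most want to know how long it ran.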
Unique to You
I’d say the kind of data above is important to most startups, but where Librato and similar services get interesting is when you start measuring anything and everything that is unique to your business. We’re just getting started on this at Handshake, but here are a few examples we have so far:
A simple one is the number of pending duplicate employers in the system. We always want to make sure we are on top of these and merging duplicate employer accounts (looks like we’re a little behind).
We use Elasticsearch for search. We provide very “wide” searching options to our users when searching over users. This means that reindexing users in Elasticsearch can be expensive, so we keep an eye on how long it takes.
Handshake lets users parse their resumes to quickly build their profile. Keeping track of success rates is important (the spike is from a large hackathon we helped run, in which we parsed users’ resumes for them before they logged in).
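To make the success-rate idea concrete, here is a small Python sketch of counting outcomes over one reporting window. The `SuccessRate` class and the sample outcomes are made up for illustration; this is one simple way to derive the number, not a description of how our parser reports it.

```python
from dataclasses import dataclass


@dataclass
class SuccessRate:
    """Hypothetical pair of counters for one reporting window."""

    successes: int = 0
    failures: int = 0

    def record(self, ok):
        # Count each attempt as either a success or a failure.
        if ok:
            self.successes += 1
        else:
            self.failures += 1

    def rate(self):
        total = self.successes + self.failures
        # With no attempts there is no meaningful rate to report.
        return None if total == 0 else self.successes / total


parses = SuccessRate()
for ok in [True, True, False, True]:  # made-up parse outcomes
    parses.record(ok)
```

Reporting the raw counts alongside the rate can help too, since a big jump in volume (like a hackathon) shows up in the counts even when the rate barely moves.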
Last but not least: alerting. Librato lets you set up alerts based on absolute thresholds or relative change.
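The difference between those two kinds of conditions can be sketched in a few lines of Python. This is only my illustration of the idea, not Librato’s implementation or API; the function names, and the choice to treat a zero baseline as a change, are assumptions.

```python
def breaches_threshold(value, limit):
    """Absolute check: fire when the value goes above a fixed limit."""
    return value > limit


def breaches_relative_change(previous, current, max_change):
    """Relative check: fire when the value moved by more than max_change
    (a fraction, e.g. 0.5 for 50%) compared with the previous window."""
    if previous == 0:
        # Assumption: with no baseline, treat any nonzero value as a change.
        return current != 0
    return abs(current - previous) / abs(previous) > max_change
```

An absolute threshold suits things with a known healthy ceiling, like enqueued jobs, while relative change is handy for metrics whose normal level drifts over time.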
We have alerts for a simple heartbeat that is sent every 30 seconds. If we stop receiving the heartbeat, background jobs aren’t being run. We also have alerts for a large number of enqueued jobs, a large number of failed resume parsings, a high login failure rate, and slow user reindexing. We certainly plan to add more.
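The heartbeat check boils down to “has it been too long since the last beat?”. Here is a minimal Python sketch of that logic. The 30 second interval comes from our setup; the grace period of two missed beats and the function name are assumptions I picked for the example, and in practice the alerting service does this check for you.

```python
HEARTBEAT_INTERVAL_S = 30  # one heartbeat every 30 seconds
MISSED_BEATS_ALLOWED = 2   # assumption: tolerate two missed beats before alerting


def heartbeat_is_stale(last_seen_s, now_s,
                       interval_s=HEARTBEAT_INTERVAL_S,
                       missed_allowed=MISSED_BEATS_ALLOWED):
    """Return True when no heartbeat has arrived within the grace period."""
    # The beat that is due now, plus the ones we are willing to miss.
    grace_s = interval_s * (missed_allowed + 1)
    return (now_s - last_seen_s) > grace_s
```

Allowing a couple of missed beats avoids paging someone over a single delayed job, at the cost of noticing a real outage a minute or so later.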
Originally posted on Medium