Pinboard (jm)
https://pinboard.in/u:jm/public/
recent bookmarks from jmIntroducing Kale « Code as Craft2013-06-17T09:52:14+00:00
http://codeascraft.com/2013/06/11/introducing-kale/
jmat Etsy, we really love to make graphs. We graph everything! Anywhere we can slap a StatsD call, we do. As a result, we’ve found ourselves with over a quarter million distinct metrics. That’s far too many graphs for a team of 150 engineers to watch all day long! And even if you group metrics into dashboards, that’s still an awful lot of dashboards if you want complete coverage. Of course, if a graph isn’t being watched, it might misbehave and no one would know about it. And even if someone caught it, lots of other graphs might be misbehaving in similar ways, and chances are low that folks would make the connection.
We’d like to introduce you to the Kale stack, which is our attempt to fix both of these problems. It consists of two parts: Skyline and Oculus. We first use Skyline to detect anomalous metrics. Then, we search for that metric in Oculus, to see if any other metrics look similar. At that point, we can make an informed diagnosis and hopefully fix the problem.
It'll be interesting to see if they can get this working well. I've found it can be tricky to get working with low false positives, without massive volume to "smooth out" spikes caused by normal activity. Amazon had one particularly successful version driving severity-1 order drop alarms, but it used massive event volumes and still had periodic false positives. Skyline looks like it will alarm on a single anomalous data point, and in the comments Abe notes "our algorithms err on the side of noise and so alerting would be very noisy."]]>etsy monitoring service-metrics alarming deviation correlation data search graphs oculus skyline kale false-positiveshttps://pinboard.in/https://pinboard.in/u:jm/b:540ae39b9e4d/