Pinboard (jm)
https://pinboard.in/u:jm/public/
recent bookmarks from jm

Outages, PostMortems, and Human Error 101
2015-04-04T09:32:24+00:00
http://www.slideshare.net/jallspaw/etsy-codeascraft-allspaw1
jm
devops monitoring ops five-whys allspaw slides etsy codeascraft incident-response incidents severity root-cause postmortems outages reliability techops tier-one-support
https://pinboard.in/u:jm/b:1b513074b706/

Final Root Cause Analysis and Improvement Areas: Nov 18 Azure Storage Service Interruption
2014-12-21T23:18:28+00:00
http://azure.microsoft.com/blog/2014/12/17/final-root-cause-analysis-and-improvement-areas-nov-18-azure-storage-service-interruption/
jm
root-cause azure outages postmortem cloud microsoft deployment
https://pinboard.in/u:jm/b:9e35cdbf126f/

Paper: "Root Cause Detection in a Service-Oriented Architecture" [pdf]
2013-06-17T10:05:53+00:00
http://www.sigmetrics.org/sigmetrics2013/pdfs/p93.pdf
jm
This paper introduces MonitorRank, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures. In the event of an anomaly, MonitorRank provides a ranked order list of possible root causes for monitoring teams to investigate. MonitorRank uses the historical and current time-series metrics of each sensor as its input, along with the call graph generated between sensors to build an unsupervised model for ranking. Experiments on real production outage data from LinkedIn, one of the largest online social networks, shows a 26% to 51% improvement in mean average precision in finding root causes compared to baseline and current state-of-the-art methods.
This is a topic close to my heart after working on something similar for 3 years at Amazon!
Looks interesting, although (a) I would have liked to see more case studies and examples of "real world" outages it helped with; and (b) it's very much a machine-learning paper rather than a systems one, and there is no discussion of fault tolerance in the design of the detection system, which would leave me worried that in the case of a large-scale outage event, the system itself will disappear when its help is most vital. (This was a major design influence on our team's work.)
Overall, particularly given those 2 issues, I suspect it's not in production yet. Ours certainly was ;)
linkedin soa root-cause alarming correlation service-metrics machine-learning graphs monitoring
https://pinboard.in/u:jm/b:3867e176e952/

First 5 Minutes Troubleshooting A Server
2013-03-13T21:36:30+00:00
http://devo.ps/blog/2013/03/06/troubleshooting-5minutes-on-a-yet-unknown-box.html
jm
dstat server io disks hardware performance linux sysadmin ops troubleshooting checklists root-cause
https://pinboard.in/u:jm/b:98a01d0fe721/