Pinboard (jm)
https://pinboard.in/u:jm/public/
recent bookmarks from jm

Building dashboards for operational visibility | Amazon Builders' Library
2020-08-09T22:04:28+00:00
https://aws.amazon.com/builders-library/building-dashboards-for-operational-visibility/
jm
dashboards aws monitoring metrics alerts amazon
https://pinboard.in/
https://pinboard.in/u:jm/b:bf2682c8188c/

Actual screenshot of the broken UX of the Hawaii ballistic missile alert system
2018-01-16T15:27:08+00:00
https://twitter.com/CivilBeat/status/953127914618302464
jm
"This is the screen that set off the ballistic missile alert on Saturday. The operator clicked the PACOM (CDW) State Only link. The drill link is the one that was supposed to be clicked."
This is terrible, terrible UX.
ux ui hawaii alerting alerts testing safety fail
https://pinboard.in/
https://pinboard.in/u:jm/b:c281422fb791/

The likely user interface which led to Hawaii's false-alarm incoming-ballistic-missile alert on Saturday 2018-01-13
2018-01-15T11:07:00+00:00
https://twitter.com/supersat/status/952612571122630659
jm
testing ux user-interfaces fail eas hawaii false-alarms alerts nuclear early-warning human-error
https://pinboard.in/
https://pinboard.in/u:jm/b:a4d8ff23c728/

My Philosophy on Alerting
2016-07-14T14:53:47+00:00
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
jm
. Seem pretty reasonable
monitoring sysadmin alerting alerts nagios pager ops sre rob-ewaschuk
https://pinboard.in/
https://pinboard.in/u:jm/b:6c245de62c9b/

Alarm design: From nuclear power to WebOps
2015-11-11T17:31:02+00:00
http://humanisticsystems.com/2015/10/16/fit-for-purpose-questions-about-alarm-system-design-from-theory-and-practice/
jm
Imagine you are an operator in a nuclear power control room. An accident has started to unfold. During the first few minutes, more than 100 alarms go off, and there is no system for suppressing the unimportant signals so that you can concentrate on the significant alarms. Information is not presented clearly; for example, although the pressure and temperature within the reactor coolant system are shown, there is no direct indication that the combination of pressure and temperature mean that the cooling water is turning into steam. There are over 50 alarms lit in the control room, and the computer printer registering alarms is running more than 2 hours behind the events.
This was the basic scenario facing the control room operators during the Three Mile Island (TMI) partial nuclear meltdown in 1979. The Report of the President’s Commission stated that, “Overall, little attention had been paid to the interaction between human beings and machines under the rapidly changing and confusing circumstances of an accident” (p. 11). The TMI control room operator on the day, Craig Faust, recalled for the Commission his reaction to the incessant alarms: “I would have liked to have thrown away the alarm panel. It wasn’t giving us any useful information”. It was the first major illustration of the alarm problem, and the accident triggered a flurry of human factors/ergonomics (HF/E) activity.
A familiar topic for this ex-member of the Amazon network monitoring team...
ergonomics human-factors ui ux alarms alerts alerting three-mile-island nuclear-power safety outages ops
https://pinboard.in/
https://pinboard.in/u:jm/b:1c46b439b712/

Should Airplanes Be Flying Themselves?
2014-11-14T14:57:02+00:00
http://www.vanityfair.com/business/2014/10/air-france-flight-447-crash#
jm
airlines automation flight flying accidents post-mortems af447 air-france autopilot alerts pilots team-leaders clipper-skippers alternate-law
https://pinboard.in/
https://pinboard.in/u:jm/b:404d98e6a267/

Applying cardiac alarm management techniques to your on-call
2014-09-01T09:34:20+00:00
http://fractio.nl/2014/08/26/cardiac-alarms-and-ops/
jm
ops monitoring sysadmin alerts alarms nagios alarm-fatigue false-positives pages
https://pinboard.in/
https://pinboard.in/u:jm/b:89a98e187218/

Dead Man's Snitch
2014-04-08T14:05:41+00:00
https://deadmanssnitch.com/
jm
a cron job monitoring tool that keeps an eye on your periodic processes and notifies you when something doesn't happen. Daily backups, monthly emails, or cron jobs you need to monitor? Dead Man's Snitch has you covered. Know immediately when one of these processes doesn't work.
via Marc.
alerts cron monitoring sysadmin ops backups alarms
https://pinboard.in/
https://pinboard.in/u:jm/b:a15af0101f4f/

The How and Why of Flapjack
2014-01-02T22:38:56+00:00
http://holmwood.id.au/~lindsay/2014/01/03/the-how-and-why-of-flapjack/
jm
Flapjack aims to be a flexible notification system that handles:
Alert routing (determining who should receive alerts based on interest, time of day, scheduled maintenance, etc);
Alert summarisation (with per-user, per media summary thresholds);
Your standard operational tasks (setting scheduled maintenance, acknowledgements, etc).
Flapjack sits downstream of your check execution engine (like Nagios, Sensu, Icinga, or cron), processing events to determine if a problem has been detected, who should know about the problem, and how they should be told.
flapjack notification alerts ops nagios paging sensu
https://pinboard.in/
https://pinboard.in/u:jm/b:a43e8cb847a1/

My Philosophy on Alerting
2013-05-17T17:16:33+00:00
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit#
jm
monitoring ops devops alerting alerts pager-duty via:jk
https://pinboard.in/
https://pinboard.in/u:jm/b:f216b0ed5cb5/