Pinboard (jm)
https://pinboard.in/u:jm/public/
Recent bookmarks from jm

Roblox 73-hour outage write-up (2022-01-24)
https://blog.roblox.com/2022/01/roblox-return-to-service-10-28-10-31-2021/
The root cause was due to two issues. Enabling a relatively new streaming feature on Consul under unusually high read and write load led to excessive contention and poor performance. In addition, our particular load conditions triggered a pathological performance issue in BoltDB. The open source BoltDB system is used within Consul to manage write-ahead-logs for leader election and data replication.
Also worth noting: 'We are working to move to multiple availability zones and data centers.'

Tags: postmortems outages roblox games ops consul boltdb replication uptime

AWS Post-Event Summaries (2019-09-02)
https://aws.amazon.com/premiumsupport/technology/pes/
Tags: postmortems post-mortems aws ops outages availability

The secret life of DNS packets: investigating complex networks (2019-05-23)
https://stripe.com/gb/blog/secret-life-of-dns
One more surprising detail we discovered in the tcpdump data was that the VPC resolver was not sending back responses to many of the queries. During one of the 60-second collection periods the DNS server sent 257,430 packets to the VPC resolver. The VPC resolver replied back with only 61,385 packets, which averages to 1,023 packets per second. We realized we may be hitting the AWS limit for how much traffic can be sent to a VPC resolver, which is 1,024 packets per second per interface.
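The numbers in that excerpt line up almost exactly with the documented limit; a quick check of the arithmetic (values taken from the post):

```python
# Reproducing the arithmetic from the Stripe excerpt: the VPC resolver's
# reply rate over the 60-second capture sits right at the per-interface cap.
sent = 257_430                 # packets sent to the VPC resolver in 60 s
received = 61_385              # packets the resolver answered
window_s = 60
vpc_resolver_limit_pps = 1024  # AWS limit, per network interface

avg_reply_pps = received / window_s
print(f"{avg_reply_pps:.0f} replies/s vs a limit of {vpc_resolver_limit_pps}")
```

The reply rate (~1,023 pps) being pinned just under the 1,024 pps cap, while the send rate was far higher, is what pointed them at the resolver limit.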
Tags: aws limits ops stripe vpc dns outages postmortems debugging

OVH suffer 24-hour outage (The Register) (2017-07-17)
https://www.theregister.co.uk/2017/07/13/watercooling_leak_killed_vnx_array/
Tags: postmortems ovh outages liquid-cooling datacenters dr disaster-recovery ops

Etsy Debriefing Facilitation Guide (2016-11-18)
https://extfiles.etsy.com/DebriefingFacilitationGuide.pdf
Tags: etsy postmortems blameless ops production debriefing ebooks

Google Cloud Status (2016-04-14)
https://status.cloud.google.com/incident/compute/16007?post-mortem
At 14:50 Pacific Time on April 11th, our engineers removed an unused GCE IP block from our network configuration, and instructed Google’s automated systems to propagate the new configuration across our network. By itself, this sort of change was harmless and had been performed previously without incident. However, on this occasion our network configuration management software detected an inconsistency in the newly supplied configuration. The inconsistency was triggered by a timing quirk in the IP block removal - the IP block had been removed from one configuration file, but this change had not yet propagated to a second configuration file also used in network configuration management. In attempting to resolve this inconsistency the network management software is designed to ‘fail safe’ and revert to its current configuration rather than proceeding with the new configuration. However, in this instance a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network.
One of our core principles at Google is ‘defense in depth’, and Google’s networking systems have a number of safeguards to prevent them from propagating incorrect or invalid configurations in the event of an upstream failure or bug. These safeguards include a canary step where the configuration is deployed at a single site and that site is verified to still be working correctly, and a progressive rollout which makes changes to only a fraction of sites at a time, so that a novel failure can be caught at an early stage before it becomes widespread. In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.
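The crucial detail is that the canary correctly failed, but its verdict never reached the push process, which then treated "no failure reported" as success. A minimal sketch of the fail-safe gate the passage describes — all names here are illustrative, not Google's actual systems — shows why a lost verdict must be treated exactly like a failure:

```python
from enum import Enum

class CanaryVerdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNKNOWN = "unknown"   # e.g. the verdict was lost in transit (the bug here)

def progressive_rollout(sites, new_config, run_canary):
    """Push new_config site by site, but only after an explicit canary PASS."""
    verdict = run_canary(sites[0], new_config)
    if verdict is not CanaryVerdict.PASS:
        # Fail safe: an absent or failed verdict halts the push, keeping
        # the known-good configuration everywhere.
        return []
    return list(sites)    # in reality, a fraction of sites at a time

# A lost verdict is treated the same as an explicit failure:
assert progressive_rollout(["site-a", "site-b"], {}, lambda s, c: CanaryVerdict.UNKNOWN) == []
```

The design point: the gate requires a positive PASS signal rather than the absence of a FAIL signal, so a bug that drops the canary's conclusion stops the rollout instead of green-lighting it.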
Tags: multi-region outages google ops postmortems gce cloud ip networking cascading-failures bugs

Google tears Symantec a new one on its CA failure (2015-10-29)
https://googleonlinesecurity.blogspot.ca/2015/10/sustaining-digital-certificate-security.html
More immediately, we are requesting of Symantec that they further update their public incident report with:

- A post-mortem analysis that details why they did not detect the additional certificates that we found.
- Details of each of the failures to uphold the relevant Baseline Requirements and EV Guidelines and what they believe the individual root cause was for each failure.
We are also requesting that Symantec provide us with a detailed set of steps they will take to correct and prevent each of the identified failures, as well as a timeline for when they expect to complete such work. Symantec may consider this latter information to be confidential and so we are not requesting that this be made public.
Tags: google symantec ev ssl certificates ca security postmortems ops

A collection of postmortems (2015-08-10)
https://github.com/danluu/post-mortems
Tags: postmortems ops uptime reliability

Mikhail Panchenko's thoughts on the July 2015 CircleCI outage (2015-07-21)
http://blog.mihasya.com/2015/07/19/thoughts-evoked-by-circleci-outage.html
Tags: database-is-not-a-queue mysql sql databases ops outages postmortems

Outages, PostMortems, and Human Error 101 (2015-04-04)
http://www.slideshare.net/jallspaw/etsy-codeascraft-allspaw1
Tags: devops monitoring ops five-whys allspaw slides etsy codeascraft incident-response incidents severity root-cause postmortems outages reliability techops tier-one-support

The Infinite Hows, instead of the Five Whys (2014-11-19)
http://radar.oreilly.com/2014/11/the-infinite-hows.html
“Why?” is the wrong question.
In order to learn (which should be the goal of any retrospective or post-hoc investigation) you want multiple and diverse perspectives. You get these by asking people for their own narratives. Effectively, you’re asking “how?“
Asking “why?” too easily gets you to an answer to the question “who?” (which in almost every case is irrelevant) or “takes you to the ‘mysterious’ incentives and motivations people bring into the workplace.”
Asking “how?” gets you to describe (at least some) of the conditions that allowed an event to take place, and provides rich operational data.
Tags: ops five-whys john-allspaw questions postmortems analysis root-causes

Box Tech Blog » A Tale of Postmortems (2014-08-18)
http://tech.blog.box.com/2014/08/a-tale-of-postmortems/
The picture was getting clearer, and we decided to look into individual postmortems and action items and see what was missing. As it was, action items were wasting away with no owners. Digging deeper, we noticed that many action items entailed massive refactorings or vague requirements like “make system X better” (i.e. tasks that realistically were unlikely to be addressed). At a higher level, postmortem discussions often devolved into theoretical debates without a clear outcome. We needed a way to lower and focus the postmortem bar and a better way to categorize our action items and our technical debt.
Out of this need, PIE (“Probability of recurrence * Impact of recurrence * Ease of addressing”) was born. By ranking each factor from 1 (“low”) to 5 (“high”), PIE provided us with two critical improvements:
1. A way to police our postmortems discussions. I.e. a low probability, low impact, hard to implement solution was unlikely to get prioritized and was better suited to a discussion outside the context of the postmortem. Using this ranking helped deflect almost all theoretical discussions.
2. A straightforward way to prioritize our action items.
What’s better is that once we embraced PIE, we also applied it to existing tech debt work. This was critical because we could now prioritize postmortem action items alongside existing work. Postmortem action items became part of normal operations just like any other high-priority work.
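The PIE formula and the 1–5 scale come straight from the post; a minimal sketch of using it to rank a backlog (the action items and scores below are hypothetical):

```python
# PIE = Probability of recurrence * Impact of recurrence * Ease of addressing,
# each ranked 1 ("low") to 5 ("high"), as described in the Box post.
def pie_score(probability: int, impact: int, ease: int) -> int:
    for factor in (probability, impact, ease):
        assert 1 <= factor <= 5, "each PIE factor is ranked 1-5"
    return probability * impact * ease

# Hypothetical action items: (name, probability, impact, ease)
action_items = [
    ("add replica lag alerting", 4, 4, 5),
    ("make system X better",     2, 2, 1),  # vague and hard: scores low
]

# Highest PIE score first -- the prioritisation the post describes.
ranked = sorted(action_items, key=lambda item: pie_score(*item[1:]), reverse=True)
for name, p, i, e in ranked:
    print(f"{pie_score(p, i, e):3d}  {name}")
```

Note how a low-probability, low-impact, hard-to-implement item scores in the single digits, which is exactly the "police the discussion" effect the post claims.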
Tags: postmortems action-items outages ops devops pie metrics ranking refactoring prioritisation tech-debt

Twilio Billing Incident Post-Mortem (2013-07-25)
http://www.twilio.com/blog/2013/07/billing-incident-post-mortem.html
At 1:35 AM PDT on July 18, a loss of network connectivity caused all billing redis-slaves to simultaneously disconnect from the master. This caused all redis-slaves to reconnect and request full synchronization with the master at the same time. Receiving full sync requests from each redis-slave caused the master to suffer extreme load, resulting in performance degradation of the master and timeouts from redis-slaves to redis-master.
By 2:39 AM PDT the host’s load became so extreme, services relying on redis-master began to fail. At 2:42 AM PDT, our monitoring system alerted our on-call engineering team of a failure in the Redis cluster. Observing extreme load on the host, the redis process on redis-master was misdiagnosed as requiring a restart to recover. This caused redis-master to read an incorrect configuration file, which in turn caused Redis to attempt to recover from a non-existent AOF file, instead of the binary snapshot. As a result of that failed recovery, redis-master dropped all balance data. In addition to forcing recovery from a non-existent AOF, an incorrect configuration also caused redis-master to boot as a slave of itself, putting it in read-only mode and preventing the billing system from updating account balances.
See also http://antirez.com/news/60 for antirez' response.
Here are the takeaways I'm getting from it:
1. Network partitions happen in production, and cause cascading failures. This is a great demo of that.
2. Don't store critical data in Redis. As far as I can tell Twilio were only using Redis as a front-line cache for billing data, but it's worth saying anyway. ;)
3. Twilio were just using Redis as a cache, but a bug in their code meant that the writes to the backing SQL store were not being *read*, resulting in repeated billing and customer impact. In other words, it turned a (fragile) cache into the authoritative store.
4. They should probably have designed their code so that write failures would not result in repeated billing for customers -- that's a bad failure path.
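On takeaway 4, the standard defence is to make the billing write path idempotent, so a retried (or mis-read) write cannot charge a customer twice. A hedged sketch — the ledger class and key scheme here are illustrative, not Twilio's actual design:

```python
# Illustrative sketch of an idempotent charge path: a retry with the same
# idempotency key returns the original charge instead of creating a second one.
class BillingLedger:
    def __init__(self):
        self._applied = {}   # idempotency key -> charge record

    def charge(self, idempotency_key: str, account: str, cents: int) -> dict:
        if idempotency_key in self._applied:
            # Already applied: return the original result, do not re-charge.
            return self._applied[idempotency_key]
        record = {"account": account, "cents": cents}
        self._applied[idempotency_key] = record
        return record

ledger = BillingLedger()
first = ledger.charge("call-123", "acct-1", 50)
retry = ledger.charge("call-123", "acct-1", 50)   # failure-path retry
assert first is retry                              # no double billing
```

With this shape, the failure path (retrying a write whose outcome you couldn't read) converges on one charge instead of several.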
Good post-mortem anyway, and I'd say their customers are a good deal happier to see this published, even if it contains details of the mistakes they made along the way.

Tags: redis caching storage networking network-partitions twilio postmortems ops billing replication