Pinboard (jm)
https://pinboard.in/u:jm/public/
Recent bookmarks from jm

Cache stampede (2022-09-21)
https://en.wikipedia.org/wiki/Cache_stampede
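The mitigation most often paired with the "thundering-herd" and "dogpiling" tags is per-key locking: on a cache miss, only one caller recomputes the value while the rest wait and then re-check. A minimal sketch (the helper names and dict-backed cache are invented for illustration, not from the article):

```python
import threading

cache = {}
cache_guard = threading.Lock()
key_locks = {}

def _lock_for(key):
    # One lock per cache key, so concurrent misses serialize per key
    # instead of blocking the whole cache.
    with cache_guard:
        return key_locks.setdefault(key, threading.Lock())

def cached_get(key, compute):
    # Fast path: a hit never takes the per-key lock.
    if key in cache:
        return cache[key]
    # Miss: one caller recomputes; the rest block here, then re-check.
    with _lock_for(key):
        if key not in cache:  # double-check after acquiring the lock
            cache[key] = compute(key)
        return cache[key]
```

With this in place, eight concurrent misses on the same key trigger exactly one recompute rather than eight.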
tags: caching distcomp caches cache-stampedes thundering-herd dogpiling failures cascading-failures load

Operating Apache Kafka Clusters 24/7 Without A Global Ops Team (2019-10-02)
https://eng.lyft.com/operating-apache-kafka-clusters-24-7-without-a-global-ops-team-417813a5ce70
tags: autoremediation failures ops kafka scalability automation

Don Norman on "Human Error", RISKS Digest Volume 23 Issue 07, 2003 (2018-01-15)
http://catless.ncl.ac.uk/Risks/23.07.html#subj10
It is far too easy to blame people when systems fail. The result is that
over 75% of all accidents are blamed on human error. Wake up people! When
the percentage is that high, it is a signal that something else is at fault
-- namely, the systems are poorly designed from a human point of view. As I
have said many times before (even within these RISKS mailings), if a valve
failed 75% of the time, would you get angry with the valve and simply
continue to replace it? No, you might reconsider the design specs. You would
try to figure out why the valve failed and solve the root cause of the
problem. Maybe it is underspecified, maybe there shouldn't be a valve there,
maybe some change needs to be made in the systems that feed into the valve.
Whatever the cause, you would find it and fix it. The same philosophy must
apply to people.
tags: don-norman ux ui human-interface human-error errors risks comp.risks failures

toxy (2015-08-28)
https://github.com/h2non/toxy#latency
toxy is a fully programmatic and hackable HTTP proxy used to simulate server failure scenarios and unexpected network conditions. It was mainly designed for fuzzing/evil testing, where toxy becomes particularly useful for exercising the fault-tolerance and resiliency capabilities of a system, especially in service-oriented architectures, where toxy may act as an intermediate proxy among services.

toxy allows you to plug in poisons, optionally filtered by rules, which can intercept and alter the HTTP flow as needed, performing multiple evil actions in the middle of that process, such as limiting bandwidth, delaying TCP packets, injecting network jitter latency, or replying with a custom error or status code.
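The poison/rule model described above can be sketched in a few lines of Python. This is an illustration of the concept only, not toxy's actual Node.js API; the dict-based response shape and function names are invented:

```python
def status_poison(code):
    # Poison: rewrite the response with a custom status code.
    def poison(response):
        poisoned = dict(response)
        poisoned["status"] = code
        return poisoned
    return poison

def path_rule(prefix):
    # Rule: only poison traffic whose request path matches the prefix.
    def rule(response):
        return response["path"].startswith(prefix)
    return rule

def apply_poisons(response, registry):
    # Core idea: run each poison whose rule matches, in registration order,
    # so fault injection can be targeted at a subset of the traffic.
    for rule, poison in registry:
        if rule(response):
            response = poison(response)
    return response
```

Registering `(path_rule("/api"), status_poison(503))` then poisons only `/api` traffic while leaving, say, health checks untouched.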
tags: toxy proxies proxy http mitm node.js soa network failures latency slowdown jitter bandwidth tcp

Inside the sad, expensive failure of Google+ (2015-08-05)
http://mashable.com/2015/08/02/google-plus-history/
"It was clear if you looked at the per-user metrics, people weren't posting, weren't returning and weren't really engaging with the product," says one former employee. "Six months in, there started to be a feeling that this isn't really working." Some lay the blame on the top-down structure of the Google+ department and a leadership team that viewed success as the only option for the social network. Failures and disappointing data were not widely discussed. "The belief was that we were always just one weird feature away from the thing taking off," says the same employee.
tags: google google+ failures post-mortems business facebook social-media fail bureaucracy vic-gundotra

Vaurien, the Chaos TCP Proxy — Vaurien 1.8 documentation (2015-02-13)
http://vaurien.readthedocs.org/en/1.8/index.html
Vaurien is basically a Chaos Monkey for your TCP connections. Vaurien acts as a proxy between your application and any backend. You can use it in your functional tests or even on a real deployment through the command line.

Vaurien is a TCP proxy that simply reads data sent to it and passes it to a backend, and vice versa. It has built-in protocols: TCP, HTTP, Redis & Memcache. The TCP protocol is the default one; it just sucks data on both sides and passes it along.

Having higher-level protocols is mandatory in some cases, such as when Vaurien needs to read a specific amount of data from the sockets, or when you need to be aware of the kind of response you're waiting for.

Vaurien also has behaviors. A behavior is a class that's going to be invoked every time Vaurien proxies a request. That's how you can impact the behavior of the proxy. For instance, adding a delay or degrading the response can be implemented in a behavior.

Both protocols and behaviors are plugins, allowing you to extend Vaurien by adding new ones.

Last (but not least), Vaurien provides a couple of APIs you can use to change the behavior of the proxy live. That's handy when you are doing functional tests against your server: you can, for instance, start to add big delays and see how your web application reacts.
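The "behavior invoked on every proxied request" idea can be sketched as a tiny class hierarchy. The class and method names here are invented for illustration and are not Vaurien's plugin API:

```python
import time

class Behavior:
    # Base behavior: invoked for every payload the proxy forwards;
    # the default passes data through unchanged.
    def on_between(self, data):
        return data

class Delay(Behavior):
    # Degrade the connection by sleeping before forwarding.
    def __init__(self, seconds):
        self.seconds = seconds
    def on_between(self, data):
        time.sleep(self.seconds)
        return data

class Corrupt(Behavior):
    # Flip the first byte of the payload to simulate corruption.
    def on_between(self, data):
        if data:
            return bytes([data[0] ^ 0xFF]) + data[1:]
        return data

def proxy_pass(data, behaviors):
    # A Vaurien-style proxy hands every payload through its
    # configured chain of behaviors before forwarding it.
    for behavior in behaviors:
        data = behavior.on_between(data)
    return data
```

Because behaviors compose as a chain, `[Delay(0.5), Corrupt()]` yields a connection that is both slow and lossy, which is exactly the kind of scenario you want your functional tests to survive.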
tags: proxy tcp vaurien chaos-monkey testing functional-testing failures sockets redis memcache http

Poka-yoke (ポカヨケ) (2014-03-06)
http://en.wikipedia.org/wiki/Poka-Yoke
tags: human-error errors mistakes poka-yoke failures prevention bugproofing manufacturing japan

the infamous 2008 S3 single-bit-corruption outage (2013-06-05)
http://status.aws.amazon.com/s3-20080720.html
We've now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether [gossip state] had been corrupted. As a result, when the corruption occurred, we didn't detect it and it spread throughout the system causing the symptoms described above. We hadn't encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.

During our post-mortem analysis we've spent quite a bit of time evaluating what happened, how quickly we were able to respond and recover, and what we could do to prevent other unusual circumstances like this from having system-wide impacts. Here are the actions that we're taking: (a) we've deployed several changes to Amazon S3 that significantly reduce the amount of time required to completely restore system-wide state and restart customer request processing; (b) we've deployed a change to how Amazon S3 gossips about failed servers that reduces the amount of gossip and helps prevent the behavior we experienced on Sunday; (c) we've added additional monitoring and alarming of gossip rates and failures; and, (d) we're adding checksums to proactively detect corruption of system state messages so we can log any such messages and then reject them.
This is why you checksum all the things ;)
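The fix in point (d), checksumming state messages and rejecting any mismatch, can be sketched like this (a hypothetical wire format, not Amazon's; MD5 is used only because the post-mortem mentions it, and it detects accidental single-bit flips even though it is not collision-resistant):

```python
import hashlib

def wrap(payload: bytes) -> bytes:
    # Prepend a 16-byte MD5 digest so the receiver can detect corruption
    # introduced in transit or storage.
    return hashlib.md5(payload).digest() + payload

def unwrap(message: bytes) -> bytes:
    # Recompute the digest over the payload; any bit flip in either the
    # digest or the payload makes them disagree, so the message is rejected
    # instead of silently spreading bad state through the system.
    digest, payload = message[:16], message[16:]
    if hashlib.md5(payload).digest() != digest:
        raise ValueError("checksum mismatch: rejecting corrupt message")
    return payload
```

A single flipped bit in the payload, exactly the failure mode in this outage, now raises at the receiving server rather than propagating via gossip.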
tags: s3 aws post-mortems network outages failures corruption grey-failures amazon gossip

"Call Me Maybe: Carly Rae Jepsen and the Perils of Network Partitions" (2013-05-14)
http://aphyr.com/media/talk.pdf
tags: crdts data-structures storage ricon aphyr failures network partitions puns slides