Pinboard (jm)
https://pinboard.in/u:jm/public/
recent bookmarks from jm

louislam/uptime-kuma
2024-02-26T12:11:22+00:00
https://github.com/louislam/uptime-kuma
monitoring uptime network-monitoring networking ops via:itc via:tristam
https://pinboard.in/u:jm/b:01d82b0fa274/

AWS Reliability Pillar Single-Region scenarios
2023-10-12T13:29:48+00:00
https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/single-region-scenarios.html
via:singer aws reliability architecture availability uptime services ops high-availability
https://pinboard.in/u:jm/b:f9f9b0a1c1c2/

AWS Fault Isolation Boundaries
2022-11-22T12:30:51+00:00
https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/abstract-and-introduction.html?ck_subscriber_id=512829374
aws dependencies uptime reliability iam
https://pinboard.in/u:jm/b:f17c80e57a39/

Roblox 73-hour outage write-up
2022-01-24T10:56:22+00:00
https://blog.roblox.com/2022/01/roblox-return-to-service-10-28-10-31-2021/
The root cause was due to two issues. Enabling a relatively new streaming feature on Consul under unusually high read and write load led to excessive contention and poor performance. In addition, our particular load conditions triggered a pathological performance issue in BoltDB. The open source BoltDB system is used within Consul to manage write-ahead-logs for leader election and data replication.
Also worth noting: 'We are working to move to multiple availability zones and data centers.'
postmortems outages roblox games ops consul boltdb replication uptime
https://pinboard.in/u:jm/b:3f5e3d936d4b/

Using load shedding to avoid overload
2021-10-12T12:10:25+00:00
https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/
error-handling distributed-systems http aws services soa david-yanacek load load-shedding uptime reliability
https://pinboard.in/u:jm/b:c8ea792bb37a/

The Calculus of Service Availability - ACM Queue
2021-02-13T12:34:58+00:00
https://queue.acm.org/detail.cfm?id=3096459
sre availability architecture ops outages maths uptime
https://pinboard.in/u:jm/b:20b49f028384/

SLA & Uptime calculator
2020-11-26T13:43:11+00:00
https://uptime.is/99.9
uptime sla service downtime outages calculators tools
https://pinboard.in/u:jm/b:b3414ca66d59/

Some notes on running new software in production
2018-11-12T11:13:20+00:00
https://jvns.ca/blog/2018/11/11/understand-the-software-you-use-in-production/
reliability uptime slas kubernetes envoy outages runbooks ops
https://pinboard.in/u:jm/b:b09f69a26ad2/

airlift/jvmkill
2018-07-05T09:29:02+00:00
https://github.com/airlift/jvmkill/blob/master/README.md
a simple JVMTI agent that forcibly terminates the JVM when it is unable to allocate memory or create a thread. This is important for reliability purposes: an OutOfMemoryError will often leave the JVM in an inconsistent state. Terminating the JVM will allow it to be restarted by an external process manager.
This is apparently still useful despite the existence of '-XX:+ExitOnOutOfMemoryError' as of Java 8, since that may somehow still fail occasionally.
oom java reliability uptime memory ops
https://pinboard.in/u:jm/b:db2c1986f022/

Checkup
2017-12-18T11:34:39+00:00
https://sourcegraph.github.io/checkup/
go ops monitoring uptime health-checks status-pages status golang s3
https://pinboard.in/u:jm/b:cb7582683d84/

Sorry
2017-05-22T10:25:41+00:00
https://www.sorryapp.com/
banners web status uptime downtime ops reliability
https://pinboard.in/u:jm/b:be40e21a77d0/

A server with 24 years of uptime
2017-01-30T13:45:45+00:00
http://www.computerworld.com/article/3162416/data-center/booted-up-in-1993-this-server-still-runs-but-not-for-much-longer.html
stratus fault-tolerance hardware uptime records ops
https://pinboard.in/u:jm/b:3ac940868596/

Is Pokémon GO down or not?
2016-07-20T10:14:40+00:00
http://ispokemongodownornot.com/
datadog games monitoring pokemon-go pokemon uptime
https://pinboard.in/u:jm/b:a69910d88699/

How Facebook avoids failures
2015-11-02T16:30:59+00:00
http://queue.acm.org/detail.cfm?ref=rss&id=2839461
A "move-fast" mentality does not have to be at odds with reliability. To make these philosophies compatible, Facebook's infrastructure provides safety valves.
This is full of interesting techniques.
* Rapidly deployed configuration changes: Make everybody use a common configuration system; Statically validate configuration changes; Run a canary; Hold on to good configurations; Make it easy to revert.
* Hard dependencies on core services: Cache data from core services. Provide hardened APIs. Run fire drills.
* Increased latency and resource exhaustion: Controlled Delay (based on the anti-bufferbloat CoDel algorithm -- this is really cool); Adaptive LIFO (last-in, first-out) for queue busting; Concurrency Control (essentially a form of circuit breaker).
* Tools that Help Diagnose Failures: High-Density Dashboards with Cubism (horizon charts); What just changed?
* Learning from Failure: the DERP (!) methodology.
ben-maurer facebook reliability algorithms codel circuit-breakers derp failure ops cubism horizon-charts charts dependencies soa microservices uptime deployment configuration change-management
https://pinboard.in/u:jm/b:f84c5f60c6e9/

A collection of postmortems
2015-08-10T16:45:01+00:00
https://github.com/danluu/post-mortems
postmortems ops uptime reliability
https://pinboard.in/u:jm/b:9a079ba2cb35/

Internet Scale Services Checklist
2015-04-21T11:39:07+00:00
https://gist.github.com/acolyer/95ef23802803cb8b4eb5
james-hamilton checklists ops internet-scale architecture operability monitoring reliability availability uptime aspirations
https://pinboard.in/u:jm/b:827aeebe6695/

Yelp Product & Engineering Blog | True Zero Downtime HAProxy Reloads
2015-04-14T14:55:59+00:00
http://engineeringblog.yelp.com/2015/04/true-zero-downtime-haproxy-reloads.html
linux networking hacks yelp haproxy uptime reliability tcp tc qdisc ops
https://pinboard.in/u:jm/b:fe434a846248/

huptime
2015-01-25T23:50:05+00:00
https://github.com/amscanne/huptime
linux ops servers uptime restarting libc bind accept sockets
https://pinboard.in/u:jm/b:8dc351b013f1/

StatusPage.io
2014-11-11T16:49:30+00:00
https://www.statuspage.io/
monitoring server status outages uptime saas infrastructure
https://pinboard.in/u:jm/b:177d01ae1b08/

'If I was your cloud provider, I'd never let you down'
2013-06-25T21:22:59+00:00
http://www.joyent.com/blog/if-i-was-your-cloud-provider-i-d-never-let-you-down
We’ve given our other partners 99.9999% uptime.
This despite a 10-day outage of their BingoDisk and Strongspace storage services in January 2008, 1734 days previously (http://www.datacenterknowledge.com/archives/2008/01/21/joyent-services-back-after-8-day-outage/).
If you assume that is the only outage they've had since then, that works out as 99.4% uptime. Quite a few fewer nines...
joyent marketing uptime two-nines fail strongdisk
https://pinboard.in/u:jm/b:92d8d2854689/
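
The 99.4% figure in the Joyent note can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming (as the note does) that the 10-day outage is the sole downtime in the 1734-day window; the function names are made up for illustration.

```python
def uptime_percent(outage_days: float, window_days: float) -> float:
    """Percentage of the window during which the service was up."""
    return 100.0 * (1.0 - outage_days / window_days)


def allowed_downtime_seconds(target_percent: float, window_days: float) -> float:
    """Downtime budget, in seconds, that an uptime target permits over the window."""
    return (1.0 - target_percent / 100.0) * window_days * 86400


# Figures taken from the note: one 10-day outage, 1734 days before the claim.
achieved = uptime_percent(outage_days=10, window_days=1734)

# What the claimed 99.9999% would have allowed over the same window.
budget = allowed_downtime_seconds(target_percent=99.9999, window_days=1734)

print(f"achieved uptime: {achieved:.1f}%")
print(f"six-nines downtime budget over 1734 days: {budget:.0f} seconds")
```

Over the same window, six nines would allow roughly two and a half minutes of downtime in total, against the 10 days actually lost.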