Pinboard (jm)
https://pinboard.in/u:jm/public/
Recent bookmarks from jm

Workaround for Istio's graceful-shutdown lifecycle bug (2023-12-19)
https://gist.github.com/mmerickel/a2159c51d7a2486b9ac7057fa6b69139
Tags: istio fail bugs k8s sidecars work service-meshes

AWS ALB returns 503 for Istio enabled pods (2023-10-25)
https://domagalski-j.medium.com/aws-alb-returns-503-for-istio-enabled-pods-a6942383143c
Tags: istio aws alb bugs networking tcp fail k8s

Kepler (2023-07-23)
https://sustainable-computing.io/
Kubernetes Efficient Power Level Exporter (Kepler)
Kepler (Kubernetes-based Efficient Power Level Exporter) is a Prometheus exporter. It uses eBPF to probe CPU performance counters and Linux kernel tracepoints.
These data, along with stats from cgroup and sysfs, can then be fed into ML models to estimate energy consumption by Pods.
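The attribution step can be illustrated with a toy sketch: split a node-level energy reading across pods in proportion to the CPU time each pod consumed over the same interval. This is pure arithmetic for illustration only; Kepler itself feeds the eBPF counter data into trained ML models rather than doing a straight proportional split.

```python
def attribute_energy(node_joules, pod_cpu_seconds):
    """Split a node-level energy reading across pods in proportion to the
    CPU time each consumed over the same interval. A straight proportional
    split for illustration; Kepler uses trained ML models instead."""
    total = sum(pod_cpu_seconds.values())
    if total == 0:
        return {pod: 0.0 for pod in pod_cpu_seconds}
    return {pod: node_joules * cpu / total
            for pod, cpu in pod_cpu_seconds.items()}

# 100 J measured at the node; pod-a did 3 core-seconds, pod-b did 1
print(attribute_energy(100.0, {"pod-a": 3.0, "pod-b": 1.0}))
# {'pod-a': 75.0, 'pod-b': 25.0}
```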
Tags: k8s kubernetes kepler power prometheus ebpf energy

eldadru/ksniff (2023-07-12)
https://github.com/eldadru/ksniff
Tags: debugging kubernetes network networking packet-captures tcpdump wireshark ops k8s eks sniffing kubectl

Istio: 503's with UC's and TCP Fun Times (2023-07-12)
https://karlstoney.com/2019/05/31/istio-503s-ucs-and-tcp-fun-times/
Tags: istio kubernetes k8s eks http tcp timeouts connection-pools networking

Carbon aware temporal shifting of Kubernetes workloads using KEDA (2023-06-06)
https://rossfairbanks.com/2023/06/05/carbon-aware-temporal-shifting-with-keda/
Tags: carbon co2 keda k8s scheduling ops scaling autoscaling microsoft sustainability

You Broke Reddit: The Pi-Day Outage : RedditEng (2023-03-22)
https://www.reddit.com/r/RedditEng/comments/11xx5o0/you_broke_reddit_the_piday_outage/
Tags: k8s kubernetes outages reddit ops post-mortems

Out-of-memory (OOM) in Kubernetes (2022-09-30)
https://mihai-albert.com/2022/02/13/out-of-memory-oom-in-kubernetes-part-2-the-oom-killer-and-application-runtime-implications/
Tags: linux memory oom-killer ooms k8s kubernetes cgroups errors

Bringing emulation into the 21st century (2022-04-25)
https://blog.davetcode.co.uk/post/21st-century-emulator/
An 8080 microprocessor utilising a modern, containerised, microservices-based architecture running on Kubernetes, with frontends for a CP/M test harness and a full implementation of the original Space Invaders arcade machine. The full project can be found as a GitHub organisation (https://github.com/21st-century-emulation) containing ~60 individual repositories, each implementing a single microservice or providing the infrastructure.
Needless to say, this monster runs at approximately 1 kHz instead of the required 2 MHz. A good demo of how deliberately obtuse and inappropriate architectural decisions can really make a mess of things.
Tags: emulation kubernetes satire k8s containers microservices yikes

"FAANG promo committees are killing Kubernetes" (2022-04-07)
https://twitter.com/kantrn/status/1511791378497384454
Promo committees have, for years now, been consistently undervaluing the work of full-time Kubernetes contributors. Or really of open source work more broadly. Attributable revenue has been taking over as one of the most important factors at most companies. And Kubernetes has very little of that. It's happened gradually, and I don't think this was ever an intended outcome, but it's a thing and we have to live with it.
It's too indirect: fixing a bug in kube-apiserver might retain a GCP customer or avoid a costly Apple services outage, but can you put a dollar value on that? How much is CI stability worth? Or community happiness?
And then, on top of it, add the time cost. "FOSS maintainers are overloaded" should not be news to anyone, but now add 20 hours a week of campaigning to other high-level folks to "build buzz" for your work and let me know how that goes.
Tags: k8s open-source google amazon faang work promotions career

The CFS quota container throttling problem (2021-12-19)
https://danluu.com/cgroup-throttling/
Almost all services at Twitter run on Linux with the CFS scheduler, using CFS bandwidth control quota for isolation, with default parameters. The intention is to allow different services to be colocated on the same boxes without having one service's runaway CPU usage impact other services, and to prevent services on empty boxes from taking all of the CPU on the box, resulting in unpredictable performance, which service owners found difficult to reason about before we enabled quotas. The quota mechanism limits the amortized CPU usage of each container, but it doesn't limit how many cores the job can use at any given moment. Instead, if a job "wants to" use more than that many cores over a quota timeslice, it will use more cores than its quota for a short period of time and then get throttled, i.e., basically get put to sleep, in order to keep its amortized core usage below the quota, which is disastrous for tail latency.
Since the vast majority of services at Twitter use thread pools that are much larger than their Mesos core reservation, when jobs have heavy load they end up requesting and then using more cores than their reservation, and then throttling. This causes services that are provisioned based on load test numbers or observed latency under load to over-provision CPU to avoid violating their SLOs. They either have to ask for more CPUs per shard than they actually need, or they have to increase the number of shards they use.
Note that Kubernetes uses CFS to implement CPU quotas by default, too.
In the Twitter thread about this post, a commenter noted: "'By shrinking the CFS period, the worst case time between quota exhaustion causing throttling and the process group being able to run again is reduced proportionately.' Our P99s at a previous gig reduced in line after I petitioned the cloud provider to adjust the setting." This at least seems like a relatively easy setting to tune.
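The throttling mechanics described above can be sketched with a simplified model. Assumptions: the default 100 ms CFS period, enough runnable threads to exhaust the grant each period, and no scheduler overhead; this is an illustration of the mechanism, not a reproduction of the kernel's accounting.

```python
def completion_time_ms(work_core_ms, threads, quota_cores, period_ms=100.0):
    """Wall-clock ms to finish `work_core_ms` of CPU work under CFS
    bandwidth control. Each period grants quota_cores * period_ms of
    core-time; `threads` runnable threads burn that grant in
    grant / threads ms of wall time, then the group is throttled (put
    to sleep) until the next period begins.
    Assumes threads >= quota_cores so the grant can be exhausted."""
    if work_core_ms == 0:
        return 0.0
    grant = quota_cores * period_ms            # core-ms granted per period
    full_periods, remainder = divmod(work_core_ms, grant)
    if remainder == 0:
        # the final grant is burned without waiting out the period
        return (int(full_periods) - 1) * period_ms + grant / threads
    return int(full_periods) * period_ms + remainder / threads

# 10 threads against a 2-core quota: a 400 core-ms burst runs 20 ms,
# sleeps 80 ms (throttled), then runs another 20 ms: 120 ms wall time,
# with an 80 ms window in which any arriving request simply stalls.
print(completion_time_ms(400, threads=10, quota_cores=2))  # 120.0
# 2 threads matching the quota never throttle: 400 / 2 = 200 ms
print(completion_time_ms(400, threads=2, quota_cores=2))   # 200.0
```

The per-period sleep is the tail-latency killer: the oversized thread pool finishes bursts quickly, but for most of each saturated period the whole cgroup is asleep.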
Tags: cgroups kubernetes linux k8s cfs scheduling containers quotas

The 17 Ways to Run Containers on AWS - Last Week in AWS (2021-05-31)
https://www.lastweekinaws.com/blog/the-17-ways-to-run-containers-on-aws/
Tags: containers aws ec2 eks k8s docker architecture

Sidecar injection and transparent traffic hijacking process in Istio explained in detail · Jimmy Song (2021-04-29)
https://jimmysong.io/en/blog/sidecar-injection-iptables-and-traffic-routing/
Tags: kubernetes iptables sidecars istio service-mesh networking k8s eks routing

Feh/nocache (2020-09-23)
https://github.com/Feh/nocache
The nocache tool tries to minimize the effect an application has on the Linux file system cache. This is done by intercepting the open and close system calls and calling posix_fadvise() with the POSIX_FADV_DONTNEED parameter. Because the library remembers which pages (i.e., 4K blocks of the file) were already in the file system cache when the file was opened, those pages will not be marked as "don't need", since other applications might need them even though they are not actively being used (think: hot standby).
Tags: cache linux memory performance filesystems backup k8s unix fadvise

How to monitor Golden signals in Kubernetes (2020-01-09)
https://sysdig.com/blog/golden-signals-kubernetes/
Tags: kubernetes monitoring sysdig golden-data k8s golden-signals metrics latency errors

Tinder’s move to Kubernetes – Tinder Engineering – Medium (2019-04-18)
https://medium.com/@tinder.engineering/tinders-move-to-kubernetes-cda2a6372f44
Tags: kubernetes k8s flannel networking elb aws envoy ec2 ops tinder

Argo Workflows & Pipelines (2019-02-28)
https://argoproj.github.io/argo
Tags: k8s kubernetes docker containers workflow pipelines architecture batch nightly-jobs ops

Productionproofing EKS (2018-11-02)
https://medium.com/@deiwin/productionproofing-eks-ed52951ffd6c
Tags: eks aws docker kubernetes k8s ops prod

pusher/k8s-spot-rescheduler (2018-10-23)
https://github.com/pusher/k8s-spot-rescheduler
K8s Spot Rescheduler is a tool that tries to reduce load on a set of Kubernetes nodes. It was designed with the purpose of moving Pods scheduled on AWS on-demand instances to AWS spot instances, allowing the on-demand instances to be safely scaled down (by the Cluster Autoscaler).
In reality, the rescheduler can be used to remove load from any group of nodes onto a different group; the nodes just need to be labelled appropriately.
For example, it could also be used to allow controller nodes to take up slack while new nodes are being scaled up, and then to reschedule those pods when the new capacity becomes available, reducing the load on the controllers once again.
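The selection step behind this can be sketched in a few lines. This is a toy model; the group names and data shapes here are illustrative, not the tool's actual labels or configuration:

```python
def pods_to_drain(node_groups, pod_nodes, source_group="on-demand"):
    """Pick the pods running on nodes in the source group so they can be
    evicted and rescheduled onto the other group. Toy selection step;
    `node_groups` maps node name -> group label, `pod_nodes` maps pod
    name -> node it runs on. Names are illustrative."""
    source_nodes = {n for n, g in node_groups.items() if g == source_group}
    return sorted(p for p, n in pod_nodes.items() if n in source_nodes)

# Two on-demand nodes and one spot node: pods a and c are candidates
print(pods_to_drain({"n1": "on-demand", "n2": "spot", "n3": "on-demand"},
                    {"a": "n1", "b": "n2", "c": "n3"}))  # ['a', 'c']
```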
Tags: k8s kubernetes aws scaling spot-instances ops

Kubernetes Best Practices // Speaker Deck (2017-07-24)
https://speakerdeck.com/thesandlord/kubernetes-best-practices
Tags: k8s kubernetes devops ops containers docker best-practices tips packaging

The Three Go Landmines (2016-03-16)
https://gist.github.com/lavalamp/4bd23295a9f32706a48f
Tags: k8s go golang errors coding bugs