Pinboard (jm)

Pinboard (jm) https://pinboard.in/u:jm/public/ recent bookmarks from jm What the hell have you built. 2025-11-06T10:38:19+00:00 https://wthhyb.sacha.house/ jm shitposting funny architecture riak redis mongodb ouch scalability https://pinboard.in/ https://pinboard.in/u:jm/b:0420d47b34a3/ Optimizing Java Apps on Kubernetes 2025-01-23T14:45:35+00:00 https://www.infoq.com/presentations/optimizing-java-app-kubernetes/ jm kubernetes java eks resources ops scaling scalability gc optimization jvm https://pinboard.in/ https://pinboard.in/u:jm/b:2c9522ab8d41/ Distributed Postgres goes full open source with Citus 2022-09-14T11:42:00+00:00 https://www.citusdata.com/blog/2022/09/12/distributed-postgres-goes-full-open-source-with-citus/ jm postgres citus oss scalability infrastructure https://pinboard.in/ https://pinboard.in/u:jm/b:1f345ede6c20/ Key Takeaways from the DynamoDB Paper 2022-08-08T10:21:17+00:00 https://www.alexdebrie.com/posts/dynamodb-paper/ jm scalability scaling dynamodb aws storage services architecture https://pinboard.in/ https://pinboard.in/u:jm/b:3f183f470778/ Apache Helix 2021-07-30T09:37:21+00:00 https://github.com/apache/helix jm zookeeper helix sharding scalability scaling via:kishorebytes partitioning architecture https://pinboard.in/ https://pinboard.in/u:jm/b:558ffe51f061/ Extreme HTTP Performance Tuning: 1.2M API req/s on a 4 vCPU EC2 Instance | talawah.io 2021-05-21T08:54:01+00:00 https://talawah.io/blog/extreme-http-performance-tuning-one-point-two-million/ jm http servers c10k linux performance scalability ops tuning libreactor networking tcp https://pinboard.in/ https://pinboard.in/u:jm/b:022774113a4b/ Operating Apache Kafka Clusters 24/7 Without A Global Ops Team 2019-10-02T10:00:01+00:00 https://eng.lyft.com/operating-apache-kafka-clusters-24-7-without-a-global-ops-team-417813a5ce70 jm autoremediation failures ops kafka scalability automation https://pinboard.in/ https://pinboard.in/u:jm/b:457b3e9b4a93/ Modeling the Mythical Man-Month using 
the Universal Scalability Law 2019-07-16T09:45:03+00:00 http://perfdynamics.blogspot.com/2007/11/modeling-mythical-man-month.html jm usl scalability scaling brooks teams mythical-man-month estimation https://pinboard.in/ https://pinboard.in/u:jm/b:45d2bc17ca47/ The log/event processing pipeline you can't have - apenwarr 2019-02-18T00:19:49+00:00 https://apenwarr.ca/log/20190216 jm Simple things don't break. Our friends on the "let's use structured events to make metrics" team streamed those events straight into a database, and it broke all the time, because databases have configuration options and you inevitably set those options wrong, and it'll fall over under heavy load, and you won't find out until you're right in the middle of an emergency and you really want to see those logs. Or events. logging scalability klog kernel log-processing events embedded ops https://pinboard.in/ https://pinboard.in/u:jm/b:dbdd8c40549f/ _Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes_, SIGMOD '18 2019-01-16T10:08:18+00:00 https://www.dropbox.com/s/47xbjrkni9bx0g3/aurora2.pdf?dl=0 jm One of the more novel differences between Aurora and other relational databases is how it pushes redo processing to a multi-tenant scale-out storage service, purpose-built for Aurora. Doing so reduces networking traffic, avoids checkpoints and crash recovery, enables failovers to replicas without loss of data, and enables fault-tolerant storage that heals without database involvement. Traditional implementations that leverage distributed storage would use distributed consensus algorithms for commits, reads, replication, and membership changes and amplify cost of underlying storage. In this paper, we describe how Aurora avoids distributed consensus under most circumstances by establishing invariants and leveraging local transient state. Doing so improves performance, reduces variability, and lowers costs.
papers toread aurora amazon aws pdf scalability distcomp state sql mysql postgresql distributed-consensus https://pinboard.in/ https://pinboard.in/u:jm/b:db64811fffaf/ awslabs/amazon-kinesis-scaling-utils 2018-12-18T12:16:24+00:00 https://github.com/awslabs/amazon-kinesis-scaling-utils jm The Kinesis Scaling Utility is designed to give you the ability to scale Amazon Kinesis Streams in the same way that you scale EC2 Auto Scaling groups – up or down by a count or as a percentage of the total fleet. You can also simply scale to an exact number of Shards. There is no requirement for you to manage the allocation of the keyspace to Shards when using this API, as it is done automatically. You can also deploy the Web Archive to a Java Application Server, and allow Scaling Utils to automatically manage the number of Shards in the Stream based on the observed PUT or GET rate of the stream. kinesis scaling scalability shards sharding ops https://pinboard.in/ https://pinboard.in/u:jm/b:f4fd1b252af9/ Running high-scale web applications on Amazon EC2 Spot Instances 2018-10-03T16:09:14+00:00 https://aws.amazon.com/blogs/compute/running-high-scale-web-on-spot-instances/ jm appnext spot-instances ec2 scalability aws ops architecture https://pinboard.in/ https://pinboard.in/u:jm/b:e099fd008834/ Rethinking Netflix’s Edge Load Balancing – Netflix TechBlog 2018-10-02T11:43:13+00:00 https://medium.com/netflix-techblog/netflix-edge-load-balancing-695308b5548c jm netflix scaling scalability jsq load-balancing load-balancers algorithms distributed-systems architecture ops https://pinboard.in/ https://pinboard.in/u:jm/b:5e6299bd9bb6/ The problems with DynamoDB Auto Scaling and how it might be improved 2018-07-12T11:56:27+00:00 https://hackernoon.com/the-problems-with-dynamodb-auto-scaling-and-how-it-might-be-improved-a92029c8c10b jm dynamodb autoscaling ops scalability aws scaling capacity https://pinboard.in/ https://pinboard.in/u:jm/b:6d475063c650/ Locking, Little's Law, and the
USL 2017-09-20T14:36:55+00:00 https://groups.google.com/forum/#!msg/mechanical-sympathy/gchG_oQ_kQM/59BDMOdUAwAJ jm Little's law can be used to describe a system in steady state from a queuing perspective, i.e. arrival and leaving rates are balanced. In this case it is a crude way of modelling a system with a contention percentage of 100% under Amdahl's law, in that throughput is one over latency. However this is an inaccurate way to model a system with locks. Amdahl's law does not account for coherence costs. For example, if you wrote a microbenchmark with a single thread to measure the lock cost then it is much lower than in a multi-threaded environment where cache coherence, other OS costs such as scheduling, and lock implementations need to be considered. Universal Scalability Law (USL) accounts for both the contention and the coherence costs. http://www.perfdynamics.com/Manifesto/USLscalability.html When modelling locks it is necessary to consider how contention and coherence costs vary given how they can be implemented. Consider in Java how we have biased locking, thin locks, fat locks, inflation, and revoking biases which can cause safe points that bring all threads in the JVM to a stop with a significant coherence component. usl scaling scalability performance locking locks java jvm amdahls-law littles-law system-dynamics modelling systems caching threads schedulers contention https://pinboard.in/ https://pinboard.in/u:jm/b:d64fb1279a0b/ usl4j And You | codahale.com 2017-06-01T10:08:29+00:00 https://codahale.com/usl4j-and-you/ jm usl scalability java performance optimization benchmarking measurement ops coda-hale https://pinboard.in/ https://pinboard.in/u:jm/b:c184c035e80a/ Scaling Amazon Aurora at ticketea 2017-05-29T16:20:46+00:00 https://engineering.ticketea.com/scaling-amazon-aurora-at-ticketea/?__s=gf36pf8g1gjugcqh6ppo jm Ticketing is a business in which extreme traffic spikes are the norm, rather than the exception.
For Ticketea, this means that our traffic can increase by a factor of 60x in a matter of seconds. This usually happens when big events (which have a fixed, pre-announced 'sale start time') go on sale. scaling scalability ops aws aurora autoscaling asg https://pinboard.in/ https://pinboard.in/u:jm/b:78ee8d992f0b/ Learn redis the hard way (in production) · trivago techblog 2017-03-30T10:01:30+00:00 http://tech.trivago.com/2017/01/25/learn-redis-the-hard-way-in-production/ jm redis scalability ops architecture horror trivago php https://pinboard.in/ https://pinboard.in/u:jm/b:86839e5457c2/ Cherami: Uber Engineering’s Durable and Scalable Task Queue in Go - Uber Engineering Blog 2016-12-14T11:21:39+00:00 https://eng.uber.com/cherami/ jm a competing-consumer messaging queue that is durable, fault-tolerant, highly available and scalable. We achieve durability and fault-tolerance by replicating messages across storage hosts, and high availability by leveraging the append-only property of messaging queues and choosing eventual consistency as our basic model. Cherami is also scalable, as the design does not have single bottleneck. [...] Cherami is completely written in Go, a language that makes building highly performant and concurrent system software a lot of fun. Additionally, Cherami uses several libraries that Uber has already open sourced: TChannel for RPC and Ringpop for health checking and group membership. Cherami depends on several third-party open source technologies: Cassandra for metadata storage, RocksDB for message storage, and many other third-party Go packages that are available on GitHub. We plan to open source Cherami in the near future.
cherami uber queueing tasks queues architecture scalability go cassandra rocksdb https://pinboard.in/ https://pinboard.in/u:jm/b:08be41dbc892/ Auto scaling Pinterest 2016-12-02T17:44:44+00:00 https://engineering.pinterest.com/blog/auto-scaling-pinterest jm spot-instances scaling aws scalability ops architecture pinterest via:highscalability https://pinboard.in/ https://pinboard.in/u:jm/b:f049baec271a/ "Solving Imaginary Scaling Issues At Scale — Getting the wrong idea from that conference talk you attended" 2016-11-22T21:28:57+00:00 https://twitter.com/frontstack/status/800889593855737856 jm Chapter 1: Databases with cool-sounding names. Chapter 2: using BitTorrent for everything. Chapter 3: forget Torrents. Use the blockchain for everything. Chapter 4: sharding the database before adding any indexes. Chapter 5: upgrading to faster processors without checking if you're limited by disk I/O. Chapter 6: rewriting APIs in C for speed without compressing data on the wire. Chapter 7: putting large blobs of binary data into SQL databases for fun and profit. Chapter 8: using protobufs to poll 300 times per second. Chapter 9: diagnose scaling issues by grepping 10 lines of code and guessing. Chapter 10: putting Varnish in front of everything just in case. Chapter 11: buying boxes with gigantic amounts of RAM. Chapter 12: realizing your HAProxy box is still a micro instance. Chapter 13: rewriting 3 of 10 features in Go and declaring victory. Chapter 14: split everything into 35 microservices all maintained by 1 person. Chapter 15: 300% performance boosts by deleting data validity checks. Chapter 16: minifying the JS of your O(n^3) to-do list. Chapter 17: Fuck It, Let's Try Erlang. Chapter 18: Blaming Everything On The Last Person To Quit. Chapter 19: A Bloom Filter Will Definitely Fix This. Chapter 20: Move all client-side processing to the server and/or vice-versa. Chapter 21: Putting A Node.js Proxy In Front Of Our COBOL Backend Will Definitely Improve Matters.
Chapter 22: A Type-Checked Transpilation Step Will Surely Speed Things Up. Chapter 23: Writing A New Language Almost The Same As Your Old Language But Faster (guest chapter by Facebook). Chapter 24: Replacing an SQL DB with a NoSQL DB then implementing SQL in your ORM. Chapter 25: Migrating From Bare Metal To The Cloud Or Vice-Versa, Whichever You're Not Currently Doing. Chapter 26: Putting everything behind a CDN except the slow, complicated parts. Chapter 27: Applying distributed map-reduce to less than 1 gigabyte of data. Chapter 28: Running exactly the same software, but in Docker. Chapter 29: Machine learning: how it will magically fix your crappy code. Chapter 30: Blaming your package manager for slow run-time performance. Chapter 31: Moving processing from the CPU to the GPU without changing the algorithm. Chapter 32: Switching To Heroku Or Away From Heroku Or A Hybrid Heroku-AWS model, whichever sounds the most fun. Chapter 33: Loading all your dependencies from somebody else's github repo. Chapter 34: optimizing your PNGs while hosting 300MB video ads. Chapter 35: hosting your database in memory and your images on S3. 
scalability funny lol twitter oreilly https://pinboard.in/ https://pinboard.in/u:jm/b:d6174c4d603d/ Service discovery at Stripe 2016-11-01T12:20:50+00:00 https://stripe.com/blog/service-discovery-at-stripe jm consul api microservices service-discovery dns load-balancing l7 tcp distcomp smartstack stripe cap-theorem scalability https://pinboard.in/ https://pinboard.in/u:jm/b:c7313c149028/ Kafka Streams - Scaling up or down 2016-10-13T10:58:32+00:00 http://aseigneurin.github.io/2016/10/07/kafka-streams-scaling-up-or-down.html jm scaling scalability architecture kafka streams ops https://pinboard.in/ https://pinboard.in/u:jm/b:dc796f3ac598/ How to Quantify Scalability 2016-09-26T10:00:52+00:00 http://www.perfdynamics.com/Manifesto/USLscalability.html jm usl performance scalability concurrency capacity measurement excel equations metrics https://pinboard.in/ https://pinboard.in/u:jm/b:ab7e86dd0bb0/ Hashed Wheel Timer 2016-03-29T12:02:43+00:00 https://github.com/ifesdjeen/hashed-wheel-timer jm scalability java timers hashed-wheel-timers algorithms data-structures https://pinboard.in/ https://pinboard.in/u:jm/b:6139ec69af2a/ Topics in High-Performance Messaging 2015-12-02T15:53:00+00:00 https://www.informatica.com/downloads/1568_high_perf_messaging_wp/Topics-in-High-Performance-Messaging.htm jm messaging scalability scaling performance udp tcp protocols multicast latency https://pinboard.in/ https://pinboard.in/u:jm/b:aef1848d9376/ John Nagle on delayed ACKs and his algorithm 2015-11-22T22:18:18+00:00 https://news.ycombinator.com/item?id=10608356 jm networking performance scalability nagle tcp ip https://pinboard.in/ https://pinboard.in/u:jm/b:fe312c2af198/ SuperChief: From Apache Storm to In-House Distributed Stream Processing 2015-10-12T09:29:22+00:00 http://blog.librato.com/posts/superchief jm Storm has been successful at Librato, but we experienced many of the limitations cited in the Twitter Heron: Stream Processing at Scale paper and outlined here by Adrian
Colyer, including: Inability to isolate, reason about, or debug performance issues due to the worker/executor/task paradigm. This led to building and configuring clusters specifically designed to attempt to mitigate these problems (i.e., separate clusters per topology, only running a worker per server.), which added additional complexity to development and operations and also led to over-provisioning. Ability of tasks to move around led to difficult to trace performance problems. Storm’s work provisioning logic led to some tasks serving more Kafka partitions than others. This in turn created latency and performance issues that were difficult to reason about. The initial solution was to over-provision in an attempt to get a better hashing/balancing of work, but eventually we just replaced the work allocation logic. Due to Storm’s architecture, it was very difficult to get a stack trace or heap dump because the processes that managed workers (Storm supervisor) would often forcefully kill a Java process while it was being investigated in this way. The propensity for unexpected and subsequently unhandled exceptions to take down an entire worker led to additional defensive verbose error handling everywhere. This nasty bug STORM-404 coupled with the aforementioned fact that a single exception can take down a worker led to several cascading failures in production, taking down entire topologies until we upgraded to 0.9.4. Additionally, we found the performance we were getting from Storm for the amount of money we were spending on infrastructure was not in line with our expectations. Much of this is due to the fact that, depending upon how your topology is designed, a single tuple may make multiple hops across JVMs, and this is very expensive. For example, in our time series aggregation topologies a single tuple may be serialized/deserialized and shipped across the wire 3-4 times as it progresses through the processing pipeline. 
scalability storm kafka librato architecture heron ops https://pinboard.in/ https://pinboard.in/u:jm/b:b175a6749098/ Uber Goes Unconventional: Using Driver Phones as a Backup Datacenter - High Scalability 2015-09-23T21:54:42+00:00 http://highscalability.com/blog/2015/9/21/uber-goes-unconventional-using-driver-phones-as-a-backup-dat.html jm scalability failover multi-dc uber replication state crdts https://pinboard.in/ https://pinboard.in/u:jm/b:400d153ebfed/ You're probably wrong about caching 2015-09-07T10:49:36+00:00 https://msol.io/blog/tech/2015/09/05/youre-probably-wrong-about-caching/ jm architecture caching coding design caches ops production scalability https://pinboard.in/ https://pinboard.in/u:jm/b:580ce012d9e2/ What does it take to make Google work at scale? [slides] 2015-08-31T11:29:51+00:00 https://docs.google.com/presentation/d/1OvJStE8aohGeI3y5BcYX8bBHwoHYCPu99A3KTTZElr0/edit#slide=id.gb74341dde_1_31 jm google architecture slides scalability bigtable spanner facebook gfs storage https://pinboard.in/ https://pinboard.in/u:jm/b:ee623f8402ef/ Patrick Shuff - Building A Billion User Load Balancer - SCALE 13x - YouTube 2015-06-22T09:50:27+00:00 https://www.youtube.com/watch?v=MKgJeqF1DHw jm facebook video talks lbs load-balancing http https scalability scale linux https://pinboard.in/ https://pinboard.in/u:jm/b:2ad80cce86ff/ Discretized Streams: Fault Tolerant Stream Computing at Scale 2015-06-19T07:47:04+00:00 http://blog.acolyer.org/2015/06/19/discretized-streams-fault-tolerant-stream-computing-at-scale/ jm we use a data structure called Resilient Distributed Datasets (RDDs), which keeps data in memory and can recover it without replication by tracking the lineage graph of operations that were used to build it. With RDDs, we show that we can attain sub-second end-to-end latencies. We believe that this is sufficient for many real-world big data applications, where the timescale of the events tracked (e.g., trends in social media) is much higher.
rdd spark streaming fault-tolerance batch distcomp papers big-data scalability https://pinboard.in/ https://pinboard.in/u:jm/b:561d8372a2de/ Leveraging AWS to Build a Scalable Data Pipeline 2015-06-14T21:22:02+00:00 http://highscalability.com/blog/2015/6/8/leveraging-aws-to-build-a-scalable-data-pipeline.html jm sqs aws ec2 auto-scaling asg worker-pools architecture scalability https://pinboard.in/ https://pinboard.in/u:jm/b:83cb65158dca/ Elements of Scale: Composing and Scaling Data Platforms 2015-05-25T15:58:46+00:00 http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/ jm architecture storage databases data big-data scaling scalability ben-stopford cqrs druid parquet columnar-stores lambda-architecture https://pinboard.in/ https://pinboard.in/u:jm/b:50954c7dd941/ Why Loggly loves Apache Kafka 2015-05-06T11:19:20+00:00 http://www.developer-tech.com/news/2014/jun/10/why-loggly-loves-apache-kafka-how-unbreakable-infinitely-scalable-messaging-makes-log-management-better/ jm scalability logging loggly kafka queueing ops reliabilty https://pinboard.in/ https://pinboard.in/u:jm/b:61df5b7109a3/ ferd.ca -> Lessons Learned while Working on Large-Scale Server Software 2015-04-22T15:26:07+00:00 http://ferd.ca/lessons-learned-while-working-on-large-scale-server-software.html jm distributed scalability systems coding server-side erlang devops networking reliability https://pinboard.in/ https://pinboard.in/u:jm/b:4b4817db08ed/ How We Scale VividCortex's Backend Systems - High Scalability 2015-03-30T16:55:14+00:00 http://highscalability.com/blog/2015/3/30/how-we-scale-vividcortexs-backend-systems.html jm time-series tsd storage mysql sql baron-schwartz ops performance scalability scaling go https://pinboard.in/ https://pinboard.in/u:jm/b:fe014fc1ee1b/ Services Engineering Reading List 2015-03-03T10:37:29+00:00 https://github.com/mmcgrana/services-engineering jm architecture papers reading reliability scalability articles to-read
https://pinboard.in/ https://pinboard.in/u:jm/b:2db5b491b523/ Are you better off running your big-data batch system off your laptop? 2015-01-17T21:33:33+00:00 http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html jm Here are two helpful guidelines (for largely disjoint populations): If you are going to use a big data system for yourself, see if it is faster than your laptop. If you are going to build a big data system for others, see that it is faster than my laptop. [...] We think everyone should have to do this, because it leads to better systems and better research. graph coding hadoop spark giraph graph-processing hardware scalability big-data batch algorithms pagerank https://pinboard.in/ https://pinboard.in/u:jm/b:229db78fb862/ Doing Constant Work to Avoid Failures 2014-11-07T15:09:13+00:00 http://www.awsarchitectureblog.com/2014/06/constant-work.html jm scalability scaling architecture aws route53 via:brianscanlan overload constant-load loading https://pinboard.in/ https://pinboard.in/u:jm/b:b94dca788ad3/ Carbon vs Megacarbon and Roadmap ? · Issue #235 · graphite-project/carbon 2014-10-29T11:59:07+00:00 https://github.com/graphite-project/carbon/issues/235 jm Carbon is a great idea, but fundamentally, twisted doesn't do what carbon-relay or carbon-aggregator were built to do when hit with sustained and heavy throughput. Much to my chagrin, concurrency isn't one of python's core competencies. +1, sadly.
We are patching around the edges with half-released third-party C rewrites in our graphite setup, as we exceed the scale Carbon can support. carbon graphite metrics ops python twisted scalability https://pinboard.in/ https://pinboard.in/u:jm/b:c81f1dde8791/ On-Demand Jenkins Slaves With Amazon EC2 2014-08-29T23:01:15+00:00 http://artsy.github.io/blog/2012/07/10/on-demand-jenkins-slaves-with-amazon-ec2/ jm testing jenkins ec2 spot-instances scalability auto-scaling ops build https://pinboard.in/ https://pinboard.in/u:jm/b:49ce775cc829/ Auto Scale DynamoDB With Dynamic DynamoDB 2014-07-22T13:20:58+00:00 http://aws.amazon.com/blogs/aws/auto-scale-dynamodb-with-dynamic-dynamodb/ jm dynamodb autoscaling scalability provisioning aws ec2 cloudformation https://pinboard.in/ https://pinboard.in/u:jm/b:c15cbca57e7c/ Google's Influential Papers for 2013 2014-07-09T16:40:48+00:00 http://googleresearch.blogspot.ie/2014/06/influential-papers-for-2013.html jm Googlers across the company actively engage with the scientific community by publishing technical papers, contributing open-source packages, working on standards, introducing new APIs and tools, giving talks and presentations, participating in ongoing technical debates, and much more. Our publications offer technical and algorithmic advances, feature aspects we learn as we develop novel products and services, and shed light on some of the technical challenges we face at Google. Below are some of the especially influential papers co-authored by Googlers in 2013.
google papers toread reading 2013 scalability machine-learning algorithms https://pinboard.in/ https://pinboard.in/u:jm/b:c2e0e542b7ca/ Google Replaces MapReduce With New Hyper-Scale Cloud Analytics System 2014-06-26T12:42:20+00:00 http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/ jm “We don’t really use MapReduce anymore,” [Urs] Hölzle said in his keynote presentation at the Google I/O conference in San Francisco Wednesday. The company stopped using the system “years ago.” Cloud Dataflow, which Google will also offer as a service for developers using its cloud platform, does not have the scaling restrictions of MapReduce. “Cloud Dataflow is the result of over a decade of experience in analytics,” Hölzle said. “It will run faster and scale better than pretty much any other system out there.” Gossip on the mech-sympathy list says that 'seems that the new platform taking over is a combination of FlumeJava and MillWheel: http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf , http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41378.pdf' map-reduce google hadoop cloud-dataflow scalability big-data urs-holzle google-io https://pinboard.in/ https://pinboard.in/u:jm/b:a9ddb55ae3e1/ Shutterbits replacing hardware load balancers with local BGP daemons and anycast 2014-05-29T10:07:07+00:00 http://bits.shutterstock.com/2014/05/22/stop-buying-load-balancers-and-start-controlling-your-traffic-flow-with-software/ jm scalability networking performance load-balancing bgp exabgp ospf anycast routing datacenters scaling vips juniper haproxy shutterstock https://pinboard.in/ https://pinboard.in/u:jm/b:55674ebcedb2/ Spark Streaming 2014-05-16T21:35:38+00:00 http://spark.apache.org/docs/latest/streaming-programming-guide.html#overview jm an extension of the core Spark API that enables high-throughput, fault-tolerant stream processing of live data streams.
Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ or plain old TCP sockets and be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s in-built machine learning algorithms, and graph processing algorithms on data streams. spark streams stream-processing cep scalability apache machine-learning graphs https://pinboard.in/ https://pinboard.in/u:jm/b:62c1e3c0e756/ Why Disqus made the Python->Go switchover 2014-05-08T13:34:52+00:00 https://news.ycombinator.com/item?id=7711974 jm at higher contention, the CPU was choking everything. Switching over to Go removed that contention for us, which was the primary issue that we were seeing. python languages concurrency go threading gevent scalability disqus realtime hn https://pinboard.in/ https://pinboard.in/u:jm/b:7735324765d8/ An analysis of Facebook photo caching 2014-05-07T12:53:52+00:00 https://code.facebook.com/posts/220956754772273/an-analysis-of-facebook-photo-caching/ jm via:fanf caching facebook architecture photos images cache fifo lru scalability https://pinboard.in/ https://pinboard.in/u:jm/b:3b5a7ad7f689/ Scalable Atomic Visibility with RAMP Transactions 2014-04-10T20:55:17+00:00 http://www.bailis.org/blog/scalable-atomic-visibility-with-ramp-transactions/ jm We’ve developed three new algorithms—called Read Atomic Multi-Partition (RAMP) Transactions—for ensuring atomic visibility in partitioned (sharded) databases: either all of a transaction’s updates are observed, or none are. [...] How they work: RAMP transactions allow readers and writers to proceed concurrently. Operations race, but readers autonomously detect the races and repair any non-atomic reads. The write protocol ensures readers never stall waiting for writes to arrive.
Why they scale: Clients can’t cause other clients to stall (via synchronization independence) and clients only have to contact the servers responsible for items in their transactions (via partition independence). As a consequence, there’s no mutual exclusion or synchronous coordination across servers. The end result: RAMP transactions outperform existing approaches across a variety of workloads, and, for a workload of 95% reads, RAMP transactions scale to over 7 million ops/second on 100 servers at less than 5% overhead. scale synchronization databases distcomp distributed ramp transactions scalability peter-bailis protocols sharding concurrency atomic partitions https://pinboard.in/ https://pinboard.in/u:jm/b:bb652343d9e6/ 'Scaling to Millions of Simultaneous Connections' [pdf] 2014-02-20T14:24:00+00:00 http://www.erlang-factory.com/upload/presentations/558/efsf2012-whatsapp-scaling.pdf jm erlang scaling scalability performance whatsapp freebsd presentations https://pinboard.in/ https://pinboard.in/u:jm/b:67245dcffadb/ Little’s Law, Scalability and Fault Tolerance: The OS is your bottleneck. What you can do? 2014-02-05T17:35:26+00:00 http://highscalability.com/blog/2014/2/5/littles-law-scalability-and-fault-tolerance-the-os-is-your-b.html jm jvm java quasar pulsar comsat littles-law scalability async erlang https://pinboard.in/ https://pinboard.in/u:jm/b:bb3e77510a90/ Extending graphite’s mileage 2014-01-27T10:35:04+00:00 http://www.inmobi.com/blog/2014/01/24/extending-graphites-mileage jm The carbon server is now able to run without breaking a sweat even when 500K metrics per minute is being pumped into it. This has been in production since late August 2013 in every datacenter that we operate from. Very nice.
I hope this gets merged/supported. graphite scalability metrics leveldb storage inmobi whisper carbon open-source https://pinboard.in/ https://pinboard.in/u:jm/b:8df674ec27ce/ Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask 2013-10-21T16:32:15+00:00 http://sigops.org/sosp/sosp13/papers/p33-david.pdf jm synchronization scalability cpus hardware papers via:fanf multicore cas https://pinboard.in/ https://pinboard.in/u:jm/b:f3c2f37df2b0/ Non-blocking transactional atomicity 2013-10-07T21:01:01+00:00 http://www.bailis.org/blog/non-blocking-transactional-atomicity/ jm algorithms database distributed scalability storage peter-bailis distcomp https://pinboard.in/ https://pinboard.in/u:jm/b:b97a35baf620/ Why Tellybug moved from Cassandra to Amazon DynamoDB 2013-10-02T12:55:23+00:00 http://attentionshard.wordpress.com/2013/09/30/why-tellybug-moved-from-cassandra-to-amazon-dynamodb/ jm aws dynamodb cassandra nosql storage tellybug counters scalability reliability latency https://pinboard.in/ https://pinboard.in/u:jm/b:8b0a38474b92/ Behind the Screens at Loggly 2013-09-09T21:11:34+00:00 http://www.loggly.com/behind-the-screens/ jm boost scalability loggly logging ingestion cep stream-processing kafka storm architecture elasticsearch https://pinboard.in/ https://pinboard.in/u:jm/b:6dca6cd9245d/ _MillWheel: Fault-Tolerant Stream Processing at Internet Scale_ [paper, pdf] 2013-08-29T23:13:55+00:00 http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p734-akidau.pdf jm MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework’s fault-tolerance guarantees. This paper describes MillWheel’s programming model as well as its implementation.
The case study of a continuous anomaly detector in use at Google serves to motivate how many of MillWheel’s features are used. MillWheel’s programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. In practice, we find that MillWheel’s unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google. millwheel google data-processing cep low-latency fault-tolerance scalability papers event-processing stream-processing https://pinboard.in/ https://pinboard.in/u:jm/b:a3c789df54bc/ New Tweets per second record, and how | Twitter Blog 2013-08-17T08:21:13+00:00 https://blog.twitter.com/2013/new-tweets-per-second-record-and-how jm twitter performance scalability jvm ruby soa scaling https://pinboard.in/ https://pinboard.in/u:jm/b:61fa933c4f21/ Building a panopticon: The evolution of the NSA’s XKeyscore 2013-08-09T14:10:18+00:00 http://arstechnica.com/information-technology/2013/08/building-a-panopticon-the-evolution-of-the-nsas-xkeyscore/ jm panopticon xkeyscore nsa architecture scalability packet-capture narus sniffing snooping interception lawful-interception li tapping https://pinboard.in/ https://pinboard.in/u:jm/b:dd0ed3afe027/ The Architecture Twitter Uses to Deal with 150M Active Users, 300K QPS, a 22 MB/S Firehose, and Send Tweets in Under 5 Seconds 2013-07-09T09:01:05+00:00 http://highscalability.com/blog/2013/7/8/the-architecture-twitter-uses-to-deal-with-150m-active-users.html jm Twitter is primarily a consumption mechanism, not a production mechanism. 300K QPS are spent reading timelines and only 6000 requests per second are spent on writes. * their approach of precomputing the timeline for the non-search case is a good example of optimizing for the more frequently-exercised path. * MySQL and Redis are the underlying stores.
Redis is acting as a front-line in-RAM cache. they're pretty happy with it: https://news.ycombinator.com/item?id=6011254 * these further talks go into more detail, apparently (haven't watched them yet): http://www.infoq.com/presentations/Real-Time-Delivery-Twitter http://www.infoq.com/presentations/Twitter-Timeline-Scalability http://www.infoq.com/presentations/Timelines-Twitter * funny thread of comments on HN, from a big-iron fan: https://news.ycombinator.com/item?id=6008228 scale architecture scalability twitter high-scalability redis mysql https://pinboard.in/ https://pinboard.in/u:jm/b:5bddc42e545c/ Facebook announce Wormhole 2013-06-26T09:38:27+00:00 https://www.facebook.com/notes/facebook-engineering/wormhole-pubsub-system-moving-data-through-space-and-time/10151504075843920 jm Over the last couple of years, we have built and deployed a reliable publish-subscribe system called Wormhole. Wormhole has become a critical part of Facebook's software infrastructure. At a high level, Wormhole propagates changes issued in one system to all systems that need to reflect those changes – within and across data centers. Facebook's Kafka-alike, basically, although with some additional low-latency guarantees. FB appear to be using it for multi-region and multi-AZ replication. Proprietary. pub-sub scalability facebook realtime low-latency multi-region replication multi-az wormhole https://pinboard.in/ https://pinboard.in/u:jm/b:c16235547374/ Building a Modern Website for Scale (QCon NY 2013) [slides] 2013-06-17T10:37:00+00:00 http://www.slideshare.net/r39132/q-con-ny2013modernwebsitescalabilityfinal-22989785 jm gc-scout gc java scaling scalability linkedin qcon async threadpools rest slas timeouts networking distcomp netty tcp udp failover fault-tolerance packet-loss https://pinboard.in/ https://pinboard.in/u:jm/b:8766348f43f5/ Martin Thompson, Luke "Snabb Switch" Gorrie etc. 
review the C10M presentation from Shmoocon 2013-05-15T09:56:39+00:00 https://groups.google.com/forum/#!topic/mechanical-sympathy/ao44gonVdAY jm This talk has some good points and I think the subject is really interesting. I would take the suggested approach with serious caution. For starters the Linux kernel is nowhere near as bad as it is made out. Last year I worked with a client and we scaled a single server to 1 million concurrent connections with async programming in Java and some sensible kernel tuning. I've heard they have since taken this to over 5 million concurrent connections. BTW Open Onload is an open source implementation. Writing a network stack is a serious undertaking. In a previous life I wrote a network probe and had to reassemble TCP streams and kept getting tripped up by edge cases. It is a great exercise in data structures and lock-free programming. If you need very high-end performance I'd talk to the Solarflare or Mellanox guys before writing my own. There are some errors and omissions in this talk. For example, his range of ephemeral ports is not quite right, and atomic operations are only 15 cycles on Sandy Bridge when hitting local cache. A big issue for me is when he defined C10M he did not mention the TIME_WAIT issue with closing connections. Creating and destroying 1 million connections per second is a major issue. A protocol like HTTP is very broken in that the server closes the socket and therefore has to retain the TCB until the specified timeout occurs to ensure no older packet is delivered to a new socket connection. mechanical-sympathy hardware scaling c10m tcp http scalability snabb-switch martin-thompson https://pinboard.in/ https://pinboard.in/u:jm/b:dfe0a86b2ec0/ CAP Confusion: Problems with ‘partition tolerance’ 2013-05-14T20:18:13+00:00 http://blog.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/ jm So what causes partitions? Two things, really. 
The first is obvious – a network failure, for example due to a faulty switch, can cause the network to partition. The other is less obvious, but fits with the definition [...]: machine failures, either hard or soft. In an asynchronous network, i.e. one where processing a message could take unbounded time, it is impossible to distinguish between machine failures and lost messages. Therefore a single machine failure partitions it from the rest of the network. A correlated failure of several machines partitions them all from the network. Not being able to receive a message is the same as the network not delivering it. In the face of sufficiently many machine failures, it is still impossible to maintain availability and consistency, not because two writes may go to separate partitions, but because the failure of an entire ‘quorum’ of servers may render some recent writes unreadable. (sorry, catching up on old interesting things posted last week...) failure scalability network partitions cap quorum distributed-databases fault-tolerance https://pinboard.in/ https://pinboard.in/u:jm/b:aa948fa8adc0/ Alex Feinberg's response to Damien Katz' anti-Dynamoish/pro-Couchbase blog post 2013-05-14T20:16:19+00:00 https://news.ycombinator.com/item?id=5653266 jm while you are saving on read traffic (online reads only go to the master), you are now decreasing availability (contrary to your stated goal), and increasing system complexity. You also do hurt performance by requiring all writes and reads to be serialized through a single node: unless you plan to have a leader election whenever the node fails to meet a read SLA (which is going to result in a disaster -- I am speaking from personal experience), you will have to accept that you're bottlenecked by a single node. With a Dynamo-style quorum (for either reads or writes), a single straggler will not reduce whole-cluster latency. 
The core point of Dynamo is low latency, availability and handling of all kinds of partitions: whether clean partitions (long term single node failures), transient failures (garbage collection pauses, slow disks, network blips, etc...), or even more complex dependent failures. The reality, of course, is that availability is neither the sole, nor the principal concern of every system. It's perfectly fine to trade off availability for other goals -- you just need to be aware of that trade off. cap distributed-databases databases quorum availability scalability damien-katz alex-feinberg partitions network dynamo riak voldemort couchbase https://pinboard.in/ https://pinboard.in/u:jm/b:87fd2f70fea6/ DataSift Architecture: Realtime Datamining at 120,000 Tweets Per Second 2013-04-23T13:03:14+00:00 http://highscalability.com/blog/2011/11/29/datasift-architecture-realtime-datamining-at-120000-tweets-p.html jm datasift architecture scalability data twitter firehose hbase kafka zeromq https://pinboard.in/ https://pinboard.in/u:jm/b:5c07ab4273cd/ Latency's Worst Nightmare: Performance Tuning Tips and Tricks [slides] 2013-04-19T20:27:52+00:00 https://speakerdeck.com/mza/latencys-worst-nightmare-performance-tuning-tips-and-tricks jm benchmarks aws ec2 ebs piops services scaling scalability presentations https://pinboard.in/ https://pinboard.in/u:jm/b:0dad472cef4b/ High Scalability - Scaling Pinterest - From 0 to 10s of Billions of Page Views a Month in Two Years 2013-04-15T21:17:02+00:00 http://highscalability.com/blog/2013/4/15/scaling-pinterest-from-0-to-10s-of-billions-of-page-views-a.html jm a [Cassandra-style] Cluster Management Algorithm is a SPOF. If there’s a bug it impacts every node. This took them down 4 times. yeah, so, eek ;) clustering sharding architecture aws scalability scaling pinterest via:matt-sergeant redis mysql memcached https://pinboard.in/ https://pinboard.in/u:jm/b:94eb7274d2de/