Pinboard (jm)
https://pinboard.in/u:jm/public/
Recent bookmarks from jm

Paper review: "Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems" (2015-03-27)
http://muratbuffalo.blogspot.co.uk/2015/03/paper-review-simple-testing-can-prevent.html
Tags: race-conditions startup bugs failure fault-tolerance hbase redis reliability ops papers concurrency exception-handling cassandra hdfs mapreduce

Streaming MapReduce with Summingbird (2013-09-03)
https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
Before Summingbird at Twitter, users who wanted to write production streaming aggregations would typically write their logic using a Hadoop DSL like Pig or Scalding. These tools offered nice distributed system abstractions: Pig resembles familiar SQL, while Scalding, like Summingbird, mimics the Scala collections API. By running these jobs on some regular schedule (typically hourly or daily), users could build time series dashboards with very reliable error bounds at the unfortunate cost of high latency.
While using Hadoop for these types of loads is effective, Twitter is about real-time and we needed a general system to deliver data in seconds, not hours. Twitter’s release of Storm made it easy to process data with very low latencies by sacrificing Hadoop’s fault tolerant guarantees. However, we soon realized that running a fully real-time system on Storm was quite difficult for two main reasons:
Recomputation over months of historical logs must be coordinated with Hadoop or streamed through Storm with a custom log loading mechanism;
Storm is focused on message passing, and random-write databases are harder to maintain.
The types of aggregations one can perform in Storm are very similar to what's possible in Hadoop, but the system issues are very different. Summingbird began as an investigation into a hybrid system that could run a streaming aggregation in both Hadoop and Storm, and merge the results automatically without special consideration from the job author. The hybrid model allows most data to be processed by Hadoop and served out of a read-only store. Only data that Hadoop hasn't yet been able to process (data that falls within the latency window) would be served out of a datastore populated in real-time by Storm. But the error of the real-time layer is bounded, as Hadoop will eventually get around to processing the same data and will smooth out any error introduced. This hybrid model is appealing because you get well-understood, transactional behavior from Hadoop, and up-to-the-second additions from Storm. Despite the appeal, the hybrid approach has the following practical problems:
Two sets of aggregation logic have to be kept in sync in two different systems;
Keys and values must be serialized consistently between each system and the client;
The client is responsible for reading from both datastores, performing a final aggregation, and serving the combined results.
Summingbird was developed to provide a general solution to these problems.
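The hybrid batch/real-time merge described above can be sketched in a few lines. This is a toy illustration, not Summingbird's actual API: batch results cover everything up to some cutoff, the real-time layer covers the latency window after it, and the client combines the two with an associative operation (here, integer addition), so a later batch recomputation over the same data smooths out any real-time error.

```python
# Minimal sketch (hypothetical stores, not Summingbird code) of the
# client-side merge across the batch and real-time layers. Because the
# merge operation is associative, it doesn't matter how the key space
# is split between the two layers.

from collections import defaultdict

def merge_counts(batch: dict, realtime: dict) -> dict:
    """Combine per-key counts from both layers with addition."""
    merged = defaultdict(int)
    for store in (batch, realtime):
        for key, count in store.items():
            merged[key] += count
    return dict(merged)

# Batch aggregate (read-only, hours old) and real-time aggregate
# (seconds old, covering only the latency window):
batch_store = {"#hadoop": 120, "#storm": 45}
realtime_store = {"#storm": 3, "#scala": 7}

print(merge_counts(batch_store, realtime_store))
# → {'#hadoop': 120, '#storm': 48, '#scala': 7}
```

The requirement that merging be associative is exactly the design constraint on data formats mentioned below: any value type with an associative combine (counts, sets, sketches) can be split across the two layers and recombined at read time.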
Very interesting stuff. I'm particularly interested in the design constraints they've chosen to impose to achieve this -- data formats which require associative merging in particular.

Tags: mapreduce streaming big-data twitter storm summingbird scala pig hadoop aggregation merging

CloudBurst (2012-07-23)
http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php?title=CloudBurst
CloudBurst uses well-known seed-and-extend algorithms to map reads to a reference genome. It can map reads with any number of differences or mismatches. [..] Given an exact seed, CloudBurst attempts to extend the alignment into an end-to-end alignment with at most k mismatches or differences by either counting mismatches of the two sequences, or with a dynamic programming algorithm to allow for gaps. CloudBurst uses [Hadoop] to catalog and extend the seeds. In the map phase, the map function emits all length-s k-mers from the reference sequences, and all non-overlapping length-s k-mers from the reads. In the shuffle phase, read and reference k-mers are brought together. In the reduce phase, the seeds are extended into end-to-end alignments. The power of MapReduce and CloudBurst is that the map and reduce functions run in parallel over dozens or hundreds of processors.
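The map and shuffle phases described above can be sketched as follows. This is a toy illustration with hypothetical helper names, not CloudBurst code: the map side emits every overlapping length-s k-mer from the reference but only non-overlapping k-mers from each read, and grouping by k-mer then plays the role of the shuffle phase, bringing matching seeds together for extension.

```python
# Toy sketch (hypothetical helpers, not CloudBurst code) of emitting
# seeds keyed by k-mer, as in the map/shuffle phases described above.

from collections import defaultdict

def reference_kmers(seq: str, s: int):
    # Every overlapping window of length s from the reference.
    return [(seq[i:i + s], ("ref", i)) for i in range(len(seq) - s + 1)]

def read_kmers(seq: str, s: int):
    # Non-overlapping windows only: stride s instead of 1.
    return [(seq[i:i + s], ("read", i)) for i in range(0, len(seq) - s + 1, s)]

def shuffle(pairs):
    # Group emitted (k-mer, origin) pairs by k-mer, as the MapReduce
    # shuffle would; each group is a set of candidate seeds to extend.
    groups = defaultdict(list)
    for kmer, origin in pairs:
        groups[kmer].append(origin)
    return groups

seeds = shuffle(reference_kmers("ACGTACGT", 4) + read_kmers("ACGT", 4))
print(seeds["ACGT"])
# → [('ref', 0), ('ref', 4), ('read', 0)]
```

In the real system the reduce phase would then extend each read/reference seed pair into an end-to-end alignment with at most k mismatches; that step is omitted here.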
JM_SOUGHT -- the next generation ;)

Tags: bioinformatics mapreduce hadoop read-alignment dna sequencing sought antispam algorithms

MapReduce Patterns, Algorithms, and Use Cases (2012-02-16)
http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
Tags: algorithms hadoop java mapreduce patterns distcomp

MapReduce as a way to cope with high-latency memory (2010-06-21)
http://lists.canonical.org/pipermail/kragen-tol/2010-June/000917.html
Tags: kragen thoughts random mapreduce memory speed latency