Pinboard (jm)
https://pinboard.in/u:jm/public/
Recent bookmarks from jm

Trino on Ice IV: Deep Dive Into Iceberg Internals (2021-06-09)
https://blog.starburst.io/trino-on-ice-iv-deep-dive-into-iceberg-internals
Tags: trino iceberg data big-data data-lakes formats s3 avro orc

FlexBuffers | Hacker News (2020-06-22)
https://news.ycombinator.com/item?id=23588558
Tags: flatbuffers flexbuffers json encoding data formats file-formats avro protobuf zerocopy sbe schemas

Schema evolution in Avro, Protocol Buffers and Thrift (2016-01-29)
https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
Tags: avro thrift protobuf schemas serialization coding interop compatibility
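
Kleppmann's post compares how each format resolves data written under an old schema against a newer one. As a minimal sketch of the Avro case only, assuming the third-party fastavro library and invented field names: a record encoded with the writer's schema is decoded with a reader schema that has gained a defaulted field.

    # Sketch of Avro schema resolution with fastavro (assumed
    # installed: pip install fastavro). Field names are illustrative.
    import io
    from fastavro import schemaless_writer, schemaless_reader

    # Writer schema: what the producer used when the record was encoded.
    writer_schema = {
        "type": "record", "name": "User",
        "fields": [{"name": "name", "type": "string"}],
    }
    # Reader schema: a later version adding a field with a default,
    # so old records remain readable (backward compatibility).
    reader_schema = {
        "type": "record", "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "email", "type": ["null", "string"], "default": None},
        ],
    }

    buf = io.BytesIO()
    schemaless_writer(buf, writer_schema, {"name": "jm"})
    buf.seek(0)
    # Resolution fills in the default for the field the writer lacked.
    print(schemaless_reader(buf, writer_schema, reader_schema))
    # -> {'name': 'jm', 'email': None}

Protocol Buffers and Thrift get the same effect through numbered field tags rather than named-field schema resolution.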

Avro, mail # dev - bytes and fixed handling in Python implementation - 2014-09-04, 22:54 (2015-03-31)
http://search-hadoop.com/m/icC8xQfO8
Tags: bytes avro marshalling fail bugs python json utf-8
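
The linked thread concerns the Avro Python implementation mishandling bytes and fixed values by passing them through text encoding. As an illustration of the underlying pitfall, not the thread's actual patch: arbitrary binary is not valid UTF-8, whereas the ISO-8859-1 mapping that Avro's JSON encoding specifies for bytes round-trips every byte value.

    # Why treating Avro "bytes" as UTF-8 text corrupts data: arbitrary
    # byte strings are not valid UTF-8. Illustration only, not the fix
    # from the linked thread.
    raw = bytes(range(256))

    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as e:
        print("utf-8 cannot represent arbitrary bytes:", e)

    # The Avro spec's JSON encoding maps bytes through ISO-8859-1
    # (latin-1), which round-trips all 256 byte values losslessly.
    assert raw.decode("latin-1").encode("latin-1") == raw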

Kafka best practices (2015-03-26)
http://blog.confluent.io/2015/02/25/stream-data-platform-2/
This is the second part of our guide on streaming data and Apache Kafka. In part one I talked about the uses for real-time data streams and explained our idea of a stream data platform. The remainder of this guide will contain specific advice on how to go about building a stream data platform in your organization.
tl;dr: limit the number of Kafka clusters; use Avro.
Tags: architecture kafka storage streaming event-processing avro schema confluent best-practices tips
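
The "use Avro" advice means producers serialize records against an agreed schema before they hit the wire. A hedged sketch, assuming the third-party kafka-python and fastavro libraries, with the broker address, topic, and schema invented for illustration:

    # Sketch: producing Avro-encoded events to Kafka, assuming
    # kafka-python and fastavro are installed. Broker and topic
    # names are placeholders.
    import io
    from fastavro import schemaless_writer
    from kafka import KafkaProducer

    schema = {
        "type": "record", "name": "PageView",
        "fields": [
            {"name": "url", "type": "string"},
            {"name": "ts", "type": "long"},
        ],
    }

    def encode(record):
        # Serialize one record with the agreed schema; in a real
        # deployment the schema would live in a schema registry.
        buf = io.BytesIO()
        schemaless_writer(buf, schema, record)
        return buf.getvalue()

    producer = KafkaProducer(bootstrap_servers="localhost:9092",
                             value_serializer=encode)
    producer.send("pageviews", {"url": "/home", "ts": 1424822400000})
    producer.flush()

Confluent's own stack pairs this with a schema registry, so producers and consumers negotiate schema versions instead of hard-coding them.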

The problem of managing schemas (2014-11-10)
http://radar.oreilly.com/2014/11/the-problem-of-managing-schemas.html
Eventually, the schema changes. Someone refactors the code generating the JSON and moves fields around, perhaps renaming a few fields. The DBA adds new columns to a MySQL table, and this is reflected in the CSVs dumped from the table. Now all those applications and scripts must be modified to handle both file formats. And since schema changes happen frequently, and often without warning, this results both in ugly and unmaintainable code and in grumpy developers who are tired of having to modify their scripts again and again.
Tags: schema json avro protobuf csv data-formats interchange data hadoop files file-formats
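
The article's complaint is the hand-rolled compatibility shim every consumer grows when schemas drift informally. A minimal sketch of that anti-pattern, with invented field names; this is exactly the code that schema-aware formats like Avro make unnecessary:

    # Sketch of the hand-rolled compatibility shims the article argues
    # against: every consumer must know every historical field layout.
    # Field names are invented for illustration.
    import json

    def normalize(raw: str) -> dict:
        rec = json.loads(raw)
        # v2 renamed "user_name" to "username"; handle both forever.
        if "user_name" in rec:
            rec["username"] = rec.pop("user_name")
        # v3 added "country"; older records lack it.
        rec.setdefault("country", None)
        return rec

    print(normalize('{"user_name": "jm"}'))
    print(normalize('{"username": "jm", "country": "IE"}'))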

Integrating Kafka and Spark Streaming: Code Examples and State of the Game (2014-10-06)
http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
Spark Streaming has been getting some attention lately as a real-time data processing tool, often mentioned alongside Apache Storm. [...] I added an example Spark Streaming application to kafka-storm-starter that demonstrates how to read from Kafka and write to Kafka, using Avro as the data format and Twitter Bijection for handling the data serialization. In this post I will explain this Spark Streaming example in further detail and also shed some light on the current state of Kafka integration in Spark Streaming. All this with the disclaimer that this happens to be my first experiment with Spark Streaming.
Tags: spark kafka realtime architecture queues avro bijection batch-processing
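
The post's examples are Scala, using Twitter Bijection to convert Avro records to and from Kafka's byte payloads. As a rough Python analogue of the decode step only, not the post's actual code, assuming kafka-python and fastavro plus the placeholder topic and schema from the producer sketch above:

    # Sketch of the consuming side: reading Avro-encoded messages off
    # Kafka. Topic, broker, and schema are placeholders matching the
    # producer sketch earlier in this list.
    import io
    from fastavro import schemaless_reader
    from kafka import KafkaConsumer

    schema = {
        "type": "record", "name": "PageView",
        "fields": [
            {"name": "url", "type": "string"},
            {"name": "ts", "type": "long"},
        ],
    }

    consumer = KafkaConsumer("pageviews",
                             bootstrap_servers="localhost:9092")
    for msg in consumer:
        # msg.value is the raw Avro byte payload from the producer.
        record = schemaless_reader(io.BytesIO(msg.value), schema)
        print(record)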

VInt (2009-11-04)
http://lucene.apache.org/java/2_4_0/fileformats.html#VInt
Tags: utf8 compression utf lucene avro hadoop java formats numeric
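
VInt is Lucene's variable-length integer encoding: seven data bits per byte, with the high bit marking that another byte follows, so small values cost a single byte. A minimal sketch of the scheme:

    # Lucene-style VInt: 7 data bits per byte, high bit set on every
    # byte except the last. Small values take one byte.
    def encode_vint(n: int) -> bytes:
        assert n >= 0
        out = bytearray()
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)  # low 7 bits + continuation flag
            n >>= 7
        out.append(n)  # final byte: high bit clear
        return bytes(out)

    def decode_vint(buf: bytes) -> int:
        n, shift = 0, 0
        for b in buf:
            n |= (b & 0x7F) << shift
            if not b & 0x80:  # high bit clear: last byte of the value
                break
            shift += 7
        return n

    assert encode_vint(127) == b"\x7f"      # fits in one byte
    assert encode_vint(128) == b"\x80\x01"  # spills into a second byte
    assert decode_vint(encode_vint(300)) == 300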