Pinboard (jm)
https://pinboard.in/u:jm/public/
Recent bookmarks from jm

Prio (2020-09-24)
https://crypto.stanford.edu/prio/paper.pdf
tags: nizk zero-knowledge snark prio crypto privacy data-privacy statistics quantiles percentiles aggregation

Circllhist (2020-02-10)
https://arxiv.org/pdf/2001.06561.pdf
tags: histograms aggregation quantiles percentiles measurement graphs data-structures summaries latency monitoring approximation papers

Google release an open-source differential-privacy lib (2019-09-09)
https://developers.googleblog.com/2019/09/enabling-developers-and-organizations.html
Differentially-private data analysis is a principled approach that enables organizations to learn from the majority of their data while simultaneously ensuring that those results do not allow any individual's data to be distinguished or re-identified. This type of analysis can be implemented in a wide variety of ways and for many different purposes. For example, if you are a health researcher, you may want to compare the average amount of time patients remain admitted across various hospitals in order to determine if there are differences in care. Differential privacy is a high-assurance, analytic means of ensuring that use cases like this are addressed in a privacy-preserving manner.
Currently, we provide algorithms to compute the following:
Count
Sum
Mean
Variance
Standard deviation
Order statistics (including min, max, and median)
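As a rough illustration of the mechanism underlying algorithms like these (a toy sketch of the Laplace mechanism, not this library's actual API), a differentially private count can be produced by adding noise calibrated to the query's sensitivity:

```python
import random

def dp_count(records, epsilon):
    """Toy epsilon-differentially-private count via the Laplace mechanism.

    A count has sensitivity 1 (adding or removing one person's record
    changes the true count by at most 1), so Laplace noise with scale
    1/epsilon is sufficient for epsilon-DP.
    """
    scale = 1.0 / epsilon
    # The difference of two iid exponentials is Laplace-distributed.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return len(records) + noise
```

Smaller epsilon means more noise and stronger privacy; real libraries additionally handle clamping, contribution bounding, and privacy budget accounting.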
tags: analytics google ml privacy differential-privacy aggregation statistics obfuscation approximation algorithms

To Cite or to Steal? When a Scholarly Project Turns Up in a Gallery (2017-05-20)
https://hyperallergic.com/308436/to-cite-or-to-steal-when-a-scholarly-project-turns-up-in-a-gallery/
What I was seeing was an announcement for a show by Jason Shulman at Cob Gallery called Photographs of Films. The press and interviews collected on the gallery’s website lauded a conceptual beauty and rigor in his work, but the only thing I could see was a rip-off. “Email for price list.” These images were unmistakably similar to the distinctive work I had been producing for years, and it was not long before friends started writing to let me know.
tags: copyright art aggregation averaging images movies rip-offs jason-shulman jason-salavon kevin-l-ferguson

ASAP: Automatic Smoothing for Attention Prioritization in Streaming Time Series Visualization (2017-03-15)
https://arxiv.org/pdf/1703.00983.pdf
tags: dataviz graphs metrics peter-bailis asap smoothing aggregation time-series tsd

tdunning/t-digest (2016-12-12)
https://github.com/tdunning/t-digest
A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel-friendly, making it useful in map-reduce and parallel streaming applications.
The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a data structure that is related to the Q-digest. This t-digest data structure can be used to estimate quantiles or compute other rank statistics. The advantage of the t-digest over the Q-digest is that the t-digest can handle floating point values while the Q-digest is limited to integers. With small changes, the t-digest can handle any values from any ordered set that has something akin to a mean. The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by Q-digests in spite of the fact that t-digests are more compact when stored on disk.
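The clustering idea can be sketched in a few lines (a deliberately crude approximation for illustration, not tdunning's implementation): keep (mean, count) centroids whose permitted size shrinks near the extreme quantiles, then estimate a quantile by walking the centroids in order:

```python
def compress(values, delta=10):
    """Summarize sorted data as (mean, count) centroids. Centroids near
    q=0 and q=1 are kept small so tail quantiles stay accurate."""
    values = sorted(values)
    n = len(values)
    centroids = []          # list of (mean, count)
    cur_sum, cur_n, seen = 0.0, 0, 0
    for v in values:
        q = (seen + cur_n) / n
        # size bound: tiny at the tails, largest around the median
        limit = max(1, int(4 * n * q * (1 - q) / delta))
        if cur_n >= limit:
            centroids.append((cur_sum / cur_n, cur_n))
            seen += cur_n
            cur_sum, cur_n = 0.0, 0
        cur_sum += v
        cur_n += 1
    if cur_n:
        centroids.append((cur_sum / cur_n, cur_n))
    return centroids

def quantile(centroids, q):
    """Walk centroids in order until cumulative weight reaches q."""
    total = sum(count for _, count in centroids)
    target = q * total
    acc = 0
    for mean, count in centroids:
        acc += count
        if acc >= target:
            return mean
    return centroids[-1][0]
```

The real algorithm also interpolates between centroids and handles incremental insertion; this sketch only shows why the quantile-dependent size bound gives high tail accuracy from a small summary.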
Super-nice feature is that it's mergeable, so amenable to parallel usage across multiple hosts if required. Java implementation, ASL licensing.
tags: data-structures algorithms java t-digest statistics quantiles percentiles aggregation digests estimation ranking

New study shows Spain’s “Google tax” has been a disaster for publishers (2015-08-05)
http://arstechnica.com/tech-policy/2015/07/new-study-shows-spains-google-tax-has-been-a-disaster-for-publishers/
A study commissioned by Spanish publishers has found that a new intellectual property law passed in Spain last year, which charges news aggregators like Google for showing snippets and linking to news stories, has done substantial damage to the Spanish news industry.
In the short-term, the study found, the law will cost publishers €10 million, or about $10.9 million, which would fall disproportionately on smaller publishers. Consumers would experience a smaller variety of content, and the law "impedes the ability of innovation to enter the market." The study concludes that there's no "theoretical or empirical justification" for the fee.
tags: google news publishing google-tax spain law aggregation snippets economics

Ask the Decoder: Did I sign up for a global sleep study? (2015-03-09)
http://america.aljazeera.com/articles/2014/10/29/sleep-study.html
How meaningful is this corporate data science, anyway? Given the tech-savvy people in the Bay Area, Jawbone likely had a very dense sample of Jawbone wearers to draw from for its Napa earthquake analysis. That allowed it to look at proximity to the epicenter of the earthquake from location information.
Jawbone boasts its sample population of roughly “1 million Up wearers who track their sleep using Up by Jawbone.” But when looking into patterns county by county in the U.S., Jawbone states, it takes certain statistical liberties to show granularity while accounting for places where there may not be many Jawbone users.
So while Jawbone data can show us interesting things about sleep patterns across a very large population, we have to remember how selective that population is. Jawbone wearers are people who can afford a $129 wearable fitness gadget and the smartphone or computer to interact with the output from the device.
Jawbone is sharing what it learns with the public, but think of all the public health interests or other third parties that might be interested in other research questions from a large scale data set. Yet this data is not collected with scientific processes and controls and is not treated with the rigor and scrutiny that a scientific study requires.
Jawbone and other fitness trackers don’t give us the option to use their devices while opting out of contributing to the anonymous data sets they publish. Maybe that ought to change.
tags: jawbone privacy data-protection anonymization aggregation data medicine health earthquakes statistics iot wearables

When data gets creepy: the secrets we don’t realise we’re giving away | Technology | The Guardian (2014-12-06)
http://www.theguardian.com/technology/2014/dec/05/when-data-gets-creepy-secrets-were-giving-away
We are entering an age – which we should welcome with open arms – when patients will finally have access to their own full medical records online. So suddenly we have a new problem. One day, you log in to your medical records, and there’s a new entry on your file: “Likely to die in the next year.” We spend a lot of time teaching medical students to be skilful around breaking bad news. A box ticked on your medical records is not empathic communication. Would we hide the box? Is that ethical? Or are “derived variables” such as these, on a medical record, something doctors should share like anything else?
tags: advertising ethics privacy security law data aggregation metadata ben-goldacre

Solving the Mystery of Link Imbalance: A Metastable Failure State at Scale | Engineering Blog | Facebook Code (2014-11-28)
https://code.facebook.com/posts/1499322996995183/solving-the-mystery-of-link-imbalance-a-metastable-failure-state-at-scale/
Facebook collocates many of a user’s nodes and edges in the social graph. That means that when somebody logs in after a while and their data isn’t in the cache, we might suddenly perform 50 or 100 database queries to a single database to load their data. This starts a race among those queries. The queries that go over a congested link will lose the race reliably, even if only by a few milliseconds. That loss makes them the most recently used when they are put back in the pool. The effect is that during a query burst we stack the deck against ourselves, putting all of the congested connections at the top of the deck.
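The feedback loop is easy to reproduce with a toy MRU (stack-based) connection pool (a hypothetical sketch, not Facebook's actual pool code): the connection that finishes last is checked in last, sits on top of the stack, and is handed out first:

```python
class MRUPool:
    """Most-recently-used connection pool: top of stack is handed out first."""

    def __init__(self, connections):
        self._stack = list(connections)  # end of list = top of stack

    def checkout(self):
        return self._stack.pop()         # most recently returned wins

    def checkin(self, conn):
        self._stack.append(conn)

pool = MRUPool(["fast-1", "fast-2", "congested-1"])

# A burst of three queries races; the query on the congested link
# finishes last, so its connection is checked back in last.
conns = [pool.checkout() for _ in range(3)]
for conn in sorted(conns, key=lambda c: c.startswith("congested")):
    pool.checkin(conn)  # fast connections first, congested one last

print(pool.checkout())  # prints congested-1: it is now on top of the stack
```

Each burst re-stacks the slow connections on top, which is what makes the failure state self-sustaining; an LRU (queue-based) pool would not have this property.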
tags: architecture debugging devops facebook layer-7 mysql connection-pooling aggregation networking tcp-stack

Twitter's TSAR (2014-06-30)
https://blog.twitter.com/2014/tsar-a-timeseries-aggregator
tags: analytics architecture twitter tsar aggregation event-processing metrics streaming hadoop batch

Streaming MapReduce with Summingbird (2013-09-03)
https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
Before Summingbird at Twitter, users that wanted to write production streaming aggregations would typically write their logic using a Hadoop DSL like Pig or Scalding. These tools offered nice distributed system abstractions: Pig resembled familiar SQL, while Scalding, like Summingbird, mimics the Scala collections API. By running these jobs on some regular schedule (typically hourly or daily), users could build time series dashboards with very reliable error bounds at the unfortunate cost of high latency.
While using Hadoop for these types of loads is effective, Twitter is about real-time and we needed a general system to deliver data in seconds, not hours. Twitter’s release of Storm made it easy to process data with very low latencies by sacrificing Hadoop’s fault tolerant guarantees. However, we soon realized that running a fully real-time system on Storm was quite difficult for two main reasons:
Recomputation over months of historical logs must be coordinated with Hadoop or streamed through Storm with a custom log loading mechanism;
Storm is focused on message passing and random-write databases are harder to maintain.
The types of aggregations one can perform in Storm are very similar to what’s possible in Hadoop, but the system issues are very different.
Summingbird began as an investigation into a hybrid system that could run a streaming aggregation in both Hadoop and Storm, as well as merge automatically without special consideration of the job author. The hybrid model allows most data to be processed by Hadoop and served out of a read-only store. Only data that Hadoop hasn’t yet been able to process (data that falls within the latency window) would be served out of a datastore populated in real-time by Storm. But the error of the real-time layer is bounded, as Hadoop will eventually get around to processing the same data and will smooth out any error introduced. This hybrid model is appealing because you get well understood, transactional behavior from Hadoop, and up to the second additions from Storm.
Despite the appeal, the hybrid approach has the following practical problems:
Two sets of aggregation logic have to be kept in sync in two different systems;
Keys and values must be serialized consistently between each system and the client.
The client is responsible for reading from both datastores, performing a final aggregation, and serving the combined results.
Summingbird was developed to provide a general solution to these problems.
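The associative merge is the crux of the hybrid read path: because the aggregation forms a commutative monoid, the client can combine the batch (Hadoop) partial result with the real-time (Storm) partial result for the latency window in either order. A minimal sketch with hypothetical names (not Summingbird's actual API):

```python
def merge_counts(batch, realtime):
    """Merge two key -> count maps. Addition is associative and
    commutative, so batch and real-time partials combine cleanly."""
    out = dict(batch)
    for key, count in realtime.items():
        out[key] = out.get(key, 0) + count
    return out

batch = {"#scala": 1200, "#hadoop": 800}  # from the read-only batch store
realtime = {"#scala": 7, "#storm": 3}     # within the latency window, from Storm
print(merge_counts(batch, realtime))      # {'#scala': 1207, '#hadoop': 800, '#storm': 3}
```

The same shape works for any monoid value (counts, sets, HyperLogLogs, t-digests), which is why the data-format constraint mentioned below buys so much generality.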
Very interesting stuff. I'm particularly interested in the design constraints they've chosen to impose to achieve this -- data formats which require associative merging in particular.
tags: mapreduce streaming big-data twitter storm summingbird scala pig hadoop aggregation merging

Distributed Streams Algorithms for Sliding Windows [PDF] (2013-02-21)
http://home.engineering.iastate.edu/~snt/pubs/tocs04.pdf
tags: waves papers streaming algorithms percentiles histogram distcomp distributed aggregation statistics estimation streams

HBase Real-time Analytics & Rollbacks via Append-based Updates (2012-12-17)
http://blog.sematext.com/2012/04/22/hbase-real-time-analytics-rollbacks-via-append-based-updates/
'Replace update (Get+Put) operations at write time with simple append-only writes and defer processing of updates to periodic jobs or perform aggregations on the fly if user asks for data earlier than individual additions are processed. The idea is simple and not necessarily novel, but given the specific qualities of HBase, namely fast range scans and high write throughput, this approach works very well.'
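The scheme can be sketched as follows (a hypothetical in-memory stand-in, not actual HBase API calls): writers blindly append deltas, readers fold them on the fly, and a periodic job collapses them:

```python
from collections import defaultdict

# rowkey -> list of appended deltas (standing in for HBase cell versions)
log = defaultdict(list)

def increment(row, delta):
    log[row].append(delta)  # blind append: no Get before the write

def read(row):
    # aggregate on the fly if data is requested before the periodic job runs
    return sum(log[row])

def compact(row):
    # periodic job: collapse the appended deltas into a single value
    log[row] = [sum(log[row])]
    return log[row][0]

increment("pageviews:/home", 1)
increment("pageviews:/home", 1)
print(read("pageviews:/home"))  # prints 2, computed by folding the appends
```

The write path never does a read-modify-write, which is what preserves HBase's high write throughput; the cost is moved to reads and the compaction job.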
tags: counters analytics hbase append sematext aggregation big-data

'Ireland' Reddit (2010-10-01)
http://www.reddit.com/r/ireland
tags: ireland reddit local news aggregation