Pinboard (jm)
https://pinboard.in/u:jm/public/
recent bookmarks from jm

UK COVID vaccination modelling was dependent on a single Pythonista
2024-02-12T16:11:24+00:00
https://christinapagel.substack.com/p/where-are-we-with-covid-in-england
jm
excel python modelling statistics uk ukhsa qa covid-19 quality-control
https://pinboard.in/u:jm/b:67997cd0272c/

Turning Poetry into Art: Joanne McNeil on Large Language Models and the Poetry of Allison Parrish | Filmmaker Magazine
2023-07-31T16:13:48+00:00
https://filmmakermagazine.com/121867-joanne-mcneil-large-language-models-allison-parrish/
jm
Parrish has long thought of her work in conversation with Oulipo and other avant-garde movements, “using randomness to produce juxtapositions of concepts to make you think more deeply about the language that you’re using.” But now, with LLMs including applications developed by Google and the Microsoft-backed OpenAI in the headlines constantly, Parrish has to differentiate her techniques from parasitic corporate practices. “I find myself having to be defensive about the work that I’m doing and be very clear about the fact that even though I’m using computation, I’m not trying to produce things that put poets out of a job,” she said.
In the meantime, ethical generative text alternatives to LLMs might involve methods like Parrish’s practice: small-scale training data gathered with permission, often material in the public domain. “Just because something’s in the public domain doesn’t necessarily mean that it’s ethical to use it, but it’s a good starting point,” Parrish told me. ...
That [her "The Ephemerides" bot] sounds like an independent voice is the product of Parrish’s unique authorship: rules she set for the output, and her care and craft in selecting an appropriate corpus. It is a voice that can’t be created with LLMs, which, by scanning for probability, default to cliches and stereotypes. “They’re inherently conservative,” Parrish said. “They encode the past, literally. That’s what they’re doing with these data sets.”
ai poetry ml statistics alison-parrish art poems generative-art text randomness
https://pinboard.in/u:jm/b:7601debb0064/

Latest Long Covid estimates
2022-10-19T10:07:45+00:00
https://jamanetwork.com/journals/jama/fullarticle/2797443
jm
A total of 1.2 million individuals who had symptomatic SARS-CoV-2 infection were included (mean age, 4-66 years; males, 26%-88%). In the modeled estimates, 6.2% (95% uncertainty interval [UI], 2.4%-13.3%) of individuals who had symptomatic SARS-CoV-2 infection experienced at least 1 of the 3 Long COVID symptom clusters in 2020 and 2021, including 3.2% (95% UI, 0.6%-10.0%) for persistent fatigue with bodily pain or mood swings, 3.7% (95% UI, 0.9%-9.6%) for ongoing respiratory problems, and 2.2% (95% UI, 0.3%-7.6%) for cognitive problems after adjusting for health status before COVID-19, comprising an estimated 51.0% (95% UI, 16.9%-92.4%), 60.4% (95% UI, 18.9%-89.1%), and 35.4% (95% UI, 9.4%-75.1%), respectively, of Long COVID cases. The Long COVID symptom clusters were more common in women aged 20 years or older (10.6% [95% UI, 4.3%-22.2%]) 3 months after symptomatic SARS-CoV-2 infection than in men aged 20 years or older (5.4% [95% UI, 2.2%-11.7%]). Both sexes younger than 20 years of age were estimated to be affected in 2.8% (95% UI, 0.9%-7.0%) of symptomatic SARS-CoV-2 infections. The estimated mean Long COVID symptom cluster duration was 9.0 months (95% UI, 7.0-12.0 months) among hospitalized individuals and 4.0 months (95% UI, 3.6-4.6 months) among nonhospitalized individuals. Among individuals with Long COVID symptoms 3 months after symptomatic SARS-CoV-2 infection, an estimated 15.1% (95% UI, 10.3%-21.1%) continued to experience symptoms at 12 months.
long-covid statistics disease covid-19 papers jama disability
https://pinboard.in/u:jm/b:cf65bc10f43a/

The model used to simulate the Irish COVID-19 response
2021-07-01T08:39:12+00:00
https://twitter.com/President_MU/status/1410315246791802884
jm
modelling data-science statistics epidemiology pandemics covid-19 sars-cov-2 ireland philip-nolan seir
https://pinboard.in/u:jm/b:21e859682f26/

want to ace an AI-based interview? add a bookshelf in the background
2021-02-22T10:59:18+00:00
https://twitter.com/hatr/status/1361756449802768387
jm
correlation clever-hans funny ai ml interviewing statistics phrenology
https://pinboard.in/u:jm/b:275724d41508/

Prio
2020-09-24T11:23:08+00:00
https://crypto.stanford.edu/prio/paper.pdf
jm
nizk zero-knowledge snark prio crypto privacy data-privacy statistics quantiles percentiles aggregation
https://pinboard.in/u:jm/b:2aa675a6dbfc/

Sweden has the smallest average household size in Europe
2020-09-21T10:14:59+00:00
https://twitter.com/AdamJKucharski/status/1307958852248272898
jm
covid-19 sweden households europe statistics eu housing
https://pinboard.in/u:jm/b:940ab75a2d35/

illustration of how a rise in SARS-CoV-2 positivity in younger groups can soon become a rise in older groups
2020-09-08T09:26:06+00:00
https://twitter.com/vincentglad/status/1303243869933404161/photo/1
jm
testing covid-19 age epidemiology dataviz statistics marseilles france
https://pinboard.in/u:jm/b:7ed9065e39e7/

A-Levels: The Model is not the Student
2020-08-17T11:19:06+00:00
http://thaines.com/post/alevels2020
jm
ofqual schools marks exams a-levels uk estimation statistics maths fail
https://pinboard.in/u:jm/b:f95b2089f16c/

Interpreting Covid-19 Test Results: A Bayesian Approach
2020-06-13T22:42:29+00:00
https://medium.com/@Bob_Wachter/interpreting-covid-19-test-results-a-bayesian-approach-df058dad2ade
jm
a brief tutorial on Covid-19 testing, with an emphasis on a Bayesian approach. After presenting the basics, we’ll walk through four confusing Covid-19 testing scenarios, just to give you a feel for the kinds of pickles we often find ourselves in.
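The Bayesian reading of a test result that the tutorial emphasises can be sketched in a few lines. This is a minimal illustration of Bayes' rule, not code from the linked article: the function name `posterior_infected` and every number below (prior, sensitivity, specificity) are assumptions chosen only to show how the prior changes the interpretation.

```python
def posterior_infected(prior: float, sensitivity: float,
                       specificity: float, positive: bool) -> float:
    """Probability of infection after one test result, via Bayes' rule."""
    if positive:
        p_result_if_infected = sensitivity        # true positive rate
        p_result_if_healthy = 1.0 - specificity   # false positive rate
    else:
        p_result_if_infected = 1.0 - sensitivity  # false negative rate
        p_result_if_healthy = specificity         # true negative rate
    numerator = p_result_if_infected * prior
    return numerator / (numerator + p_result_if_healthy * (1.0 - prior))


# Hypothetical scenarios (illustrative numbers, not from the article):
# a positive result when infection is unlikely beforehand ...
low_prior_positive = posterior_infected(0.01, 0.70, 0.99, positive=True)
# ... and a negative result when infection is quite likely beforehand.
high_prior_negative = posterior_infected(0.50, 0.70, 0.99, positive=False)

print(f"positive test, 1% prior:  {low_prior_positive:.3f}")
print(f"negative test, 50% prior: {high_prior_negative:.3f}")
```

With these assumed numbers, a positive result at a low prior still leaves infection less likely than not, and a negative result at a high prior leaves a substantial residual probability.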
prevalence covid-19 bayes bayesian statistics testing
https://pinboard.in/u:jm/b:419f48573a05/

on COVID-19 death rate statistics
2020-04-22T12:01:00+00:00
https://twitter.com/Care2much18/status/1252819591090155523
jm
covid-19 statistics lies-damn-lies death-rates comorbidity diseases europe deaths
https://pinboard.in/u:jm/b:a34779538059/

Coronavirus: Why You Must Act Now - Tomas Pueyo - Medium
2020-03-11T10:30:15+00:00
https://medium.com/@tomaspueyo/coronavirus-act-today-or-people-will-die-f4d3d9cd99ca
jm
coronavirus covid19 healthcare epidemiology diseases statistics
https://pinboard.in/u:jm/b:afc303b28ce7/

Want To Make Money? Build A Business On A Bike Lane
2019-11-25T11:30:25+00:00
https://www.fastcompany.com/90182112/want-to-make-money-build-a-business-on-a-bike-lane
jm
numbers statistics cycling bike-lanes shops
https://pinboard.in/u:jm/b:4a8cc7b40f48/

electricityMap
2019-10-07T16:33:35+00:00
https://www.electricitymap.org/?page=country&solar=false&remote=true&wind=false&countryCode=IE
jm
electricity statistics graphs data energy climate renewables carbon co2
https://pinboard.in/u:jm/b:4a92fe5beab6/

Google release an open-source differential-privacy lib
2019-09-09T14:24:03+00:00
https://developers.googleblog.com/2019/09/enabling-developers-and-organizations.html
jm
Differentially-private data analysis is a principled approach that enables organizations to learn from the majority of their data while simultaneously ensuring that those results do not allow any individual's data to be distinguished or re-identified. This type of analysis can be implemented in a wide variety of ways and for many different purposes. For example, if you are a health researcher, you may want to compare the average amount of time patients remain admitted across various hospitals in order to determine if there are differences in care. Differential privacy is a high-assurance, analytic means of ensuring that use cases like this are addressed in a privacy-preserving manner.
Currently, we provide algorithms to compute the following:
Count
Sum
Mean
Variance
Standard deviation
Order statistics (including min, max, and median)
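To make the first two aggregates in the list above concrete, here is a minimal sketch of the textbook Laplace mechanism for a private count and a private bounded sum. It is not the API of Google's library; the function names, the `epsilon` values, the clamping bounds and the sample data are all assumptions made for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, epsilon: float, rng: random.Random) -> float:
    # Adding or removing one record changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return len(values) + laplace_noise(1.0 / epsilon, rng)


def private_sum(values, lower: float, upper: float,
                epsilon: float, rng: random.Random) -> float:
    # Clamp each value so one record can move the sum by at most
    # max(|lower|, |upper|); that bound is the sensitivity.
    clamped = [min(max(v, lower), upper) for v in values]
    sensitivity = max(abs(lower), abs(upper))
    return sum(clamped) + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical hospital stay lengths in days (made-up data).
stays = [2.0, 5.0, 3.0, 14.0, 1.0, 7.0]
rng = random.Random(0)
noisy_count = private_count(stays, epsilon=0.5, rng=rng)
noisy_sum = private_sum(stays, lower=0.0, upper=30.0, epsilon=0.5, rng=rng)
print(noisy_count, noisy_sum)
```

A smaller `epsilon` means more noise and stronger privacy. A private mean would typically divide a noisy sum by a noisy count while splitting the privacy budget between the two queries; that step is left out of this sketch.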
analytics google ml privacy differential-privacy aggregation statistics obfuscation approximation algorithms
https://pinboard.in/u:jm/b:98439e468432/

France Bans Judge Analytics, 5 Years In Prison For Rule Breakers
2019-06-05T10:56:24+00:00
https://www.artificiallawyer.com/2019/06/04/france-bans-judge-analytics-5-years-in-prison-for-rule-breakers/
jm
‘The identity data of magistrates and members of the judiciary cannot be reused with the purpose or effect of evaluating, analysing, comparing or predicting their actual or alleged professional practices.’
As far as Artificial Lawyer understands, this is the very first example of such a ban anywhere in the world. Insiders in France told Artificial Lawyer that the new law is a direct result of an earlier effort to make all case law easily accessible to the general public, which was seen at the time as improving access to justice and a big step forward for transparency in the justice sector.
However, judges in France had not reckoned on NLP and machine learning companies taking the public data and using it to model how certain judges behave in relation to particular types of legal matter or argument, or how they compare to other judges.
In short, they didn’t like how the pattern of their decisions – now relatively easy to model – were potentially open for all to see.
censorship france analytics judgements legal judges statistics
https://pinboard.in/u:jm/b:70c20892a220/

[1902.04023] Computing Extremely Accurate Quantiles Using t-Digests
2019-02-18T11:05:52+00:00
https://arxiv.org/abs/1902.04023
jm
java go python open-source quantiles percentiles approximation statistics sketching algorithms via:fanf
https://pinboard.in/u:jm/b:6c84ec8a0947/

_AI Ethics, Impossibility Theorems and Tradeoffs_
2019-01-28T16:14:07+00:00
https://www.chrisstucchio.com/pubs/slides/crunchconf_2018/slides.pdf
jm
discrimination ethics racism race ai statistics compas machine-learning
https://pinboard.in/u:jm/b:bce0c1fb31f5/

Some facts on immigration to Ireland
2019-01-15T22:15:29+00:00
https://notesonthefront.typepad.com/politicaleconomy/2019/01/the-far-rights-problem-with-immigration-facts.html?fbclid=IwAR2MJkON4vAnWTuPsgFdop61--X0bndsmQd2SZz6ZIo8Jw2uWsGETGeP4Xc
jm
Let’s summarise:
Ireland has a relatively high level of non-citizens in its population. But this is down to the high level of UK citizens and citizens from other English-speaking countries (US, Canada, Australia and New Zealand).
Ireland has significantly fewer non-citizens from outside the English-speaking world than high-income EU countries.
The proportion of non-citizens has remained stable over the last 10 years (i.e. there is no ‘surge’).
Non-citizens in Ireland are more integrated into the labour market than any other high-income EU country – that is, there is lower unemployment among non-citizens. So much for the ‘sponging-off-the-state’ argument.
We have had far fewer asylum-seekers and we grant asylum to far fewer than most other high-income EU countries.
The claims of the Far Right and their allies collapse when we look to reality.
immigration facts statistics ireland asylum-seekers
https://pinboard.in/u:jm/b:8657155c24f0/

Surprisingly Little Evidence for the Accepted Wisdom About Teeth - The New York Times
2018-09-14T13:55:42+00:00
https://www.nytimes.com/2016/08/30/upshot/surprisingly-little-evidence-for-the-usual-wisdom-about-teeth.html
jm
A systematic review in 2011 concluded that, in adults, toothbrushing with flossing versus toothbrushing alone most likely reduced gingivitis, or inflammation of the gums. But there was really weak evidence that it reduced plaque in the short term. There was no evidence that it reduced cavities. That’s pretty much what we learned recently.
teeth dentistry dental health medicine statistics science
https://pinboard.in/u:jm/b:79ed3cc85695/

ICE's Risk Classification Assessment turned into a digital rubber stamp
2018-06-26T22:39:40+00:00
https://www.reuters.com/investigates/special-report/usa-immigration-court/
jm
To conform to Trump’s policies, Reuters has learned, ICE modified a tool officers have been using since 2013 when deciding whether an immigrant should be detained or released on bond. The computer-based Risk Classification Assessment uses statistics to determine an immigrant’s flight risk and danger to society.
Previously, the tool automatically recommended either “detain” or “release.” Last year, ICE spokesman Bourke said, the agency removed the “release” recommendation.
More: https://motherboard.vice.com/en_us/article/evk3kw/ice-modified-its-risk-assessment-software-so-it-automatically-recommends-detention
immigration statistics machine-learning rubber-stamping fake-algorithms whitewashing ice us-politics
https://pinboard.in/u:jm/b:52d02fbdf434/

A Closer Look at Experian Big Data and Artificial Intelligence in Durham Police
2018-04-09T10:27:17+00:00
https://bigbrotherwatch.org.uk/2018/04/a-closer-look-at-experian-big-data-and-artificial-intelligence-in-durham-police/
jm
experian marketing credit-score data policing uk durham ai statistics crime hart
https://pinboard.in/u:jm/b:888686c06181/

Random with care
2018-01-05T22:14:12+00:00
https://eev.ee/blog/2018/01/02/random-with-care/
jm
coding random math rngs prngs statistics distributions
https://pinboard.in/u:jm/b:2f07c1bad73e/

The Impenetrable Program Transforming How Courts Treat DNA Evidence | WIRED
2017-11-30T11:45:46+00:00
https://www.wired.com/story/trueallele-software-transforming-how-courts-treat-dna-evidence/
jm
law justice trueallele software dna evidence statistics probability code-review auditing
https://pinboard.in/u:jm/b:8b677ee89f88/

Cycling to work: major new study suggests health benefits are staggering
2017-08-20T21:17:27+00:00
https://theconversation.com/cycling-to-work-major-new-study-suggests-health-benefits-are-staggering-76292#link_time=1501254014
jm
We found that cycling to work was associated with a 41% lower risk of dying overall compared to commuting by car or public transport. Cycle commuters had a 52% lower risk of dying from heart disease and a 40% lower risk of dying from cancer. They also had 46% lower risk of developing heart disease and a 45% lower risk of developing cancer at all.
cycling transport health medicine science commuting life statistics
https://pinboard.in/u:jm/b:ce8e076e456b/

Physical separation of cyclists from traffic “crucial” to dropping injury rates, shows U.S. study
2017-05-13T21:01:51+00:00
https://cyclingindustry.news/physical-separation-of-cyclists-from-traffic-crucial-to-dropping-injury-rates-shows-u-s-study/
jm
Citing a further study of differing types of cycling infrastructure in Canada, the editorial writes that an 89% increase in safety was noted on streets with physical separation over streets where no such infrastructure existed. Unprotected cycling space was found to be 53% safer.
In 2014 there were 902 recorded cyclists fatalities in America and 35,206 serious injuries. Per kilometre cycled fatalities per 100 million kilometres cycled sat at 4.7. In the Netherlands and Denmark those rates sit at 1 and 1.1, respectively.
cycling infrastructure roads safety accidents cars statistics us canada
https://pinboard.in/u:jm/b:c9fca3759498/

The eigenvector of "Why we moved from language X to language Y"
2017-03-16T23:18:20+00:00
https://erikbern.com/2017/03/15/the-eigenvector-of-why-we-moved-from-language-x-to-language-y.html
jm
statistics programming languages golang go mysql coding
https://pinboard.in/u:jm/b:bc481ec8d1b8/

tdunning/t-digest
2016-12-12T12:28:16+00:00
https://github.com/tdunning/t-digest
jm
A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel friendly making it useful in map-reduce and parallel streaming applications.
The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a data structure that is related to the Q-digest. This t-digest data structure can be used to estimate quantiles or compute other rank statistics. The advantage of the t-digest over the Q-digest is that the t-digest can handle floating point values while the Q-digest is limited to integers. With small changes, the t-digest can handle any values from any ordered set that has something akin to a mean. The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by Q-digests in spite of the fact that t-digests are more compact when stored on disk.
Super-nice feature is that it's mergeable, so amenable to parallel usage across multiple hosts if required. Java implementation, ASL licensing.
data-structures algorithms java t-digest statistics quantiles percentiles aggregation digests estimation ranking
https://pinboard.in/u:jm/b:aaf9fb613f21/

The Fall of BIG DATA – arg min blog
2016-11-14T22:01:12+00:00
http://www.argmin.net/2016/11/14/fall-of-big-data/
jm
Our community has developed remarkably effective tools to microtarget advertisements. But if you use ad models to deliver news, that’s propaganda. And just because we didn’t intend to spread rampant misinformation doesn’t mean we are not responsible.
big-data analytics data-science statistics us-politics trump data science propaganda facebook silicon-valley
https://pinboard.in/u:jm/b:1b56e66fcb3a/

How One 19-Year-Old Illinois Man Is Distorting National Polling Averages - The New York Times
2016-10-13T10:57:27+00:00
http://www.nytimes.com/2016/10/13/upshot/how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.html?_r=0
jm
statistics nytimes politics via:reddit donald-trump hilary-clinton polling panels polls
https://pinboard.in/u:jm/b:dd1769070f5d/

MRI software bugs could upend years of research - The Register
2016-07-05T10:40:25+00:00
http://www.theregister.co.uk/2016/07/03/mri_software_bugs_could_upend_years_of_research/?mt=1467666616578
jm
In their paper at PNAS, they write: “the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.”
For example, a bug that's been sitting in a package called 3dClustSim for 15 years, fixed in May 2015, produced bad results (3dClustSim is part of the AFNI suite; the others are SPM and FSL). That's not a gentle nudge that some results might be overstated: it's more like making a bonfire of thousands of scientific papers.
Further: “Our results suggest that the principal cause of the invalid cluster inferences is spatial autocorrelation functions that do not follow the assumed Gaussian shape”.
The researchers used published fMRI results, and along the way they swipe the fMRI community for their “lamentable archiving and data-sharing practices” that prevent most of the discipline's body of work being re-analysed. ®
fmri science mri statistics cluster-inference autocorrelation data papers medicine false-positives fps neuroimaging
https://pinboard.in/u:jm/b:44448561ba5c/

You CAN Average Percentiles
2016-07-05T10:18:14+00:00
http://rpubs.com/jrauser/percentiles
jm
statistics percentiles quantiles john-rauser histograms averaging mean p99
https://pinboard.in/u:jm/b:e2f019aeecad/

Differential Privacy
2016-06-15T10:48:04+00:00
http://blog.cryptographyengineering.com/2016/06/what-is-differential-privacy.html
jm
apple privacy anonymization google rappor algorithms sampling populations statistics differential-privacy
https://pinboard.in/u:jm/b:664a8dab51a2/

The NSA’s SKYNET program may be killing thousands of innocent people
2016-02-16T14:55:33+00:00
http://arstechnica.co.uk/security/2016/02/the-nsas-skynet-program-may-be-killing-thousands-of-innocent-people/
jm
The NSA evaluates the SKYNET program using a subset of 100,000 randomly selected people (identified by their MSIDN/MSI pairs of their mobile phones), and a known group of seven terrorists. The NSA then trained the learning algorithm by feeding it six of the terrorists and tasking SKYNET to find the seventh. This data provides the percentages for false positives in the slide above.
"First, there are very few 'known terrorists' to use to train and test the model," Ball said. "If they are using the same records to train the model as they are using to test the model, their assessment of the fit is completely bullshit. The usual practice is to hold some of the data out of the training process so that the test includes records the model has never seen before. Without this step, their classification fit assessment is ridiculously optimistic."
The reason is that the 100,000 citizens were selected at random, while the seven terrorists are from a known cluster. Under the random selection of a tiny subset of less than 0.1 percent of the total population, the density of the social graph of the citizens is massively reduced, while the "terrorist" cluster remains strongly interconnected. Scientifically-sound statistical analysis would have required the NSA to mix the terrorists into the population set before random selection of a subset—but this is not practical due to their tiny number.
This may sound like a mere academic problem, but, Ball said, is in fact highly damaging to the quality of the results, and thus ultimately to the accuracy of the classification and assassination of people as "terrorists." A quality evaluation is especially important in this case, as the random forest method is known to overfit its training sets, producing results that are overly optimistic. The NSA's analysis thus does not provide a good indicator of the quality of the method.
terrorism surveillance nsa security ai machine-learning random-forests horror false-positives classification statistics
https://pinboard.in/u:jm/b:4442c0f23ed8/

The general birthday problem
2016-02-01T11:03:25+00:00
http://www.johndcook.com/blog/2016/01/30/general-birthday-problem/
jm
hashing hashes collisions birthday-problem birthday-paradox coding probability statistics
https://pinboard.in/u:jm/b:5e19813a6fb5/

The Guinness Brewer Who Revolutionized Statistics
2016-01-04T12:37:26+00:00
http://priceonomics.com/the-guinness-brewer-who-revolutionized-statistics/
jm
Upon completing his work on the t-distribution, Gosset was eager to make his work public. It was an important finding, and one he wanted to share with the wider world. The managers of Guinness were not so keen on this. They realized they had an advantage over the competition by using this method, and were not excited about relinquishing that leg up. If Gosset were to publish the paper, other breweries would be on to them. So they came to a compromise. Guinness agreed to allow Gosset to publish the finding, as long as he used a pseudonym. This way, competitors would not be able to realize that someone on Guinness’s payroll was doing such research, and figure out that the company’s scientifically enlightened approach was key to their success.
statistics william-gosset history guinness brewing t-test pseudonyms dublin
https://pinboard.in/u:jm/b:2209fbbfcbd5/

Placebo effects are weak: regression to the mean is the main reason ineffective treatments appear to work
2015-12-16T14:17:36+00:00
http://www.dcscience.net/2015/12/11/placebo-effects-are-weak-regression-to-the-mean-is-the-main-reason-ineffective-treatments-appear-to-work/
jm
“Statistical regression to the mean predicts that patients selected for abnormalcy will, on the average, tend to improve. We argue that most improvements attributed to the placebo effect are actually instances of statistical regression.”
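The effect described in that quote can be reproduced with a small simulation. This is a hedged sketch, not anything from the linked post: the symptom scale, the noise levels, the 10% selection cut-off and all variable names are assumptions. Nobody in the simulation receives any treatment, yet the group selected for extreme first readings looks better at the second reading.

```python
import random
import statistics

rng = random.Random(42)
N = 10_000

# Each simulated patient has a stable underlying severity plus
# independent measurement noise at each visit (all values assumed).
true_level = [rng.gauss(50, 5) for _ in range(N)]
visit1 = [t + rng.gauss(0, 10) for t in true_level]
visit2 = [t + rng.gauss(0, 10) for t in true_level]

# "Selected for abnormalcy": enrol only the worst 10% at visit 1.
cutoff = sorted(visit1)[int(0.9 * N)]
selected = [i for i in range(N) if visit1[i] >= cutoff]

mean_visit1 = statistics.mean(visit1[i] for i in selected)
mean_visit2 = statistics.mean(visit2[i] for i in selected)

print(f"selected patients: {len(selected)}")
print(f"mean score at visit 1: {mean_visit1:.1f}")
print(f"mean score at visit 2: {mean_visit2:.1f}")
```

The selected group's average falls back towards the population mean at the second visit purely because part of each extreme first reading was noise.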
medicine science statistics placebo evidence via:hn regression-to-the-mean
https://pinboard.in/u:jm/b:71de2897190c/

Very Fast Reservoir Sampling
2015-12-15T12:03:18+00:00
http://erikerlandson.github.io/blog/2015/11/20/very-fast-reservoir-sampling/
jm
statistics reservoir-sampling sampling algorithms poisson bernoulli performance
https://pinboard.in/u:jm/b:c4fe345c5f6b/

Why Percentiles Don’t Work the Way you Think
2015-12-09T11:26:47+00:00
https://www.vividcortex.com/blog/why-percentiles-dont-work-the-way-you-think
jm
performance percentiles quantiles statistics metrics monitoring baron-schwartz vividcortex
https://pinboard.in/u:jm/b:c441c328a979/

The reusable holdout: Preserving validity in adaptive data analysis
2015-08-18T13:21:04+00:00
http://googleresearch.blogspot.ie/2015/08/the-reusable-holdout-preserving.html
jm
statistics google reusable-holdout training ml machine-learning data-analysis holdout corpus sampling
https://pinboard.in/u:jm/b:0b87ef283056/

Dublin Bike Theft Survey Results
2015-05-08T08:56:17+00:00
http://www.dublincycling.com/cycling/bike-theft-survey-results
jm
dublin bikes cycling theft crime statistics infographics dcc
https://pinboard.in/u:jm/b:d3eead7d37e0/

Ask the Decoder: Did I sign up for a global sleep study?
2015-03-09T17:34:21+00:00
http://america.aljazeera.com/articles/2014/10/29/sleep-study.html
jm
How meaningful is this corporate data science, anyway? Given the tech-savvy people in the Bay Area, Jawbone likely had a very dense sample of Jawbone wearers to draw from for its Napa earthquake analysis. That allowed it to look at proximity to the epicenter of the earthquake from location information.
Jawbone boasts its sample population of roughly “1 million Up wearers who track their sleep using Up by Jawbone.” But when looking into patterns county by county in the U.S., Jawbone states, it takes certain statistical liberties to show granularity while accounting for places where there may not be many Jawbone users.
So while Jawbone data can show us interesting things about sleep patterns across a very large population, we have to remember how selective that population is. Jawbone wearers are people who can afford a $129 wearable fitness gadget and the smartphone or computer to interact with the output from the device.
Jawbone is sharing what it learns with the public, but think of all the public health interests or other third parties that might be interested in other research questions from a large scale data set. Yet this data is not collected with scientific processes and controls and is not treated with the rigor and scrutiny that a scientific study requires.
Jawbone and other fitness trackers don’t give us the option to use their devices while opting out of contributing to the anonymous data sets they publish. Maybe that ought to change.
jawbone privacy data-protection anonymization aggregation data medicine health earthquakes statistics iot wearables
https://pinboard.in/u:jm/b:b2cff21e8284/

HdrHistogram: A better latency capture method
2015-02-16T11:27:54+00:00
http://psy-lob-saw.blogspot.ie/2015/02/hdrhistogram-better-latency-capture.html
jm
hdrhistogram hdr histograms statistics latency measurement metrics percentiles quantiles gil-tene nitsan-wakart
https://pinboard.in/u:jm/b:fe1e9f2ecc3d/

scumbag data scientist memes
2015-01-31T11:10:18+00:00
http://www.quickmeme.com/scumbag-data-scientist
jm
funny data-science statistics machine-learning hadoop bayes memes image-macros
https://pinboard.in/u:jm/b:c2158da0bb75/

Stop Playing Monopoly With Your Kids (And Play These Games Instead) | FiveThirtyEight
2015-01-26T16:16:07+00:00
http://fivethirtyeight.com/features/stop-playing-monopoly-with-your-kids-and-play-these-games-instead/
jm
boardgames games kids children 538 statistics ratings
https://pinboard.in/u:jm/b:72d6c2d53a07/

Schneier on Security: Why Data Mining Won't Stop Terror
2015-01-12T15:07:56+00:00
https://www.schneier.com/essays/archives/2005/03/why_data_mining_wont.html
jm
This unrealistically accurate system will generate 1 billion false alarms for every real terrorist plot it uncovers. Every day of every year, the police will have to investigate 27 million potential plots in order to find the one real terrorist plot per month. Raise that false-positive accuracy to an absurd 99.9999 percent and you're still chasing 2,750 false alarms per day -- but that will inevitably raise your false negatives, and you're going to miss some of those 10 real plots.
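The arithmetic behind those figures can be checked with a short script. The excerpt does not state how many events are screened, so `EVENTS_PER_YEAR` below is an assumption back-solved from the quoted numbers (27 million false alarms a day at a 1-in-100 false-positive rate implies roughly a trillion events a year); the 10 real plots per year come from the quote itself. The function name is hypothetical.

```python
EVENTS_PER_YEAR = 1_000_000_000_000  # assumption: implied by the quoted figures
REAL_PLOTS_PER_YEAR = 10             # from the quote: "those 10 real plots"
DAYS_PER_YEAR = 365


def false_alarms_per_year(one_in: int) -> int:
    """False alarms per year when 1 in `one_in` innocent events is flagged."""
    return EVENTS_PER_YEAR // one_in


# 99 percent false-positive accuracy: 1 in 100 innocent events flagged.
alarms_99 = false_alarms_per_year(100)
alarms_per_plot = alarms_99 // REAL_PLOTS_PER_YEAR
alarms_per_day_99 = alarms_99 / DAYS_PER_YEAR

# 99.9999 percent accuracy: 1 in 1,000,000 innocent events flagged.
alarms_per_day_999999 = false_alarms_per_year(1_000_000) / DAYS_PER_YEAR

print(f"false alarms per real plot: {alarms_per_plot:,}")
print(f"false alarms per day at 99%: {alarms_per_day_99:,.0f}")
print(f"false alarms per day at 99.9999%: {alarms_per_day_999999:,.0f}")
```

Under that assumption the script lands on the quoted 1 billion false alarms per plot and about 27 million per day, and on a little over 2,700 per day at the higher accuracy, close to the rounded 2,750 in the quote.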
Also, Ben Goldacre saying the same thing: http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/
internet scanning filtering specificity statistics data-mining terrorism law nsa gchq false-positives false-negatives
https://pinboard.in/u:jm/b:40691f3d07b8/

Introducing practical and robust anomaly detection in a time series
2015-01-07T22:42:32+00:00
https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series
jm
Early detection of anomalies plays a key role in ensuring high-fidelity data is available to our own product teams and those of our data partners. This package helps us monitor spikes in user engagement on the platform surrounding holidays, major sporting events or during breaking news. Beyond surges in social engagement, exogenic factors – such as bots or spammers – may cause an anomaly in number of favorites or followers. The package can be used to find such bots or spam, as well as detect anomalies in system metrics after a new software release. We’re open-sourcing AnomalyDetection because we’d like the public community to evolve the package and learn from it as we have.
statistics twitter r anomaly-detection outliers metrics time-series spikes holt-winters
https://pinboard.in/u:jm/b:569f792516a5/

UncertML
2015-01-07T00:17:32+00:00
http://www.uncertml.org/
jm
a conceptual model, with accompanying XML schema, that may be used to quantify and exchange complex uncertainties in data. The interoperable model can be used to describe uncertainty in a variety of ways including:
Samples
Statistics including mean, variance, standard deviation and quantile
Probability distributions including marginal and joint distributions and mixture models
via:conor uncertainty statistics xml formats
https://pinboard.in/u:jm/b:d25e81904ab8/

'Uncertain<T>: A First-Order Type for Uncertain Data' [paper, PDF]
2014-12-28T23:20:32+00:00
http://www.cs.utexas.edu/users/mckinley/papers/uncertainty-asplos-2014.pdf
jm
'[...] Uncertain<T>, a new programming language abstraction for uncertain data. We implement a Bayesian network semantics for computation and conditionals that improves program correctness. The runtime uses sampling and hypothesis tests to evaluate computation and conditionals lazily and efficiently. We illustrate with sensor and machine learning applications that Uncertain<T> improves expressiveness and accuracy.'
(via Tony Finch)
uncertainty estimation types strong-typing coding probability statistics machine-learning sampling via:fanf
https://pinboard.in/u:jm/b:4e691997eba0/

"The Programming Language Wars - Questions And Responsibilities for the Programming Language Community"
2014-12-04T11:16:06+00:00
http://www.codemesh.io/static/upload/media/141562653162935languagewars.pdf
jm
statistics data coding languages static-typing dynamic
https://pinboard.in/u:jm/b:87eae3c5fd2f/

Life expectancy increases are due mainly to healthier children, not longer old age
2014-11-11T10:35:33+00:00
http://www.ssa.gov/history/lifeexpect.html
jm
via:fplogue statistics taxes life-expectancy pensions infant-mortality health 1930s
https://pinboard.in/u:jm/b:ee24e0e95004/

FelixGV/tehuti
2014-10-09T10:53:00+00:00
https://github.com/FelixGV/tehuti
jmasl2 apache open-source tehuti metrics percentiles quantiles statistics measurement latency kafka voldemort linkedinhttps://pinboard.in/https://pinboard.in/u:jm/b:a2f55ebce7bb/Tehuti2014-10-08T09:45:50+00:00
https://groups.google.com/forum/#!msg/project-voldemort/Y52UyHQ8tBA/9Ei79_RvS3EJ
jmkafka metrics dropwizard java scala jvm timers ewma statistics measurement latency sampling tehuti voldemort linkedin jay-krepshttps://pinboard.in/https://pinboard.in/u:jm/b:b56664c1a098/tinystat - GoDoc2014-09-21T22:20:09+00:00
http://godoc.org/github.com/codahale/tinystat/cmd/tinystat
jmtinystat is used to compare two or more sets of measurements (e.g., multiple runs of benchmarks of two possible implementations) and determine if they are statistically different, using Student's t-test. It's inspired largely by FreeBSD's ministat (written by Poul-Henning Kamp).
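For reference, a minimal sketch of the kind of comparison described: a pooled-variance two-sample Student's t statistic in plain Python. This is not tinystat's own code; the function name `student_t` and the sample timings are made up for illustration, and the sketch assumes both samples have roughly equal variance.

```python
import math
from statistics import mean, variance


def student_t(a, b):
    """Return (t, degrees_of_freedom) for a two-sample Student's t-test.

    Uses the pooled-variance form, which assumes both samples share
    roughly the same variance.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled sample variance, weighted by each sample's degrees of freedom.
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    # Standard error of the difference between the two means.
    std_err = math.sqrt(pooled * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / std_err, df


# Hypothetical benchmark timings for two implementations.
old_impl = [1, 2, 3, 4, 5]
new_impl = [2, 3, 4, 5, 6]
t, df = student_t(old_impl, new_impl)
print(t, df)
```

With these made-up numbers t comes out at about -1.0 with 8 degrees of freedom; the two-sided 95% critical value for 8 degrees of freedom is roughly 2.306, so the two runs would not count as statistically different.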
]]>t-test student statistics go coda-hale tinystat stats tools command-line unixhttps://pinboard.in/https://pinboard.in/u:jm/b:4f23dc3f0640/CausalImpact: A new open-source package for estimating causal effects in time series2014-09-15T10:50:48+00:00
http://google-opensource.blogspot.ie/2014/09/causalimpact-new-open-source-package.html
jmHow can we measure the number of additional clicks or sales that an AdWords campaign generated? How can we estimate the impact of a new feature on app downloads? How do we compare the effectiveness of publicity across countries?
In principle, all of these questions can be answered through causal inference.
In practice, estimating a causal effect accurately is hard, especially when a randomised experiment is not available. One approach we've been developing at Google is based on Bayesian structural time-series models. We use these models to construct a synthetic control — what would have happened to our outcome metric in the absence of the intervention. This approach makes it possible to estimate the causal effect that can be attributed to the intervention, as well as its evolution over time.
We've been testing and applying structural time-series models for some time at Google. For example, we've used them to better understand the effectiveness of advertising campaigns and work out their return on investment. We've also applied the models to settings where a randomised experiment was available, to check how similar our effect estimates would have been without an experimental control.
Today, we're excited to announce the release of CausalImpact, an open-source R package that makes causal analyses simple and fast. With its release, all of our advertisers and users will be able to use the same powerful methods for estimating causal effects that we've been using ourselves.
Our main motivation behind creating the package has been to find a better way of measuring the impact of ad campaigns on outcomes. However, the CausalImpact package could be used for many other applications involving causal inference. Examples include problems found in economics, epidemiology, or the political and social sciences.
]]>causal-inference r google time-series models bayes adwords advertising statistics estimation metricshttps://pinboard.in/https://pinboard.in/u:jm/b:a62b2f300071/Punished for Being Poor: Big Data in the Justice System2014-08-19T13:20:55+00:00
http://www.psmag.com/navigation/politics-and-law/punished-poor-problem-using-big-data-justice-system-88651/
jmCurrently, over 20 states use data-crunching risk-assessment programs for sentencing decisions, usually consisting of proprietary software whose exact methods are unknown, to determine which individuals are most likely to re-offend. The Senate and House are also considering similar tools for federal sentencing. These data programs look at a variety of factors, many of them relatively static, like criminal and employment history, age, gender, education, finances, family background, and residence. Indiana, for example, uses the LSI-R, the legality of which was upheld by the state’s supreme court in 2010. Other states use a model called COMPAS, which uses many of the same variables as LSI-R and even includes high school grades. Others are currently considering the practice as a way to reduce the number of inmates and ensure public safety. (Many more states use or endorse similar assessments when sentencing sex offenders, and the programs have been used in parole hearings for years.) Even the American Law Institute has embraced the practice, adding it to the Model Penal Code, attesting to the tool’s legitimacy.
(via stroan)]]>via:stroan statistics false-positives big-data law law-enforcement penal-code risk sentencinghttps://pinboard.in/https://pinboard.in/u:jm/b:26eaf70354e0/Monitoring Reactive Applications with Kamon2014-05-19T10:41:38+00:00
http://kamon.io/presentations/javacro14/#/
jmmetrics dropwizard hdrhistogram gil-tene kamon akka spray play reactive statistics java scala percentiles latencyhttps://pinboard.in/https://pinboard.in/u:jm/b:dd1798009833/Daylight saving time linked to heart attacks, study finds2014-03-31T08:41:24+00:00
http://www.irishtimes.com/news/health/daylight-saving-time-linked-to-heart-attacks-study-finds-1.1743441
jmSwitching over to daylight saving time, and losing one hour of sleep, raised the risk of having a heart attack the following Monday by 25 per cent, compared to other Mondays during the year, according to a new US study released today. [...] The study found that heart attack risk fell 21 per cent later in the year, on the Tuesday after the clock was returned to standard time, and people got an extra hour’s sleep.
One clear answer: we need 25-hour days.
More details: http://www.sciencedaily.com/releases/2014/03/140329175108.htm --
Researchers used Michigan's BMC2 database, which collects data from all non-federal hospitals across the state, to identify admissions for heart attacks requiring percutaneous coronary intervention from Jan. 1, 2010 through Sept. 15, 2013. A total of 42,060 hospital admissions occurring over 1,354 days were included in the analysis. Total daily admissions were adjusted for seasonal and weekday variation, as the rate of heart attacks peaks in the winter and is lowest in the summer and is also greater on Mondays and lower over the weekend. The hospitals included in this study admit an average of 32 patients having a heart attack on any given Monday. But on the Monday immediately after springing ahead there were on average an additional eight heart attacks. There was no difference in the total weekly number of percutaneous coronary interventions performed for either the fall or spring time changes compared to the weeks before and after the time change.
]]>daylight dst daylight-savings time dates calendar science health heart-attacks michigan hospitals statisticshttps://pinboard.in/https://pinboard.in/u:jm/b:1e3bf8d8c6b0/Analyzing Citibike Usage2014-03-18T14:51:29+00:00
http://abe.is/analyzing-citibike-usage/
jmdata correlation statistics citibike cycling nyc data-science weatherhttps://pinboard.in/https://pinboard.in/u:jm/b:5a3ec788c5c3/How the search for flight AF447 used Bayesian inference2014-03-12T15:33:10+00:00
http://www.bea.aero/fr/enquetes/vol.af.447/metron.search.analysis.pdf
jmmetron bayes bayesian-inference machine-learning statistics via:jgc air-france disasters probability inference searchinghttps://pinboard.in/https://pinboard.in/u:jm/b:e7c127ca54da/Sacked Google worker says staff ratings fixed to fit template2014-03-12T10:59:00+00:00
http://www.irishtimes.com/news/ireland/irish-news/sacked-google-worker-says-staff-ratings-fixed-to-fit-template-1.1721176
jmstack-ranking google ireland employment work bell-curve statistics eric-schmidthttps://pinboard.in/https://pinboard.in/u:jm/b:49351fe530da/"A data scientist is a ..."2014-02-01T21:01:57+00:00
https://twitter.com/jeremyjarvis/status/428848527226437632/photo/1
jmdata-scientist statistics statistician funny jokes san-francisco tech monkigrashttps://pinboard.in/https://pinboard.in/u:jm/b:2030d373aa62/Nassim Taleb: retire Standard Deviation2014-01-15T21:32:15+00:00
http://www.edge.org/response-detail/25401
jmstatistics standard-deviation stddev maths nassim-taleb deviation volatility rmse distributionshttps://pinboard.in/https://pinboard.in/u:jm/b:f9906de54f2e/"The Top 6 Reasons This Infographic Is Just Wrong Enough To Sound Convincing"2013-11-06T17:10:13+00:00
http://cf.broadsheet.ie/wp-content/uploads/2013/11/20131106.jpg
jmdiagrams infographics infoviz visualisation data fail statisticshttps://pinboard.in/https://pinboard.in/u:jm/b:e8bf0f36332a/Statsite2013-11-01T16:58:31+00:00
http://armon.github.io/statsite/
jmStatsite is designed to be both highly performant, and very flexible. To achieve this, it implements the stats collection and aggregation in pure C, using libev to be extremely fast. This allows it to handle hundreds of connections, and millions of metrics. After each flush interval expires, statsite performs a fork/exec to start a new stream handler invoking a specified application. Statsite then streams the aggregated metrics over stdin to the application, which is free to handle the metrics as it sees fit. This allows statsite to aggregate metrics and then ship metrics to any number of sinks (Graphite, SQL databases, etc). There is an included Python script that ships metrics to graphite.
]]>statsd graphite statsite performance statistics service-metrics metrics opshttps://pinboard.in/https://pinboard.in/u:jm/b:98f01ffaa9cc/"Effective Computation of Biased Quantiles over Data Streams" [paper]2013-11-01T16:57:06+00:00
http://www.cs.rutgers.edu/~muthu/bquant.pdf
jm
Skew is prevalent in many data sources such as IP traffic streams. To continually summarize the distribution of such data, a high-biased set of quantiles (e.g., 50th, 90th and 99th percentiles) with finer error guarantees at higher ranks (e.g., errors of 5, 1 and 0.1 percent, respectively) is more useful than uniformly distributed quantiles (e.g., 25th, 50th and 75th percentiles) with uniform error guarantees. In this paper, we address the following two problems. First, can we compute quantiles with finer error guarantees for the higher ranks of the data distribution effectively, using less space and computation time than computing all quantiles uniformly at the finest error? Second, if specific quantiles and their error bounds are requested a priori, can the necessary space usage and computation time be reduced? We answer both questions in the affirmative by formalizing them as the “high-biased” quantiles and the “targeted” quantiles problems, respectively, and presenting algorithms with provable guarantees, that perform significantly better than previously known solutions for these problems. We implemented our algorithms in the Gigascope data stream management system, and evaluated alternate approaches for maintaining the relevant summary structures. Our experimental results on real and synthetic IP data streams complement our theoretical analyses, and highlight the importance of lightweight, non-blocking implementations when maintaining summary structures over high-speed data streams.
Implemented as a timer-histogram storage system in http://armon.github.io/statsite/ .]]>
statistics quantiles percentiles stream-processing skew papers histograms latency algorithmshttps://pinboard.in/https://pinboard.in/u:jm/b:54346b7d58f0/_Availability in Globally Distributed Storage Systems_ [pdf]2013-09-24T22:08:06+00:00
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36737.pdf
jmvia:kragen failure bigtable gfs statistics outages reliabilityhttps://pinboard.in/https://pinboard.in/u:jm/b:bb7d3593288e/Fat Tails2013-07-02T20:44:37+00:00
http://vudlab.com/fat-tails.html
jmA fat-tailed distribution looks normal but the parts far away from the average are thicker, meaning a higher chance of huge deviations. [...] Fat tails don't mean more variance; just different variance. For a given variance, a higher chance of extreme deviations implies a lower chance of medium ones.
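A quick numeric check of that last claim, as a sketch only. It compares a standard normal with a Laplace distribution scaled to the same variance of 1. The Laplace is used here as a convenient heavier-tailed stand-in (its tails are exponential rather than power-law), and the function names are made up for this example, not taken from the linked page.

```python
import math


def normal_tail(x):
    """P(|X| > x) for a normal distribution with mean 0 and variance 1."""
    return math.erfc(x / math.sqrt(2))


def laplace_tail(x):
    """P(|X| > x) for a Laplace distribution with mean 0 and variance 1."""
    # Laplace(scale=b) has variance 2*b**2, so unit variance needs b = 1/sqrt(2).
    return math.exp(-x * math.sqrt(2))


def band(tail, lo, hi):
    """P(lo < |X| < hi), computed from a two-sided tail function."""
    return tail(lo) - tail(hi)


# Extreme deviations (beyond 4 standard deviations): heavier tail wins.
print(normal_tail(4), laplace_tail(4))
# Medium deviations (between 1 and 2 standard deviations): normal wins.
print(band(normal_tail, 1, 2), band(laplace_tail, 1, 2))
```

Both distributions have the same variance, yet the heavier-tailed one puts more probability beyond 4 standard deviations and less in the 1-to-2 band, which is the trade-off the quote describes.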
]]>dataviz via:hn statistics visualization distributions fat-tailed kurtosis d3.js javascript variance deviationhttps://pinboard.in/https://pinboard.in/u:jm/b:ccd01496776d/Boundary's Early Warnings alarm2013-06-27T21:09:22+00:00
http://boundary.com/blog/2013/06/27/announcing-early-warnings/
jmnetwork-monitoring throughput boundary service-metrics alarming ops statisticshttps://pinboard.in/https://pinboard.in/u:jm/b:d20187298612/Not the ‘best in the world’ - The Medical Independent2013-04-18T13:50:37+00:00
http://www.medicalindependent.ie/20844/news
jm‘Our maternity services are amongst the best in the world’. This phrase has been much hackneyed since the heartbreaking death of Savita Halappanavar was revealed in mid October. James Reilly and other senior politicians are particularly guilty of citing this inaccurate position. So what is the state of Irish maternity services and how do our figures compare with other comparable countries? Let’s start with the statistics.
The bottom line:
Eight deaths per 100,000 is not bad, but it ranks our maternity services far from the best in the world and below countries such as Slovakia and Poland.
]]>pro-choice ireland savita medicine health maternity morbidity statisticshttps://pinboard.in/https://pinboard.in/u:jm/b:a517efcd5326/