Pinboard (jm)

https://pinboard.in/u:jm/public/ recent bookmarks from jm UK COVID vaccination modelling was dependent on a single Pythonista 2024-02-12T16:11:24+00:00 https://christinapagel.substack.com/p/where-are-we-with-covid-in-england jm excel python modelling statistics uk ukhsa qa covid-19 quality-control https://pinboard.in/ https://pinboard.in/u:jm/b:67997cd0272c/ Turning Poetry into Art: Joanne McNeil on Large Language Models and the Poetry of Allison Parrish | Filmmaker Magazine 2023-07-31T16:13:48+00:00 https://filmmakermagazine.com/121867-joanne-mcneil-large-language-models-allison-parrish/ jm Parrish has long thought of her work in conversation with Oulipo and other avant-garde movements, “using randomness to produce juxtapositions of concepts to make you think more deeply about the language that you’re using.” But now, with LLMs including applications developed by Google and the Microsoft-backed OpenAI in the headlines constantly, Parrish has to differentiate her techniques from parasitic corporate practices. “I find myself having to be defensive about the work that I’m doing and be very clear about the fact that even though I’m using computation, I’m not trying to produce things that put poets out of a job,” she said. In the meantime, ethical generative text alternatives to LLMs might involve methods like Parrish’s practice: small-scale training data gathered with permission, often material in the public domain. “Just because something’s in the public domain doesn’t necessarily mean that it’s ethical to use it, but it’s a good starting point,” Parrish told me. ... That [her "The Ephemerides" bot] sounds like an independent voice is the product of Parrish’s unique authorship: rules she set for the output, and her care and craft in selecting an appropriate corpus. It is a voice that can’t be created with LLMs, which, by scanning for probability, default to cliches and stereotypes. “They’re inherently conservative,” Parrish said. 
“They encode the past, literally. That’s what they’re doing with these data sets.” ai poetry ml statistics alison-parrish art poems generative-art text randomness https://pinboard.in/ https://pinboard.in/u:jm/b:7601debb0064/ Latest Long Covid estimates 2022-10-19T10:07:45+00:00 https://jamanetwork.com/journals/jama/fullarticle/2797443 jm A total of 1.2 million individuals who had symptomatic SARS-CoV-2 infection were included (mean age, 4-66 years; males, 26%-88%). In the modeled estimates, 6.2% (95% uncertainty interval [UI], 2.4%-13.3%) of individuals who had symptomatic SARS-CoV-2 infection experienced at least 1 of the 3 Long COVID symptom clusters in 2020 and 2021, including 3.2% (95% UI, 0.6%-10.0%) for persistent fatigue with bodily pain or mood swings, 3.7% (95% UI, 0.9%-9.6%) for ongoing respiratory problems, and 2.2% (95% UI, 0.3%-7.6%) for cognitive problems after adjusting for health status before COVID-19, comprising an estimated 51.0% (95% UI, 16.9%-92.4%), 60.4% (95% UI, 18.9%-89.1%), and 35.4% (95% UI, 9.4%-75.1%), respectively, of Long COVID cases. The Long COVID symptom clusters were more common in women aged 20 years or older (10.6% [95% UI, 4.3%-22.2%]) 3 months after symptomatic SARS-CoV-2 infection than in men aged 20 years or older (5.4% [95% UI, 2.2%-11.7%]). Both sexes younger than 20 years of age were estimated to be affected in 2.8% (95% UI, 0.9%-7.0%) of symptomatic SARS-CoV-2 infections. The estimated mean Long COVID symptom cluster duration was 9.0 months (95% UI, 7.0-12.0 months) among hospitalized individuals and 4.0 months (95% UI, 3.6-4.6 months) among nonhospitalized individuals. Among individuals with Long COVID symptoms 3 months after symptomatic SARS-CoV-2 infection, an estimated 15.1% (95% UI, 10.3%-21.1%) continued to experience symptoms at 12 months. 
long-covid statistics disease covid-19 papers jama disability https://pinboard.in/ https://pinboard.in/u:jm/b:cf65bc10f43a/ The model used to simulate the Irish COVID-19 response 2021-07-01T08:39:12+00:00 https://twitter.com/President_MU/status/1410315246791802884 jm modelling data-science statistics epidemiology pandemics covid-19 sars-cov-2 ireland philip-nolan seir https://pinboard.in/ https://pinboard.in/u:jm/b:21e859682f26/ want to ace an AI-based interview? add a bookshelf in the background 2021-02-22T10:59:18+00:00 https://twitter.com/hatr/status/1361756449802768387 jm correlation clever-hans funny ai ml interviewing statistics phrenology https://pinboard.in/ https://pinboard.in/u:jm/b:275724d41508/ Prio 2020-09-24T11:23:08+00:00 https://crypto.stanford.edu/prio/paper.pdf jm nizk zero-knowledge snark prio crypto privacy data-privacy statistics quantiles percentiles aggregation https://pinboard.in/ https://pinboard.in/u:jm/b:2aa675a6dbfc/ Sweden has the smallest average household size in Europe 2020-09-21T10:14:59+00:00 https://twitter.com/AdamJKucharski/status/1307958852248272898 jm covid-19 sweden households europe statistics eu housing https://pinboard.in/ https://pinboard.in/u:jm/b:940ab75a2d35/ illustration of how a rise in SARS-CoV-2 positivity in younger groups can soon become a rise in older groups 2020-09-08T09:26:06+00:00 https://twitter.com/vincentglad/status/1303243869933404161/photo/1 jm testing covid-19 age epidemiology dataviz statistics marseilles france https://pinboard.in/ https://pinboard.in/u:jm/b:7ed9065e39e7/ A-Levels: The Model is not the Student 2020-08-17T11:19:06+00:00 http://thaines.com/post/alevels2020 jm ofqual schools marks exams a-levels uk estimation statistics maths fail https://pinboard.in/ https://pinboard.in/u:jm/b:f95b2089f16c/ Interpreting Covid-19 Test Results: A Bayesian Approach 2020-06-13T22:42:29+00:00 https://medium.com/@Bob_Wachter/interpreting-covid-19-test-results-a-bayesian-approach-df058dad2ade jm a brief 
tutorial on Covid-19 testing, with an emphasis on a Bayesian approach. After presenting the basics, we’ll walk through four confusing Covid-19 testing scenarios, just to give you a feel for the kinds of pickles we often find ourselves in. prevalence covid-19 bayes bayesian statistics testing https://pinboard.in/ https://pinboard.in/u:jm/b:419f48573a05/ on COVID-19 death rate statistics 2020-04-22T12:01:00+00:00 https://twitter.com/Care2much18/status/1252819591090155523 jm covid-19 statistics lies-damn-lies death-rates comorbidity diseases europe deaths https://pinboard.in/ https://pinboard.in/u:jm/b:a34779538059/ Coronavirus: Why You Must Act Now - Tomas Pueyo - Medium 2020-03-11T10:30:15+00:00 https://medium.com/@tomaspueyo/coronavirus-act-today-or-people-will-die-f4d3d9cd99ca jm coronavirus covid19 healthcare epidemiology diseases statistics https://pinboard.in/ https://pinboard.in/u:jm/b:afc303b28ce7/ Want To Make Money? Build A Business On A Bike Lane 2019-11-25T11:30:25+00:00 https://www.fastcompany.com/90182112/want-to-make-money-build-a-business-on-a-bike-lane jm numbers statistics cycling bike-lanes shops https://pinboard.in/ https://pinboard.in/u:jm/b:4a8cc7b40f48/ electricityMap 2019-10-07T16:33:35+00:00 https://www.electricitymap.org/?page=country&solar=false&remote=true&wind=false&countryCode=IE jm electricity statistics graphs data energy climate renewables carbon co2 https://pinboard.in/ https://pinboard.in/u:jm/b:4a92fe5beab6/ Google release an open-source differential-privacy lib 2019-09-09T14:24:03+00:00 https://developers.googleblog.com/2019/09/enabling-developers-and-organizations.html jm Differentially-private data analysis is a principled approach that enables organizations to learn from the majority of their data while simultaneously ensuring that those results do not allow any individual's data to be distinguished or re-identified. This type of analysis can be implemented in a wide variety of ways and for many different purposes. 
For example, if you are a health researcher, you may want to compare the average amount of time patients remain admitted across various hospitals in order to determine if there are differences in care. Differential privacy is a high-assurance, analytic means of ensuring that use cases like this are addressed in a privacy-preserving manner. Currently, we provide algorithms to compute the following: count, sum, mean, variance, standard deviation, and order statistics (including min, max, and median). analytics google ml privacy differential-privacy aggregation statistics obfuscation approximation algorithms https://pinboard.in/ https://pinboard.in/u:jm/b:98439e468432/ France Bans Judge Analytics, 5 Years In Prison For Rule Breakers 2019-06-05T10:56:24+00:00 https://www.artificiallawyer.com/2019/06/04/france-bans-judge-analytics-5-years-in-prison-for-rule-breakers/ jm ‘The identity data of magistrates and members of the judiciary cannot be reused with the purpose or effect of evaluating, analysing, comparing or predicting their actual or alleged professional practices.’ As far as Artificial Lawyer understands, this is the very first example of such a ban anywhere in the world. Insiders in France told Artificial Lawyer that the new law is a direct result of an earlier effort to make all case law easily accessible to the general public, which was seen at the time as improving access to justice and a big step forward for transparency in the justice sector. However, judges in France had not reckoned on NLP and machine learning companies taking the public data and using it to model how certain judges behave in relation to particular types of legal matter or argument, or how they compare to other judges. In short, they didn’t like how the pattern of their decisions – now relatively easy to model – were potentially open for all to see. 
censorship france analytics judgements legal judges statistics https://pinboard.in/ https://pinboard.in/u:jm/b:70c20892a220/ [1902.04023] Computing Extremely Accurate Quantiles Using t-Digests 2019-02-18T11:05:52+00:00 https://arxiv.org/abs/1902.04023 jm java go python open-source quantiles percentiles approximation statistics sketching algorithms via:fanf https://pinboard.in/ https://pinboard.in/u:jm/b:6c84ec8a0947/ _AI Ethics, Impossibility Theorems and Tradeoffs_ 2019-01-28T16:14:07+00:00 https://www.chrisstucchio.com/pubs/slides/crunchconf_2018/slides.pdf jm discrimination ethics racism race ai statistics compas machine-learning https://pinboard.in/ https://pinboard.in/u:jm/b:bce0c1fb31f5/ Some facts on immigration to Ireland 2019-01-15T22:15:29+00:00 https://notesonthefront.typepad.com/politicaleconomy/2019/01/the-far-rights-problem-with-immigration-facts.html?fbclid=IwAR2MJkON4vAnWTuPsgFdop61--X0bndsmQd2SZz6ZIo8Jw2uWsGETGeP4Xc jm Let’s summarise: Ireland has a relatively high level of non-citizens in its population. But this is down to the high level of UK citizens and citizens from other English-speaking countries (US, Canada, Australia and New Zealand). Ireland has significantly fewer non-citizens from outside the English-speaking world than high-income EU countries. The proportion of non-citizens has remained stable over the last 10 years (i.e. there is no ‘surge’). Non-citizens in Ireland are more integrated into the labour market than any other high-income EU country – that is, there is lower unemployment among non-citizens. So much for the ‘sponging-off-the-state’ argument. We have had far fewer asylum-seekers and we grant asylum to far fewer than most other high-income EU countries. The claims of the Far Right and their allies collapse when we look to reality. 
immigration facts statistics ireland asylum-seekers https://pinboard.in/ https://pinboard.in/u:jm/b:8657155c24f0/ Surprisingly Little Evidence for the Accepted Wisdom About Teeth - The New York Times 2018-09-14T13:55:42+00:00 https://www.nytimes.com/2016/08/30/upshot/surprisingly-little-evidence-for-the-usual-wisdom-about-teeth.html jm A systematic review in 2011 concluded that, in adults, toothbrushing with flossing versus toothbrushing alone most likely reduced gingivitis, or inflammation of the gums. But there was really weak evidence that it reduced plaque in the short term. There was no evidence that it reduced cavities. That’s pretty much what we learned recently. teeth dentistry dental health medicine statistics science https://pinboard.in/ https://pinboard.in/u:jm/b:79ed3cc85695/ ICE's Risk Classification Assessment turned into a digital rubber stamp 2018-06-26T22:39:40+00:00 https://www.reuters.com/investigates/special-report/usa-immigration-court/ jm To conform to Trump’s policies, Reuters has learned, ICE modified a tool officers have been using since 2013 when deciding whether an immigrant should be detained or released on bond. The computer-based Risk Classification Assessment uses statistics to determine an immigrant’s flight risk and danger to society. Previously, the tool automatically recommended either “detain” or “release.” Last year, ICE spokesman Bourke said, the agency removed the “release” recommendation. 
More: https://motherboard.vice.com/en_us/article/evk3kw/ice-modified-its-risk-assessment-software-so-it-automatically-recommends-detention immigration statistics machine-learning rubber-stamping fake-algorithms whitewashing ice us-politics https://pinboard.in/ https://pinboard.in/u:jm/b:52d02fbdf434/ A Closer Look at Experian Big Data and Artificial Intelligence in Durham Police 2018-04-09T10:27:17+00:00 https://bigbrotherwatch.org.uk/2018/04/a-closer-look-at-experian-big-data-and-artificial-intelligence-in-durham-police/ jm experian marketing credit-score data policing uk durham ai statistics crime hart https://pinboard.in/ https://pinboard.in/u:jm/b:888686c06181/ Random with care 2018-01-05T22:14:12+00:00 https://eev.ee/blog/2018/01/02/random-with-care/ jm coding random math rngs prngs statistics distributions https://pinboard.in/ https://pinboard.in/u:jm/b:2f07c1bad73e/ The Impenetrable Program Transforming How Courts Treat DNA Evidence | WIRED 2017-11-30T11:45:46+00:00 https://www.wired.com/story/trueallele-software-transforming-how-courts-treat-dna-evidence/ jm law justice trueallele software dna evidence statistics probability code-review auditing https://pinboard.in/ https://pinboard.in/u:jm/b:8b677ee89f88/ Cycling to work: major new study suggests health benefits are staggering 2017-08-20T21:17:27+00:00 https://theconversation.com/cycling-to-work-major-new-study-suggests-health-benefits-are-staggering-76292#link_time=1501254014 jm We found that cycling to work was associated with a 41% lower risk of dying overall compared to commuting by car or public transport. Cycle commuters had a 52% lower risk of dying from heart disease and a 40% lower risk of dying from cancer. They also had 46% lower risk of developing heart disease and a 45% lower risk of developing cancer at all. 
cycling transport health medicine science commuting life statistics https://pinboard.in/ https://pinboard.in/u:jm/b:ce8e076e456b/ Physical separation of cyclists from traffic “crucial” to dropping injury rates, shows U.S. study 2017-05-13T21:01:51+00:00 https://cyclingindustry.news/physical-separation-of-cyclists-from-traffic-crucial-to-dropping-injury-rates-shows-u-s-study/ jm Citing a further study of differing types of cycling infrastructure in Canada, the editorial writes that an 89% increase in safety was noted on streets with physical separation over streets where no such infrastructure existed. Unprotected cycling space was found to be 53% safer. In 2014 there were 902 recorded cyclist fatalities in America and 35,206 serious injuries. Fatalities per 100 million kilometres cycled sat at 4.7. In the Netherlands and Denmark those rates sit at 1 and 1.1, respectively. cycling infrastructure roads safety accidents cars statistics us canada https://pinboard.in/ https://pinboard.in/u:jm/b:c9fca3759498/ The eigenvector of "Why we moved from language X to language Y" 2017-03-16T23:18:20+00:00 https://erikbern.com/2017/03/15/the-eigenvector-of-why-we-moved-from-language-x-to-language-y.html jm statistics programming languages golang go mysql coding https://pinboard.in/ https://pinboard.in/u:jm/b:bc481ec8d1b8/ tdunning/t-digest 2016-12-12T12:28:16+00:00 https://github.com/tdunning/t-digest jm A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel friendly making it useful in map-reduce and parallel streaming applications. The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a data structure that is related to the Q-digest. This t-digest data structure can be used to estimate quantiles or compute other rank statistics. 
The advantage of the t-digest over the Q-digest is that the t-digest can handle floating point values while the Q-digest is limited to integers. With small changes, the t-digest can handle any values from any ordered set that has something akin to a mean. The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by Q-digests in spite of the fact that t-digests are more compact when stored on disk. Super-nice feature is that it's mergeable, so amenable to parallel usage across multiple hosts if required. Java implementation, ASL licensing. data-structures algorithms java t-digest statistics quantiles percentiles aggregation digests estimation ranking https://pinboard.in/ https://pinboard.in/u:jm/b:aaf9fb613f21/ The Fall of BIG DATA – arg min blog 2016-11-14T22:01:12+00:00 http://www.argmin.net/2016/11/14/fall-of-big-data/ jm Our community has developed remarkably effective tools to microtarget advertisements. But if you use ad models to deliver news, that’s propaganda. And just because we didn’t intend to spread rampant misinformation doesn’t mean we are not responsible. 
big-data analytics data-science statistics us-politics trump data science propaganda facebook silicon-valley https://pinboard.in/ https://pinboard.in/u:jm/b:1b56e66fcb3a/ How One 19-Year-Old Illinois Man Is Distorting National Polling Averages - The New York Times 2016-10-13T10:57:27+00:00 http://www.nytimes.com/2016/10/13/upshot/how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.html?_r=0 jm statistics nytimes politics via:reddit donald-trump hilary-clinton polling panels polls https://pinboard.in/ https://pinboard.in/u:jm/b:dd1769070f5d/ MRI software bugs could upend years of research - The Register 2016-07-05T10:40:25+00:00 http://www.theregister.co.uk/2016/07/03/mri_software_bugs_could_upend_years_of_research/?mt=1467666616578 jm In their paper at PNAS, they write: “the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.” For example, a bug that's been sitting in a package called 3dClustSim for 15 years, fixed in May 2015, produced bad results (3dClustSim is part of the AFNI suite; the others are SPM and FSL). That's not a gentle nudge that some results might be overstated: it's more like making a bonfire of thousands of scientific papers. Further: “Our results suggest that the principal cause of the invalid cluster inferences is spatial autocorrelation functions that do not follow the assumed Gaussian shape”. The researchers used published fMRI results, and along the way they swipe the fMRI community for their “lamentable archiving and data-sharing practices” that prevent most of the discipline's body of work being re-analysed. 
fmri science mri statistics cluster-inference autocorrelation data papers medicine false-positives fps neuroimaging https://pinboard.in/ https://pinboard.in/u:jm/b:44448561ba5c/ You CAN Average Percentiles 2016-07-05T10:18:14+00:00 http://rpubs.com/jrauser/percentiles jm statistics percentiles quantiles john-rauser histograms averaging mean p99 https://pinboard.in/ https://pinboard.in/u:jm/b:e2f019aeecad/ Differential Privacy 2016-06-15T10:48:04+00:00 http://blog.cryptographyengineering.com/2016/06/what-is-differential-privacy.html jm apple privacy anonymization google rappor algorithms sampling populations statistics differential-privacy https://pinboard.in/ https://pinboard.in/u:jm/b:664a8dab51a2/ The NSA’s SKYNET program may be killing thousands of innocent people 2016-02-16T14:55:33+00:00 http://arstechnica.co.uk/security/2016/02/the-nsas-skynet-program-may-be-killing-thousands-of-innocent-people/ jm The NSA evaluates the SKYNET program using a subset of 100,000 randomly selected people (identified by their MSIDN/MSI pairs of their mobile phones), and a known group of seven terrorists. The NSA then trained the learning algorithm by feeding it six of the terrorists and tasking SKYNET to find the seventh. This data provides the percentages for false positives in the slide above. "First, there are very few 'known terrorists' to use to train and test the model," Ball said. "If they are using the same records to train the model as they are using to test the model, their assessment of the fit is completely bullshit. The usual practice is to hold some of the data out of the training process so that the test includes records the model has never seen before. Without this step, their classification fit assessment is ridiculously optimistic." The reason is that the 100,000 citizens were selected at random, while the seven terrorists are from a known cluster. 
Under the random selection of a tiny subset of less than 0.1 percent of the total population, the density of the social graph of the citizens is massively reduced, while the "terrorist" cluster remains strongly interconnected. Scientifically-sound statistical analysis would have required the NSA to mix the terrorists into the population set before random selection of a subset, but this is not practical due to their tiny number. This may sound like a mere academic problem, but, Ball said, is in fact highly damaging to the quality of the results, and thus ultimately to the accuracy of the classification and assassination of people as "terrorists." A quality evaluation is especially important in this case, as the random forest method is known to overfit its training sets, producing results that are overly optimistic. The NSA's analysis thus does not provide a good indicator of the quality of the method. terrorism surveillance nsa security ai machine-learning random-forests horror false-positives classification statistics https://pinboard.in/ https://pinboard.in/u:jm/b:4442c0f23ed8/ The general birthday problem 2016-02-01T11:03:25+00:00 http://www.johndcook.com/blog/2016/01/30/general-birthday-problem/ jm hashing hashes collisions birthday-problem birthday-paradox coding probability statistics https://pinboard.in/ https://pinboard.in/u:jm/b:5e19813a6fb5/ The Guinness Brewer Who Revolutionized Statistics 2016-01-04T12:37:26+00:00 http://priceonomics.com/the-guinness-brewer-who-revolutionized-statistics/ jm Upon completing his work on the t-distribution, Gosset was eager to make his work public. It was an important finding, and one he wanted to share with the wider world. The managers of Guinness were not so keen on this. They realized they had an advantage over the competition by using this method, and were not excited about relinquishing that leg up. If Gosset were to publish the paper, other breweries would be on to them. So they came to a compromise. 
Guinness agreed to allow Gosset to publish the finding, as long as he used a pseudonym. This way, competitors would not be able to realize that someone on Guinness’s payroll was doing such research, and figure out that the company’s scientifically enlightened approach was key to their success. statistics william-gosset history guinness brewing t-test pseudonyms dublin https://pinboard.in/ https://pinboard.in/u:jm/b:2209fbbfcbd5/ Placebo effects are weak: regression to the mean is the main reason ineffective treatments appear to work 2015-12-16T14:17:36+00:00 http://www.dcscience.net/2015/12/11/placebo-effects-are-weak-regression-to-the-mean-is-the-main-reason-ineffective-treatments-appear-to-work/ jm “Statistical regression to the mean predicts that patients selected for abnormalcy will, on the average, tend to improve. We argue that most improvements attributed to the placebo effect are actually instances of statistical regression.” medicine science statistics placebo evidence via:hn regression-to-the-mean https://pinboard.in/ https://pinboard.in/u:jm/b:71de2897190c/ Very Fast Reservoir Sampling 2015-12-15T12:03:18+00:00 http://erikerlandson.github.io/blog/2015/11/20/very-fast-reservoir-sampling/ jm statistics reservoir-sampling sampling algorithms poisson bernoulli performance https://pinboard.in/ https://pinboard.in/u:jm/b:c4fe345c5f6b/ Why Percentiles Don’t Work the Way you Think 2015-12-09T11:26:47+00:00 https://www.vividcortex.com/blog/why-percentiles-dont-work-the-way-you-think jm performance percentiles quantiles statistics metrics monitoring baron-schwartz vividcortex https://pinboard.in/ https://pinboard.in/u:jm/b:c441c328a979/ The reusable holdout: Preserving validity in adaptive data analysis 2015-08-18T13:21:04+00:00 http://googleresearch.blogspot.ie/2015/08/the-reusable-holdout-preserving.html jm statistics google reusable-holdout training ml machine-learning data-analysis holdout corpus sampling https://pinboard.in/ 
https://pinboard.in/u:jm/b:0b87ef283056/ Dublin Bike Theft Survey Results 2015-05-08T08:56:17+00:00 http://www.dublincycling.com/cycling/bike-theft-survey-results jm dublin bikes cycling theft crime statistics infographics dcc https://pinboard.in/ https://pinboard.in/u:jm/b:d3eead7d37e0/ Ask the Decoder: Did I sign up for a global sleep study? 2015-03-09T17:34:21+00:00 http://america.aljazeera.com/articles/2014/10/29/sleep-study.html jm How meaningful is this corporate data science, anyway? Given the tech-savvy people in the Bay Area, Jawbone likely had a very dense sample of Jawbone wearers to draw from for its Napa earthquake analysis. That allowed it to look at proximity to the epicenter of the earthquake from location information. Jawbone boasts its sample population of roughly “1 million Up wearers who track their sleep using Up by Jawbone.” But when looking into patterns county by county in the U.S., Jawbone states, it takes certain statistical liberties to show granularity while accounting for places where there may not be many Jawbone users. So while Jawbone data can show us interesting things about sleep patterns across a very large population, we have to remember how selective that population is. Jawbone wearers are people who can afford a $129 wearable fitness gadget and the smartphone or computer to interact with the output from the device. Jawbone is sharing what it learns with the public, but think of all the public health interests or other third parties that might be interested in other research questions from a large scale data set. Yet this data is not collected with scientific processes and controls and is not treated with the rigor and scrutiny that a scientific study requires. Jawbone and other fitness trackers don’t give us the option to use their devices while opting out of contributing to the anonymous data sets they publish. Maybe that ought to change. 
jawbone privacy data-protection anonymization aggregation data medicine health earthquakes statistics iot wearables https://pinboard.in/ https://pinboard.in/u:jm/b:b2cff21e8284/ HdrHistogram: A better latency capture method 2015-02-16T11:27:54+00:00 http://psy-lob-saw.blogspot.ie/2015/02/hdrhistogram-better-latency-capture.html jm hdrhistogram hdr histograms statistics latency measurement metrics percentiles quantiles gil-tene nitsan-wakart https://pinboard.in/ https://pinboard.in/u:jm/b:fe1e9f2ecc3d/ scumbag data scientist memes 2015-01-31T11:10:18+00:00 http://www.quickmeme.com/scumbag-data-scientist jm funny data-science statistics machine-learning hadoop bayes memes image-macros https://pinboard.in/ https://pinboard.in/u:jm/b:c2158da0bb75/ Stop Playing Monopoly With Your Kids (And Play These Games Instead) | FiveThirtyEight 2015-01-26T16:16:07+00:00 http://fivethirtyeight.com/features/stop-playing-monopoly-with-your-kids-and-play-these-games-instead/ jm boardgames games kids children 538 statistics ratings https://pinboard.in/ https://pinboard.in/u:jm/b:72d6c2d53a07/ Schneier on Security: Why Data Mining Won't Stop Terror 2015-01-12T15:07:56+00:00 https://www.schneier.com/essays/archives/2005/03/why_data_mining_wont.html jm This unrealistically accurate system will generate 1 billion false alarms for every real terrorist plot it uncovers. Every day of every year, the police will have to investigate 27 million potential plots in order to find the one real terrorist plot per month. Raise that false-positive accuracy to an absurd 99.9999 percent and you're still chasing 2,750 false alarms per day -- but that will inevitably raise your false negatives, and you're going to miss some of those 10 real plots. 
Also, Ben Goldacre saying the same thing: http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/ internet scanning filtering specificity statistics data-mining terrorism law nsa gchq false-positives false-negatives https://pinboard.in/ https://pinboard.in/u:jm/b:40691f3d07b8/ Introducing practical and robust anomaly detection in a time series 2015-01-07T22:42:32+00:00 https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series jm Early detection of anomalies plays a key role in ensuring high-fidelity data is available to our own product teams and those of our data partners. This package helps us monitor spikes in user engagement on the platform surrounding holidays, major sporting events or during breaking news. Beyond surges in social engagement, exogenic factors – such as bots or spammers – may cause an anomaly in number of favorites or followers. The package can be used to find such bots or spam, as well as detect anomalies in system metrics after a new software release. We’re open-sourcing AnomalyDetection because we’d like the public community to evolve the package and learn from it as we have. statistics twitter r anomaly-detection outliers metrics time-series spikes holt-winters https://pinboard.in/ https://pinboard.in/u:jm/b:569f792516a5/ UncertML 2015-01-07T00:17:32+00:00 http://www.uncertml.org/ jm a conceptual model, with accompanying XML schema, that may be used to quantify and exchange complex uncertainties in data. 
The interoperable model can be used to describe uncertainty in a variety of ways including: samples; statistics including mean, variance, standard deviation and quantile; and probability distributions including marginal and joint distributions and mixture models. via:conor uncertainty statistics xml formats https://pinboard.in/ https://pinboard.in/u:jm/b:d25e81904ab8/ 'Uncertain<T>: A First-Order Type for Uncertain Data' [paper, PDF] 2014-12-28T23:20:32+00:00 http://www.cs.utexas.edu/users/mckinley/papers/uncertainty-asplos-2014.pdf jm '[...] Uncertain<T>, a new programming language abstraction for uncertain data. We implement a Bayesian network semantics for computation and conditionals that improves program correctness. The runtime uses sampling and hypothesis tests to evaluate computation and conditionals lazily and efficiently. We illustrate with sensor and machine learning applications that Uncertain<T> improves expressiveness and accuracy.' (via Tony Finch) uncertainty estimation types strong-typing coding probability statistics machine-learning sampling via:fanf https://pinboard.in/ https://pinboard.in/u:jm/b:4e691997eba0/ "The Programming Language Wars - Questions And Responsibilities for the Programming Language Community" 2014-12-04T11:16:06+00:00 http://www.codemesh.io/static/upload/media/141562653162935languagewars.pdf jm statistics data coding languages static-typing dynamic https://pinboard.in/ https://pinboard.in/u:jm/b:87eae3c5fd2f/ Life expectancy increases are due mainly to healthier children, not longer old age 2014-11-11T10:35:33+00:00 http://www.ssa.gov/history/lifeexpect.html jm via:fplogue statistics taxes life-expectancy pensions infant-mortality health 1930s https://pinboard.in/ https://pinboard.in/u:jm/b:ee24e0e95004/ FelixGV/tehuti 2014-10-09T10:53:00+00:00 https://github.com/FelixGV/tehuti jm asl2 apache open-source tehuti metrics percentiles quantiles statistics measurement latency kafka voldemort linkedin https://pinboard.in/ 
https://pinboard.in/u:jm/b:a2f55ebce7bb/ Tehuti 2014-10-08T09:45:50+00:00 https://groups.google.com/forum/#!msg/project-voldemort/Y52UyHQ8tBA/9Ei79_RvS3EJ jm kafka metrics dropwizard java scala jvm timers ewma statistics measurement latency sampling tehuti voldemort linkedin jay-kreps https://pinboard.in/ https://pinboard.in/u:jm/b:b56664c1a098/ tinystat - GoDoc 2014-09-21T22:20:09+00:00 http://godoc.org/github.com/codahale/tinystat/cmd/tinystat jmtinystat is used to compare two or more sets of measurements (e.g., multiple runs of benchmarks of two possible implementations) and determine if they are statistically different, using Student's t-test. It's inspired largely by FreeBSD's ministat (written by Poul-Henning Kamp). ]]> t-test student statistics go coda-hale tinystat stats tools command-line unix https://pinboard.in/ https://pinboard.in/u:jm/b:4f23dc3f0640/ CausalImpact: A new open-source package for estimating causal effects in time series 2014-09-15T10:50:48+00:00 http://google-opensource.blogspot.ie/2014/09/causalimpact-new-open-source-package.html jmHow can we measure the number of additional clicks or sales that an AdWords campaign generated? How can we estimate the impact of a new feature on app downloads? How do we compare the effectiveness of publicity across countries? In principle, all of these questions can be answered through causal inference. In practice, estimating a causal effect accurately is hard, especially when a randomised experiment is not available. One approach we've been developing at Google is based on Bayesian structural time-series models. We use these models to construct a synthetic control — what would have happened to our outcome metric in the absence of the intervention. This approach makes it possible to estimate the causal effect that can be attributed to the intervention, as well as its evolution over time. We've been testing and applying structural time-series models for some time at Google. 
For example, we've used them to better understand the effectiveness of advertising campaigns and work out their return on investment. We've also applied the models to settings where a randomised experiment was available, to check how similar our effect estimates would have been without an experimental control. Today, we're excited to announce the release of CausalImpact, an open-source R package that makes causal analyses simple and fast. With its release, all of our advertisers and users will be able to use the same powerful methods for estimating causal effects that we've been using ourselves. Our main motivation behind creating the package has been to find a better way of measuring the impact of ad campaigns on outcomes. However, the CausalImpact package could be used for many other applications involving causal inference. Examples include problems found in economics, epidemiology, or the political and social sciences. ]]> causal-inference r google time-series models bayes adwords advertising statistics estimation metrics https://pinboard.in/ https://pinboard.in/u:jm/b:a62b2f300071/ Punished for Being Poor: Big Data in the Justice System 2014-08-19T13:20:55+00:00 http://www.psmag.com/navigation/politics-and-law/punished-poor-problem-using-big-data-justice-system-88651/ jmCurrently, over 20 states use data-crunching risk-assessment programs for sentencing decisions, usually consisting of proprietary software whose exact methods are unknown, to determine which individuals are most likely to re-offend. The Senate and House are also considering similar tools for federal sentencing. These data programs look at a variety of factors, many of them relatively static, like criminal and employment history, age, gender, education, finances, family background, and residence. Indiana, for example, uses the LSI-R, the legality of which was upheld by the state’s supreme court in 2010. 
Other states use a model called COMPAS, which uses many of the same variables as LSI-R and even includes high school grades. Others are currently considering the practice as a way to reduce the number of inmates and ensure public safety. (Many more states use or endorse similar assessments when sentencing sex offenders, and the programs have been used in parole hearings for years.) Even the American Law Institute has embraced the practice, adding it to the Model Penal Code, attesting to the tool’s legitimacy. (via stroan)]]> via:stroan statistics false-positives big-data law law-enforcement penal-code risk sentencing https://pinboard.in/ https://pinboard.in/u:jm/b:26eaf70354e0/ Monitoring Reactive Applications with Kamon 2014-05-19T10:41:38+00:00 http://kamon.io/presentations/javacro14/#/ jm metrics dropwizard hdrhistogram gil-tene kamon akka spray play reactive statistics java scala percentiles latency https://pinboard.in/ https://pinboard.in/u:jm/b:dd1798009833/ Daylight saving time linked to heart attacks, study finds 2014-03-31T08:41:24+00:00 http://www.irishtimes.com/news/health/daylight-saving-time-linked-to-heart-attacks-study-finds-1.1743441 jmSwitching over to daylight saving time, and losing one hour of sleep, raised the risk of having a heart attack the following Monday by 25 per cent, compared to other Mondays during the year, according to a new US study released today. [...] The study found that heart attack risk fell 21 per cent later in the year, on the Tuesday after the clock was returned to standard time, and people got an extra hour’s sleep. One clear answer: we need 25-hour days. More details: http://www.sciencedaily.com/releases/2014/03/140329175108.htm --

Researchers used Michigan's BMC2 database, which collects data from all non-federal hospitals across the state, to identify admissions for heart attacks requiring percutaneous coronary intervention from Jan. 1, 2010 through Sept. 15, 2013. A total of 42,060 hospital admissions occurring over 1,354 days were included in the analysis. Total daily admissions were adjusted for seasonal and weekday variation, as the rate of heart attacks peaks in the winter and is lowest in the summer and is also greater on Mondays and lower over the weekend. The hospitals included in this study admit an average of 32 patients having a heart attack on any given Monday. But on the Monday immediately after springing ahead there were on average an additional eight heart attacks. There was no difference in the total weekly number of percutaneous coronary interventions performed for either the fall or spring time changes compared to the weeks before and after the time change.
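As a rough check on the figures quoted above, the "25 per cent" rise follows directly from the baseline of 32 admissions on an average Monday and the 8 additional admissions on the Monday after the spring change. A minimal sketch; the variable names are illustrative and not taken from the study:

```python
# Figures quoted in the excerpt above; names are illustrative only.
baseline_monday_admissions = 32   # average heart-attack admissions on a Monday
additional_admissions = 8         # extra admissions on the Monday after springing ahead

# Relative increase over the baseline Monday.
relative_increase = additional_admissions / baseline_monday_admissions
print(f"Relative increase: {relative_increase:.0%}")
```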

]]> daylight dst daylight-savings time dates calendar science health heart-attacks michigan hospitals statistics https://pinboard.in/ https://pinboard.in/u:jm/b:1e3bf8d8c6b0/ Analyzing Citibike Usage 2014-03-18T14:51:29+00:00 http://abe.is/analyzing-citibike-usage/ jm data correlation statistics citibike cycling nyc data-science weather https://pinboard.in/ https://pinboard.in/u:jm/b:5a3ec788c5c3/ How the search for flight AF447 used Bayesian inference 2014-03-12T15:33:10+00:00 http://www.bea.aero/fr/enquetes/vol.af.447/metron.search.analysis.pdf jm metron bayes bayesian-inference machine-learning statistics via:jgc air-france disasters probability inference searching https://pinboard.in/ https://pinboard.in/u:jm/b:e7c127ca54da/ Sacked Google worker says staff ratings fixed to fit template 2014-03-12T10:59:00+00:00 http://www.irishtimes.com/news/ireland/irish-news/sacked-google-worker-says-staff-ratings-fixed-to-fit-template-1.1721176 jm stack-ranking google ireland employment work bell-curve statistics eric-schmidt https://pinboard.in/ https://pinboard.in/u:jm/b:49351fe530da/ "A data scientist is a ..." 
2014-02-01T21:01:57+00:00 https://twitter.com/jeremyjarvis/status/428848527226437632/photo/1 jm data-scientist statistics statistician funny jokes san-francisco tech monkigras https://pinboard.in/ https://pinboard.in/u:jm/b:2030d373aa62/ Nassim Taleb: retire Standard Deviation 2014-01-15T21:32:15+00:00 http://www.edge.org/response-detail/25401 jm statistics standard-deviation stddev maths nassim-taleb deviation volatility rmse distributions https://pinboard.in/ https://pinboard.in/u:jm/b:f9906de54f2e/ "The Top 6 Reasons This Infographic Is Just Wrong Enough To Sound Convincing" 2013-11-06T17:10:13+00:00 http://cf.broadsheet.ie/wp-content/uploads/2013/11/20131106.jpg jm diagrams infographics infoviz visualisation data fail statistics https://pinboard.in/ https://pinboard.in/u:jm/b:e8bf0f36332a/ Statsite 2013-11-01T16:58:31+00:00 http://armon.github.io/statsite/ jmStatsite is designed to be both highly performant, and very flexible. To achieve this, it implements the stats collection and aggregation in pure C, using libev to be extremely fast. This allows it to handle hundreds of connections, and millions of metrics. After each flush interval expires, statsite performs a fork/exec to start a new stream handler invoking a specified application. Statsite then streams the aggregated metrics over stdin to the application, which is free to handle the metrics as it sees fit. This allows statsite to aggregate metrics and then ship metrics to any number of sinks (Graphite, SQL databases, etc). There is an included Python script that ships metrics to graphite. 
]]> statsd graphite statsite performance statistics service-metrics metrics ops https://pinboard.in/ https://pinboard.in/u:jm/b:98f01ffaa9cc/ "Effective Computation of Biased Quantiles over Data Streams" [paper] 2013-11-01T16:57:06+00:00 http://www.cs.rutgers.edu/~muthu/bquant.pdf jm Skew is prevalent in many data sources such as IP traffic streams. To continually summarize the distribution of such data, a high-biased set of quantiles (e.g., 50th, 90th and 99th percentiles) with finer error guarantees at higher ranks (e.g., errors of 5, 1 and 0.1 percent, respectively) is more useful than uniformly distributed quantiles (e.g., 25th, 50th and 75th percentiles) with uniform error guarantees. In this paper, we address the following two problems. First, can we compute quantiles with finer error guarantees for the higher ranks of the data distribution effectively, using less space and computation time than computing all quantiles uniformly at the finest error? Second, if specific quantiles and their error bounds are requested a priori, can the necessary space usage and computation time be reduced? We answer both questions in the affirmative by formalizing them as the “high-biased” quantiles and the “targeted” quantiles problems, respectively, and presenting algorithms with provable guarantees, that perform significantly better than previously known solutions for these problems. We implemented our algorithms in the Gigascope data stream management system, and evaluated alternate approaches for maintaining the relevant summary structures. Our experimental results on real and synthetic IP data streams complement our theoretical analyses, and highlight the importance of lightweight, non-blocking implementations when maintaining summary structures over high-speed data streams.
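The example error figures in the abstract (5, 1 and 0.1 percent at the 50th, 90th and 99th percentiles) are consistent with an error allowance that shrinks in proportion to the distance from the top rank. The sketch below assumes an allowance of epsilon * (1 - phi) with epsilon = 0.1; both the formula and the function name `high_biased_error` are assumptions made for illustration, not the paper's stated invariant or any library's API.

```python
def high_biased_error(phi: float, epsilon: float = 0.1) -> float:
    """Allowed rank error, as a fraction of the stream length, at quantile phi.

    Assumption for illustration: the allowance is epsilon * (1 - phi), so it
    tightens as phi approaches 1 (the high ranks).
    """
    return epsilon * (1.0 - phi)


# The three percentiles used as the example in the abstract.
for phi in (0.50, 0.90, 0.99):
    print(f"p{phi * 100:.0f}: allowed error ~ {high_biased_error(phi):.2%}")
```

Under this assumption the 99th percentile is tracked fifty times more tightly than the median, which is the point of a "high-biased" summary for latency-style data.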

Implemented as a timer-histogram storage system in http://armon.github.io/statsite/ .]]> statistics quantiles percentiles stream-processing skew papers histograms latency algorithms https://pinboard.in/ https://pinboard.in/u:jm/b:54346b7d58f0/ _Availability in Globally Distributed Storage Systems_ [pdf] 2013-09-24T22:08:06+00:00 http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36737.pdf jm via:kragen failure bigtable gfs statistics outages reliability https://pinboard.in/ https://pinboard.in/u:jm/b:bb7d3593288e/ Fat Tails 2013-07-02T20:44:37+00:00 http://vudlab.com/fat-tails.html jmA fat-tailed distribution looks normal but the parts far away from the average are thicker, meaning a higher chance of huge deviations. [...] Fat tails don't mean more variance; just different variance. For a given variance, a higher chance of extreme deviations implies a lower chance of medium ones.
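The claim that, at equal variance, more extreme deviations must come at the expense of medium ones can be checked with closed-form tail probabilities. This is a minimal sketch that uses a Laplace distribution as the fat-tailed example; that choice, and the helper names, are assumptions for illustration and are not taken from the linked visualisation.

```python
import math


def normal_tail(k: float) -> float:
    """P(|X| > k) for a standard normal distribution (variance 1)."""
    return math.erfc(k / math.sqrt(2))


def laplace_tail(k: float) -> float:
    """P(|X| > k) for a zero-mean Laplace distribution scaled to variance 1."""
    scale = 1 / math.sqrt(2)  # Laplace variance is 2 * scale**2, so this gives variance 1
    return math.exp(-k / scale)


def band(tail, lo: float, hi: float) -> float:
    """P(lo < |X| <= hi), computed from a two-sided tail function."""
    return tail(lo) - tail(hi)


# Extreme deviations: beyond 4 standard deviations.
print("beyond 4 sd   normal:", normal_tail(4), " laplace:", laplace_tail(4))
# Medium deviations: between 1 and 2 standard deviations.
print("1 to 2 sd     normal:", band(normal_tail, 1, 2), " laplace:", band(laplace_tail, 1, 2))
```

Both distributions have variance 1, yet the Laplace puts more probability beyond 4 standard deviations and less in the 1 to 2 standard deviation band.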

]]> dataviz via:hn statistics visualization distributions fat-tailed kurtosis d3.js javascript variance deviation https://pinboard.in/ https://pinboard.in/u:jm/b:ccd01496776d/ Boundary's Early Warnings alarm 2013-06-27T21:09:22+00:00 http://boundary.com/blog/2013/06/27/announcing-early-warnings/ jm network-monitoring throughput boundary service-metrics alarming ops statistics https://pinboard.in/ https://pinboard.in/u:jm/b:d20187298612/ Not the ‘best in the world’ - The Medical Independent 2013-04-18T13:50:37+00:00 http://www.medicalindependent.ie/20844/news jm'Our maternity services are amongst the best in the world’. This phrase has been much hackneyed since the heartbreaking death of Savita Halappanavar was revealed in mid October. James Reilly and other senior politicians are particularly guilty of citing this inaccurate position. So what is the state of Irish maternity services and how do our figures compare with other comparable countries? Let’s start with the statistics. The bottom line:

Eight deaths per 100,000 is not bad, but it ranks our maternity services far from the best in the world and below countries such as Slovakia and Poland.

]]> pro-choice ireland savita medicine health maternity morbidity statistics https://pinboard.in/ https://pinboard.in/u:jm/b:a517efcd5326/