Pinboard (jm)
https://pinboard.in/u:jm/public/
recent bookmarks from jm

DuckDB as the New jq - Paul Gross’s Blog
2024-03-22T09:59:27+00:00
https://www.pgrs.net/2024/03/21/duckdb-as-the-new-jq/
jm
% duckdb -c \
"select license->>'key' as license, count(*) as count \
from 'repos.json' \
group by 1 \
order by count desc"
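For comparison, the same group-by can be sketched in plain Python. This is only an illustration, not something from the linked post: it assumes `repos.json` holds a JSON array of repository objects whose `license` field is either null or an object with a `key`, and the function names are hypothetical.

```python
import json
from collections import Counter


def count_licenses(repos):
    """Count repos per license key, highest count first.

    Assumption: each repo is a dict whose "license" value is either
    missing, None, or a dict containing a "key" entry.
    """
    counts = Counter(
        (repo.get("license") or {}).get("key")  # None when there is no license
        for repo in repos
    )
    return counts.most_common()


def count_licenses_in_file(path="repos.json"):
    """Load a JSON array of repos from disk and count licenses."""
    with open(path, encoding="utf-8") as f:
        return count_licenses(json.load(f))
```

Repos without a license are grouped under `None`, which roughly mirrors the NULL group the SQL query would produce.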
This is very cool. I need to start looking into using `duckdb` as a go-to CLI tool.
duckdb cli linux unix json csv data sql querying
https://pinboard.in/u:jm/b:cd7c4f54c76b/

Fairly Trained
2024-03-20T23:46:59+00:00
https://www.fairlytrained.org/about
jm
There is a divide emerging between two types of generative AI companies: those who get the consent of training data providers, and those who don’t, claiming they have no legal obligation to do so.
We believe there are many consumers and companies who would prefer to work with generative AI companies who train on data provided with the consent of its creators.
Fairly Trained exists to make it clear which companies take a more consent-based approach to training, and are therefore treating creators more fairly.
ai gen-ai training ml data consent
https://pinboard.in/u:jm/b:d9e249728bfe/

Pete Hunt's contrarian RDBMS tips
2023-12-18T15:11:17+00:00
https://twitter.com/floydophone/status/1708567151953743903
jm
1. It's often better to add tables than alter existing ones. This is especially true in a larger company. Making changes to core tables that other teams depend on is very risky and can be subject to many approvals. This reduces your team's agility a lot.
Instead, try adding a new table that is wholly owned by your team. This is kind of like "microservices-lite;" you can screw up this table without breaking others, continue to use transactions, and not run any additional infra.
(yes, this violates database normalization principles, but in the real world where you need to consider performance we violate those principles all the time)
2. Think in terms of indexes first. Every single time you write a query, you should first think: "which index should I use?" If no usable index exists, create it (or create a separate table with that index, see point 1). When writing the query, add a comment naming the index.
Before you commit any queries to the codebase, write a script to fill up your local development DB with 100k+ rows, and run EXPLAIN on your query. If it doesn't use that index, it's not ready to be committed. Baking this into an automated test would be better, but is hard to do.
3. Consider moving non-COUNT(*) aggregations out of the DB. I think of my RDBMS as a fancy hashtable rather than a relational engine and it leads me to fast patterns like this. Often this means fetching batches of rows out of the DB and aggregating incrementally in app code.
(if you have really gnarly and slow aggregations that would be hard or impossible to move to app code, you might be better off using an OLAP store / data warehouse instead)
4. Thinking in terms of "node" and "edge" tables can be useful. Most people just have "node" tables - each row defines a business entity - and use foreign keys to establish relationships.
Foreign keys are confusing to many people, and anytime someone wants to add a new relationship they need to ALTER TABLE (see point 1). Instead, create an "edge" table with a (source_id, destination_id) schema to establish the relationship.
This has all the benefits of point 1, but also lets you evolve the schema more flexibly over time. You can attach additional fields and indexing to the edge, and it makes migrating from 1-to-many to many-to-many relationships easier in the future (this happens all the time).
5. Usually every table needs "created_at" and/or "updated_at" columns. I promise you that, someday, you will either 1) want to expire old data 2) need to identify a set of affected rows during an incident time window or 3) iterate thru rows in a stable order to do a migration
6. Choosing how IDs are structured is super important. Never use autoincrement. Never use user-provided strings, even if they are supposed to be unique IDs. Always use at least 64 bits. Snowflake IDs (https://en.wikipedia.org/wiki/Snowflake_ID) or ULIDs (https://github.com/ulid/spec) are a great choice.
7. Comment your queries so debugging prod issues is easier. Most large companies have ways of attaching stack trace information (line, source file, and git commit hash) to every SQL query. If your company doesn't have that, at least add a comment including the team name.
Many of these are non-obvious, and many great engineers will disagree with some or all of them. And, of course, there are situations when you should not follow them. YMMV!
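Tips 2 to 4 can be tried out with a small sketch. It uses SQLite from the Python standard library as a stand-in for whatever RDBMS is actually in use, and `EXPLAIN QUERY PLAN` is SQLite's form of `EXPLAIN`. The `follows` edge table, its `weight` column and the `idx_follows_source` index are hypothetical names chosen for illustration; none of them come from the thread.

```python
import sqlite3


def build_demo_db(n_users=20_000, follows_per_user=5):
    """Create an in-memory DB with an 'edge' table and fill it with test rows.

    The defaults give 100k rows, in the spirit of tip 2.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE follows (              -- edge table (tip 4)
            source_id      INTEGER NOT NULL,
            destination_id INTEGER NOT NULL,
            weight         INTEGER NOT NULL, -- extra field attached to the edge
            created_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP  -- tip 5
        );
        CREATE INDEX idx_follows_source
            ON follows (source_id, destination_id);
    """)
    rows = [
        (u, (u + k) % n_users, k)
        for u in range(n_users)
        for k in range(1, follows_per_user + 1)
    ]
    conn.executemany(
        "INSERT INTO follows (source_id, destination_id, weight) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


# The trailing comment names the index the query is expected to use (tip 2).
QUERY = (
    "SELECT destination_id FROM follows "
    "WHERE source_id = ? /* index: idx_follows_source */"
)


def uses_index(conn, sql, params, index_name):
    """Return True if SQLite's query plan for `sql` mentions `index_name`."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is the human-readable detail string.
    return any(index_name in row[-1] for row in plan)


def total_weight_by_source(conn, batch_size=1000):
    """Aggregate in app code by fetching rows in batches (tip 3)."""
    totals = {}
    cur = conn.execute("SELECT source_id, weight FROM follows")
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        for source_id, weight in batch:
            totals[source_id] = totals.get(source_id, 0) + weight
    return totals
```

A check such as `uses_index(build_demo_db(), QUERY, (7,), "idx_follows_source")` could run in a test suite, which is one way of approximating the automated test that tip 2 describes as hard to do.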
Number 5 is absolutely, ALWAYS true, in my experience. And I love the idea of commenting queries... must follow more of these.
rdbms databases oltp data querying storage architecture
https://pinboard.in/u:jm/b:171bde124c55/

Anatomy of an AI System
2023-11-10T10:22:52+00:00
https://anatomyof.ai/
jm
At this moment in the 21st century, we see a new form of extractivism that is well underway: one that reaches into the furthest corners of the biosphere and the deepest layers of human cognitive and affective being. Many of the assumptions about human life made by machine learning systems are narrow, normative and laden with error. Yet they are inscribing and building those assumptions into a new world, and will increasingly play a role in how opportunities, wealth, and knowledge are distributed.
The stack that is required to interact with an Amazon Echo goes well beyond the multi-layered ‘technical stack’ of data modeling, hardware, servers and networks. The full stack reaches much further into capital, labor and nature, and demands an enormous amount of each. The true costs of these systems – social, environmental, economic, and political – remain hidden and may stay that way for some time.
ai amazon echo extractivism ml data future capitalism
https://pinboard.in/u:jm/b:92ea9880f3a0/

ESB HDF Reader
2023-10-04T16:42:26+00:00
https://github.com/dresdner353/energyutils/blob/main/ESB_HDF_READER.md
jm
formats json csv hdf esb power feed-in-tarriff ireland open-data data
https://pinboard.in/u:jm/b:4cd7358f7373/

eSIMs for data roaming in the US
2023-05-10T08:44:10+00:00
https://airalo.com/
jm
mobile esims data roaming data-roaming travel usa
https://pinboard.in/u:jm/b:4130c447b8c2/

OpenAI’s hunger for data is coming back to bite it
2023-04-20T09:56:55+00:00
https://www.technologyreview.com/2023/04/19/1071789/openais-hunger-for-data-is-coming-back-to-bite-it/?truid=8c8f2699f50eb3b9985a111121cfee47&mc_cid=8f246dd37f&mc_eid=eaf496ebe1
jm
The company could have saved itself a giant headache by building in robust data record-keeping from the start, she says. Instead, it is common in the AI industry to build data sets for AI models by scraping the web indiscriminately and then outsourcing the work of removing duplicates or irrelevant data points, filtering unwanted things, and fixing typos. These methods, and the sheer size of the data set, mean tech companies tend to have a very limited understanding of what has gone into training their models.
training data provenance ai ml common-crawl openai chatgpt data-protection privacy
https://pinboard.in/u:jm/b:8a1016bb2d53/

Timnit Gebru's anti-'AI pause'
2023-04-14T21:59:25+00:00
https://www.politico.com/newsletters/digital-future-daily/2023/04/11/timnit-gebrus-anti-ai-pause-00091450
jm
What is your appeal to policymakers? What would you want Congress and regulators to do now to address the concerns you outline in the open letter?
Congress needs to focus on regulating corporations and their practices, rather than playing into their hype of “powerful digital minds.” This, by design, ascribes agency to the products rather than the organizations building them. This language obfuscates the amount of data that is being collected — and the amount of worker exploitation involved with those who are labeling and supplying the datasets, and moderating model outputs.
Congress needs to ensure corporations are not using people’s data without their consent, and hold them responsible for the synthetic media they produce — whether it is text or media spewing disinformation, hate speech or other types of harmful content. Regulations need to put the onus on corporations, rather than understaffed agencies. There are probably existing regulations these organizations are breaking. There are mundane “AI” systems being used daily; we just heard about another Black man being wrongfully arrested because of the use of automated facial analysis systems. But that’s not what we’re talking about, because of the hype.
data privacy ai ml openai monopoly
https://pinboard.in/u:jm/b:c02b89d4a2b8/

AI Data Laundering: How Academic and Nonprofit Researchers Shield Tech Companies from Accountability
2022-10-03T16:33:50+00:00
https://waxy.org/2022/09/ai-data-laundering-how-academic-and-nonprofit-researchers-shield-tech-companies-from-accountability/
jm
Simon Willison created a Datasette browser to explore WebVid-10M, one of the two datasets used to train the video generation model, and quickly learned that all 10.7 million video clips were scraped from Shutterstock, watermarks and all.
In addition to the Shutterstock clips, Meta also used 10 million video clips from this 100M video dataset from Microsoft Research Asia. It’s not mentioned on their GitHub, but if you dig into the paper, you learn that every clip came from over 3 million YouTube videos.
So, in addition to a massive chunk of Shutterstock’s video collection, Meta is also using millions of YouTube videos collected by Microsoft to make its text-to-video AI.
ai data ethics fair-use copyright ip training
https://pinboard.in/u:jm/b:e5688ab18c6a/

Facebook Engineers Don’t Know Where They Keep Your Data
2022-09-08T10:56:35+00:00
https://theintercept.com/2022/09/07/facebook-personal-data-no-accountability/
jm
In the March 2022 hearing, Zarashaw and Steven Elia, a software engineering manager, described Facebook as a data-processing apparatus so complex that it defies understanding from within. The hearing amounted to two high-ranking engineers at one of the most powerful and resource-flush engineering outfits in history describing their product as an unknowable machine. [...]
The fundamental problem, according to the engineers in the hearing, is that [...] the company never bothered to cultivate institutional knowledge of how each of these component systems works, what they do, or who’s using them.
data engineering facebook meta privacy fail culture work
https://pinboard.in/u:jm/b:f425923a47a2/

Data Broker Is Selling Location Data of People Who Visit Abortion Clinics
2022-05-04T10:03:59+00:00
https://www.vice.com/en/article/m7vzjb/location-data-abortion-clinics-safegraph-planned-parenthood
jm
abortion capitalism data data-privacy privacy location safegraph
https://pinboard.in/u:jm/b:785d2d9cd559/

An Post & OSI company “GeoDirectory” uses Census data to profile every Irish home
2022-05-03T08:39:42+00:00
https://www.iccl.ie/news/iccl-data-complaint-an-post-osi-company-profiles-irish-peoples-income-and-homes-and-sells-data-to-data-brokers-and-insurance-companies/
jm
The Irish Council for Civil Liberties (ICCL) reveals that An Post & OSI use Census data to profile every Irish home and sell ‘location intelligence’ to data brokers and insurance companies. ICCL has lodged a complaint with the Data Protection Commission.
In particular, buyers include Experian, one of the world's biggest data brokers. There's no way this meets the spirit, if not the letter, of the GDPR; there's no data privacy here.
data-privacy data-protection iccl experian an-post osi census data privacy ireland
https://pinboard.in/u:jm/b:f47b892b723f/

'I think I discovered a military base in the middle of the ocean' -- Null Island, the most real of fictional places
2022-04-19T15:28:19+00:00
https://arxiv.org/abs/2204.08383
jm
This paper explores Null Island, a fictional place located at 0° latitude and 0° longitude in the WGS84 geographic coordinate system. Null Island is erroneously associated with large amounts of geographic data in a wide variety of location-based services, place databases, social media and web-based maps. While it was originally considered a joke within the geospatial community, this article will demonstrate implications of its existence, both technological and social in nature, promoting Null Island as a fundamental issue of geographic information that requires more widespread awareness. The article summarizes error sources that lead to data being associated with Null Island. We identify four evolutionary phases which help explain how this fictional place evolved and established itself as an entity reaching beyond the geospatial profession to the point of being discovered by the visual arts and the general population. After providing an accurate account of data that can be found at (0, 0), geospatial, technological and social implications of Null Island are discussed. Guidelines to avoid misplacing data to Null Island are provided. Since data will likely continue to appear at this location, our contribution is aimed at both GIScientists and the general population to promote awareness of this error source.
mapping gis null null-island geography misclassification data
https://pinboard.in/u:jm/b:611115fca228/

Forrest Fleischman on "trillion trees" projects
2021-10-03T21:25:47+00:00
https://twitter.com/ForrestFleisch1/status/1444008823350603780
jm
A project whose goal is to plant a certain number of trees is particularly vulnerable to failure because it's counting the wrong thing.
If the goal is to absorb emissions, we should count the carbon, not the trees. A few large trees absorb more carbon than a bunch of little trees.
When we plant trees with carbon uptake or forest restoration as a goal, we don't try to maximize the number of trees. We try to maximize long-term carbon uptake, and this might actually mean planting fewer trees up front.
forestry science data climate-change planting trees forests carbon-capture carbon
https://pinboard.in/u:jm/b:0d01467d826c/

Big tech relies on refugee labour
2021-09-30T09:19:57+00:00
https://restofworld.org/2021/refugees-machine-learning-big-tech/
jm
All of the largest companies in the world are today powered by a covert crowd of the system’s castoffs. Platforms have found amid those struggling to stay afloat in informal work — or else barely clinging onto a life in formal employment — a desperate mass to be tempted with the promise of a better life. Such a promise, however, is broken as soon as it is made; the petty services of the informal sector resemble little more than a blueprint for the microtasks of big tech, without offering anything in the way of rights, routine, role, security, or a future.
colonialism refugees ai data machine-learning amazon google tesla uber mechanical-turk
https://pinboard.in/u:jm/b:74e3b26de6c2/

CR2032 battery review
2021-09-30T08:37:14+00:00
https://tanasaro10.blogspot.com/2017/11/cr2032-batteries-review.html?m=1
jm
cr2032 batteries data power via:itc ikea
https://pinboard.in/u:jm/b:e85a1ba3316b/

The real story of the Afghan biometric databases abandoned to the Taliban
2021-08-30T13:19:51+00:00
https://www.technologyreview.com/2021/08/30/1033941/afghanistan-biometric-databases-us-military-40-data-points/
jm
APPS contains some half a million records about every member of the Afghan National Army and Afghan National Police, according to estimates by individuals familiar with the program. The data is collected “from the day they enlisted,” says one individual that worked on the system, and remains in the system forever, whether or not someone remains actively in service. Records could be updated, he added, but there was no deletion or data retention policy — not even in contingency situations, such as a Taliban takeover. [...]
It also contains details on the individual’s military specialty and career trajectory, as well as sensitive relational data such as the names of their father, uncles, and grandfathers, as well as the names of the two tribal elders per recruit that served as guarantors for their enlistment. This turns what was a simple digital catalog into something far more dangerous, according to Ranjit Singh, a postdoctoral scholar at the non-profit research group Data & Society who studies data infrastructures and public policy. He calls it a sort of “genealogy” of “community connections” that is “putting all of these people at risk.” [...]
The information is also of deep military value — whether for the Americans that helped construct it or for the Taliban, both of whom are “looking for networks” of their opponent’s supporters. [...]
"Give me a field that you think we will not collect, and I'll tell you you're wrong," said one of the individuals involved. [...]
Singh says the issue of what happens to data during conflicts or governmental collapse needs to be given more attention. “We don't take it seriously,” he says, “But we should, especially in these war-torn areas where information can be used to create a lot of havoc.”
Kak, the biometrics law researcher, suggests the best way to protect sensitive data may actually be that “these kinds of [data] infrastructures... weren't built in the first place.”
data data-protection data-privacy data-retention afghanistan taliban security fail biometrics
https://pinboard.in/u:jm/b:a105996b6722/

Hundreds of AI tools have been built to catch covid. None of them helped. | MIT Technology Review
2021-08-02T21:17:22+00:00
https://www.technologyreview.com/2021/07/30/1030329/machine-learning-ai-failed-covid-hospital-diagnosis-pandemic/
jm
via:doctorow covid-19 ai ml fail data health medicine statistcs
https://pinboard.in/u:jm/b:a882f38826ec/

a great demo of log-scale graphing
2021-07-07T09:44:37+00:00
https://twitter.com/BristOliver/status/1356150658337009666
jm
log-scale logs graphs charts graphing oliver-johnson dataviz data
https://pinboard.in/u:jm/b:c5d194e29ab4/

Trino on Ice IV: Deep Dive Into Iceberg Internals
2021-06-09T09:01:17+00:00
https://blog.starburst.io/trino-on-ice-iv-deep-dive-into-iceberg-internals
jm
trino iceberg data big-data data-lakes formats s3 avro orc
https://pinboard.in/u:jm/b:ccbdfd8f2e06/

Tabula
2021-05-17T13:54:11+00:00
https://tabula.technology/
jm
converter data pdf tools cli tabula tables csv extraction
https://pinboard.in/u:jm/b:d5b17972558b/

Home Assistant Data Science
2020-11-24T09:42:58+00:00
https://data.home-assistant.io/
jm
The Home Assistant Data Science portal is your one stop shop to get started exploring the data of your home. We will teach you about the data that Home Assistant tracks for you and we'll get you up and running with Jupyter Lab, a data science environment, to explore your own data.
docs data home-assistant iot data-science graphs o11y home han
https://pinboard.in/u:jm/b:8eb340cd93cb/

q - Text as Data
2020-10-21T09:56:28+00:00
http://harelba.github.io/q/
jm
csv database sql cli data tools unix tsv
https://pinboard.in/u:jm/b:bbf7da485984/

WebPlotDigitizer
2020-10-01T09:17:17+00:00
https://automeris.io/WebPlotDigitizer/
jm
It is often necessary to reverse engineer images of data visualizations to extract the underlying numerical data. WebPlotDigitizer is a semi-automated tool that makes this process extremely easy:
Works with a wide variety of charts (XY, bar, polar, ternary, maps etc.)
Automatic extraction algorithms make it easy to extract a large number of data points
Free to use, opensource and cross-platform (web and desktop)
Used in hundreds of published works by thousands of users
Also useful for measuring distances or angles between various features
data-extraction scraping tools data charts
https://pinboard.in/u:jm/b:923a692e4a95/

Apache Arrow
2020-07-30T22:05:19+00:00
https://www.dremio.com/apache-arrow-explained/
jm
Arrow combines the benefits of columnar data structures with in-memory computing. It provides the performance benefits of these modern techniques while also providing the flexibility of complex data and dynamic schemas. And it does all of this in an open source and standardized way.
(via Tony Finch)
via:fanf arrow data formats compression columnar-storage storage libraries
https://pinboard.in/u:jm/b:fa06288fd165/

AWS User Data is Being Stored, Used Outside User's Chosen Regions
2020-07-27T16:07:48+00:00
https://www.cbronline.com/news/aws-user-data
jm
[AWS] is using customers’ “AI content” for its own product development purposes. It also reserves the right in its small print to store this material outside the geographic regions that AWS customers have explicitly selected. It may also share this with AWS “affiliates” it says, without naming them.
via:corey-quinn aws amazon machine-learning corpora training data data-privacy data-protection
https://pinboard.in/u:jm/b:972c5d9fad5f/

FlexBuffers | Hacker News
2020-06-22T13:49:53+00:00
https://news.ycombinator.com/item?id=23588558
jm
flatbuffers flexbuffers json encoding data formats file-formats avro protobuf zerocopy sbe schemas
https://pinboard.in/u:jm/b:567d7f5724e6/

COVID-19 data researcher removed as Florida moves to re-open state
2020-05-19T09:05:23+00:00
https://eu.floridatoday.com/story/news/2020/05/18/censorship-covid-19-data-researcher-removed-florida-moves-re-open-state/5212398002/
jm
florida us-politics covid-19 data-science data dataviz
https://pinboard.in/u:jm/b:286be8543c39/

Ireland COVID19 GeoHive
2020-03-24T21:59:22+00:00
http://geohive.maps.arcgis.com/apps/opsdashboard/index.html#/a192b58ba6904c1494f651706c223520
jm
covid-19 dashboards arcgis geohive data ireland
https://pinboard.in/u:jm/b:4cec757d94c4/

The sustainable fashion conversation is based on bad statistics and misinformation - Vox
2020-02-10T11:45:11+00:00
https://www.vox.com/platform/amp/the-goods/2020/1/27/21080107/fashion-environment-facts-statistics-impact
jm
I pulled all of these statistics and other common "facts" from reputable sources. McKinsey. The United Nations. The Ellen MacArthur Foundation. The World Bank. International labor unions. Advocacy organizations. And these facts have been cited by publications like the Wall Street Journal and the New York Times.
Not all of these highly respected experts could be wrong. Could they?
It turns out they could. Because only one out of the dozen or so most commonly cited facts about the fashion industry’s huge footprint is based on any sort of science, data collection, or peer-reviewed research. The rest are based on gut feelings, broken links, marketing, and something someone said in 2003.
bad-data data facts factoids misinformation fashion fast-fashion climate-change
https://pinboard.in/u:jm/b:7f65d4c2f537/

Food types by CO2 footprint
2020-01-27T10:12:54+00:00
https://ourworldindata.org/food-choice-vs-eating-local
jm
For most foods – and particularly the largest emitters – most GHG emissions result from land use change (shown in green), and from processes at the farm stage (brown). Farm-stage emissions include processes such as the application of fertilizers – both organic (“manure management”) and synthetic; and enteric fermentation (the production of methane in the stomachs of cattle). Combined, land use and farm-stage emissions account for more than 80% of the footprint for most foods.
Transport is a small contributor to emissions. For most food products, it accounts for less than 10%, and it’s much smaller for the largest GHG emitters. In beef from beef herds, it’s 0.5%. Not just transport, but all processes in the supply chain after the food left the farm – processing, transport, retail and packaging – mostly account for a small share of emissions.
Excellent graph from Our World In Data. tl;dr: beef is massively damaging in terms of emissions, poultry is far less, then fish, then various kinds of veg are at the low end. It's shocking how much impact beef has.
co2 food data farming carbon emissions climate-change methane transport locavores
https://pinboard.in/u:jm/b:bbc987d21be4/

BurntSushi/xsv
2020-01-23T14:23:37+00:00
https://github.com/BurntSushi/xsv
jm
a command line program for indexing, slicing, analyzing, splitting and joining CSV files. Commands should be simple, fast and composable:
Simple tasks should be easy.
Performance trade offs should be exposed in the CLI interface.
Composition should not come at the expense of performance.
rust csv cli tools data xsv command-line unix
https://pinboard.in/u:jm/b:222359044ad8/

A Review of Netflix’s Metaflow
2020-01-22T12:14:32+00:00
https://medium.com/bigdatarepublic/a-review-of-netflixs-metaflow-65c6956e168d
jm
metaflow data-science data batch architecture
https://pinboard.in/u:jm/b:98bc54a96c89/

Modin: Speed up your Pandas workflows by changing a single line of code
2020-01-08T11:21:27+00:00
https://github.com/modin-project/modin
jm
The modin.pandas DataFrame is an extremely light-weight parallel DataFrame. Modin transparently distributes the data and computation so that all you need to do is continue using the pandas API as you were before installing Modin. Unlike other parallel DataFrame systems, Modin is an extremely light-weight, robust DataFrame. Because it is so light-weight, Modin provides speed-ups of up to 4x on a laptop with 4 physical cores.
We have focused heavily on bridging the solutions between DataFrames for small data (e.g. pandas) and large data. Often data scientists require different tools for doing the same thing on different sizes of data. The DataFrame solutions that exist for 1KB do not scale to 1TB+, and the overheads of the solutions for 1TB+ are too costly for datasets in the 1KB range. With Modin, because of its light-weight, robust, and scalable nature, you get a fast DataFrame at small and large data. With preliminary cluster and out of core support, Modin is a DataFrame library with great single-node performance and high scalability in a cluster.
data parallel python pandas dataframes modin data-science
https://pinboard.in/u:jm/b:305a6873731b/

City maps from tourists' feelings
2020-01-06T14:54:03+00:00
https://barregi.com/airbnbmaps
jm
The aim of this project is to map tourists’ perceptions of different urban areas through data retrieved from vacation rental platform Airbnb. After their stay, Airbnb guests score their feeling about the neighbourhood using a star-based rating system. The aggregated rating of each Airbnb listing is publicly accessible, and given the widespread expansion of this platform, a large amount of data is available for the most visited cities. When overlaid on a map of the city, the data reveals interesting geographic patterns and exposes subjective perceptions on safety, upkeep or convenience. -- Beñat Arregi
airbnb dataviz maps mapping via:nelson data tourism europe vacations holidays
https://pinboard.in/u:jm/b:fb03ad041c96/

simonw/datasette: A tool for exploring and publishing data
2019-12-16T11:07:54+00:00
https://github.com/simonw/datasette
jm
Datasette is a tool for exploring and publishing data. It helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API. Datasette is aimed at data journalists, museum curators, archivists, local governments and anyone else who has data that they wish to share with the world.
database api json python sqlite data exploring csv tsv
https://pinboard.in/u:jm/b:d1af06975552/

electricityMap
2019-10-07T16:33:35+00:00
https://www.electricitymap.org/?page=country&solar=false&remote=true&wind=false&countryCode=IE
jm
electricity statistics graphs data energy climate renewables carbon co2
https://pinboard.in/u:jm/b:4a92fe5beab6/

Daring Fireball: Siri, Privacy, and Trust
2019-08-19T09:59:24+00:00
https://daringfireball.net/2019/08/siri_privacy_trust
jm
My reading of this is that until last week, if you used Siri in any way, your recordings might be used in this “grading” process. If I graded Apple on the privacy and trust implications of this, I’d give them an F.
siri grading privacy data voice ml training fail apple
https://pinboard.in/u:jm/b:fd1502f5ac4a/

IBM’s photo-scraping scandal shows what a weird bubble AI researchers live in - MIT Technology Review
2019-08-14T12:27:25+00:00
https://www.technologyreview.com/f/613131/ibms-photo-scraping-scandal-shows-what-a-weird-bubble-ai-researchers-live-in/
jm
scraping data from publicly available sources is so much of an industry standard that it’s taught as a foundational skill (sans ethics) in most data science and machine-learning training.
[...] this story highlights the need for the tech industry to adapt its cultural norms and standard practices to keep pace with the rapid evolution of the technology itself, as well as the public’s awareness of how their data is used.
scraping privacy data ai big-data data-privacy flickr photos machine-learning
https://pinboard.in/u:jm/b:aed297db42c6/

CarbonKit
2019-08-06T12:04:00+00:00
https://www.carbonkit.net/about
jm
CarbonKit provides all the data and models necessary for calculating various greenhouse gas emissions in categories such as car, train and air transport, types of fuel or country-specific grid electricity, electrical appliances, agricultural and industrial processes and building materials.
carbon co2 emissions data ghgs
https://pinboard.in/u:jm/b:ab32fedc04ab/

Data isn't the new oil, it's the new CO2
2019-07-25T10:32:25+00:00
https://luminategroup.com/posts/blog/data-isnt-the-new-oil-its-the-new-co2
jm
We should not endlessly be defending arguments along the lines that “people choose to willingly give up their freedom in exchange for free stuff online”. The argument is flawed for two reasons.
First, the reason that is usually given: people have no choice but to consent in order to access the service, so consent is manufactured. We are not exercising choice in providing data but are rather resigned to the fact that we have no choice in the matter.
The second, less well known but just as powerful, argument is that we are not only bound by other people’s data; we are bound by other people’s consent. In an era of machine learning-driven group profiling, this effectively renders my denial of consent meaningless. Even if I withhold consent, say I refuse to use Facebook or Twitter or Amazon, the fact that everyone around me has joined means there are just as many data points about me to target and surveil. The issue is systemic, it is not one where a lone individual can make a choice and opt out of the system. We perpetuate this myth by talking about data as our own individual “oil”, ready to sell to the highest bidder. In reality I have little control over this supposed resource which acts more like an atmospheric pollutant, impacting me and others in myriads of indirect ways. There are more relations - direct and indirect - between data related to me, data about me, data inferred about me via others than I can possibly imagine, let alone control with the tools we have at our disposal today.
data ethics data-privacy privacy surveillance surveillance-capitalism co2 future profiling consent gdpr
https://pinboard.in/u:jm/b:d5371338436f/

ndjson
2019-04-25T08:14:24+00:00
https://github.com/ndjson/ndjson-spec
jm
json streaming unix pipes newlines formats interchange data standards
https://pinboard.in/u:jm/b:a80f59e1e13f/

Ireland Blocks The World on Data Privacy
2019-04-24T10:23:56+00:00
https://www.politico.eu/interactive/ireland-blocks-the-world-on-data-privacy/
jmLast May, Europe imposed new data privacy guidelines that carry the hopes of hundreds of millions of people around the world — including in the United States — to rein in abuses by big tech companies.
Almost a year later, it’s apparent that the new rules have a significant loophole: The designated lead regulator — the tiny nation of Ireland — has yet to bring an enforcement action against a big tech firm.
That’s not entirely surprising. Despite its vows to beef up its threadbare regulatory apparatus, Ireland has a long history of catering to the very companies it is supposed to oversee, having wooed top Silicon Valley firms to the Emerald Isle with promises of low taxes, open access to top officials, and help securing funds to build glittering new headquarters.
Now, data privacy experts and regulators in other countries are questioning Ireland’s commitment to policing imminent privacy concerns like Facebook’s reintroduction of facial recognition software and data-sharing with its recently purchased subsidiary WhatsApp, and Google’s sharing of information across its burgeoning number of platforms.
]]>ireland fail gdpr privacy data-protection data facebook eu regulationhttps://pinboard.in/https://pinboard.in/u:jm/b:41ea9d499685/Who’s using your face? The ugly truth about facial recognition2019-04-19T12:33:04+00:00
https://www.ft.com/content/cf19b956-60a2-11e9-b285-3acd5d43599e
jmIn order to feed this hungry system, a plethora of face repositories — such as IJB-C — have sprung up, containing images manually culled and bound together from sources as varied as university campuses, town squares, markets, cafés, mugshots and social-media sites such as Flickr, Instagram and YouTube.
To understand what these faces have been helping to build, the FT worked with Adam Harvey, the researcher who first spotted Jillian York’s face in IJB-C. An American based in Berlin, he has spent years amassing more than 300 face datasets and has identified some 5,000 academic papers that cite them.
The images, we found, are used to train and benchmark algorithms that serve a variety of biometric-related purposes — recognising faces at passport control, crowd surveillance, automated driving, robotics, even emotion analysis for advertising. They have been cited in papers by commercial companies including Facebook, Microsoft, Baidu, SenseTime and IBM, as well as by academics around the world, from Japan to the United Arab Emirates and Israel.
“We’ve seen facial recognition shifting in purpose,” says Dave Maass, a senior investigative researcher at the EFF, who was shocked to discover that his own colleagues’ faces were in the Iarpa database. “It was originally being used for identification purposes . . . Now somebody’s face is used as a tracking number to watch them as they move across locations on video, which is a huge shift. [Researchers] don’t have to pay people for consent, they don’t have to find models, no firm has to pay to collect it, everyone gets it for free.”
]]>data privacy face-recognition cameras creative-commons licensing flickr open-data google facebook surveillance instagram ijb-c research iarpahttps://pinboard.in/https://pinboard.in/u:jm/b:eccfc5d1cacb/_First M87 Event Horizon Telescope Results. III. Data Processing and Calibration_2019-04-12T12:20:06+00:00
https://iopscience-event-horizon.s3.amazonaws.com/article/10.3847/2041-8213/ab0c57/The_Event_Horizon_Telescope_Collaboration_2019_ApJL_875_L3.pdf
jmpapers data big-data telescopes eht black-holes astronomyhttps://pinboard.in/https://pinboard.in/u:jm/b:afbc6414cc87/'digital health will lead to forms of enslavement we can barely imagine'2019-02-25T11:22:10+00:00
https://www.independent.ie/life/health-wellbeing/modern-medicine-is-like-the-medieval-church-37749518.html
jmPerhaps most alarming of all is his analysis of the future of the world of digital health - "Anyone with a smartphone will be monitoring themselves, or - more likely - will be monitored by some external agency. Health and life insurance companies will offer financial inducements to people to be monitored, and big corporations will undoubtedly make the wearing of health-tracking devices mandatory. The danger of all of this is that in countries where health care is paid for by insurance, a new underclass of uninsured people will emerge. Digital health," he points out, "is presented as something empowering, but the reality is that it will lead to forms of enslavement that we can barely imagine. Facebook and Google have shown how easily people hand over their privacy and personal data in return for a few shiny trinkets. They have also shown how this personal data can be monetised."
]]>health medicine tracking privacy insurance surveillance datahttps://pinboard.in/https://pinboard.in/u:jm/b:69ae9874a376/Apache Iceberg (incubating)2019-01-14T23:22:27+00:00
https://iceberg.apache.org/
jmIceberg tracks individual data files in a table instead of directories. This allows writers to create data files in-place and only adds files to the table in an explicit commit.
Table state is maintained in metadata files. All changes to table state create a new metadata file and replace the old metadata with an atomic operation. The table metadata file tracks the table schema, partitioning config, other properties, and snapshots of the table contents.
The atomic transitions from one table metadata file to the next provide snapshot isolation. Readers use the latest table state (snapshot) that was current when they load the table metadata and are not affected by changes until they refresh and pick up a new metadata location.
excellent -- this will let me obsolete so much of our own code :)
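A minimal sketch of the atomic metadata swap described in the excerpt above, assuming a single in-process catalog. The names `TableMetadata`, `Catalog`, `load` and `commit` are hypothetical and are not Iceberg's API; the lock stands in for whatever atomic operation the real storage layer provides.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMetadata:
    """Immutable table state: a version plus the individual data files it tracks."""
    version: int
    data_files: tuple


class Catalog:
    """Hypothetical catalog holding one pointer to the current metadata."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = TableMetadata(version=0, data_files=())

    def load(self):
        # Readers keep whatever snapshot was current when they loaded it.
        return self._current

    def commit(self, base, new_files):
        """Compare-and-swap: succeed only if `base` is still the current metadata."""
        with self._lock:
            if self._current is not base:
                return False  # another writer committed first; caller must retry
            self._current = TableMetadata(
                version=base.version + 1,
                data_files=base.data_files + tuple(new_files),
            )
            return True


catalog = Catalog()
reader_view = catalog.load()                                  # reader pins version 0
ok = catalog.commit(catalog.load(), ["part-0001.parquet"])    # writer commits version 1
stale = catalog.commit(reader_view, ["part-0002.parquet"])    # based on old state, rejected
```

The reader's `reader_view` is unchanged by the commit until it calls `load()` again, which is the snapshot isolation the excerpt describes.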
]]>presto storage s3 hive iceberg apache asf data architecturehttps://pinboard.in/https://pinboard.in/u:jm/b:07b6a7ecf2f8/Deep learning can "discover" new knowledge from scans/images2018-11-19T11:53:02+00:00
https://www.nature.com/articles/s41551-018-0195-0
jmHere, we show that deep learning can extract new knowledge from retinal fundus images. Using deep-learning models trained on data from 284,335 patients and validated on two independent datasets of 12,026 and 999 patients, we predicted cardiovascular risk factors not previously thought to be present or quantifiable in retinal images, such as age (mean absolute error within 3.26 years), gender (area under the receiver operating characteristic curve (AUC) = 0.97), smoking status (AUC = 0.71), systolic blood pressure (mean absolute error within 11.23 mmHg) and major adverse cardiac events (AUC = 0.70). We also show that the trained deep-learning models used anatomical features, such as the optic disc or blood vessels, to generate each prediction.
]]>deep-learning data analysis ml machine-learning health medicine papershttps://pinboard.in/https://pinboard.in/u:jm/b:59bb846c56f1/How do you populate your development databases?2018-11-08T14:56:53+00:00
https://dev.to/jaredsilver/how-do-you-populate-your-development-databases-e8e
jmdatabase data testing system-tests devhttps://pinboard.in/https://pinboard.in/u:jm/b:286cf87eafef/A Closer Look at Experian Big Data and Artificial Intelligence in Durham Police2018-04-09T10:27:17+00:00
https://bigbrotherwatch.org.uk/2018/04/a-closer-look-at-experian-big-data-and-artificial-intelligence-in-durham-police/
jmexperian marketing credit-score data policing uk durham ai statistics crime harthttps://pinboard.in/https://pinboard.in/u:jm/b:888686c06181/tomnomnom/gron2018-04-04T15:54:38+00:00
https://github.com/tomnomnom/gron
jmjson gron grep cli tools data hacking golanghttps://pinboard.in/https://pinboard.in/u:jm/b:4140b2ac1c28/Strava app gives away location of secret US army bases2018-01-28T21:41:36+00:00
https://www.theguardian.com/world/2018/jan/28/fitness-tracking-app-gives-away-location-of-secret-us-army-bases?CMP=share_btn_tw
jmThe details were released by Strava in a data visualisation map that shows all the activity tracked by users of its app, which allows people to record their exercise and share it with others. The map, released in November 2017, shows every single activity ever uploaded to Strava – more than 3 trillion individual GPS data points, according to the company. The app can be used on various devices including smartphones and fitness trackers like Fitbit to see popular running routes in major cities, or spot individuals in more remote areas who have unusual exercise patterns.
]]>strava privacy fail army us-army datahttps://pinboard.in/https://pinboard.in/u:jm/b:840d939daa43/Google Maps’s Moat2017-12-20T22:50:43+00:00
https://www.justinobeirne.com/google-maps-moat
jmgoogle maps apple tom-tom data big-data ml mappinghttps://pinboard.in/https://pinboard.in/u:jm/b:0266cb43e089/Google Thinks I’m Dead - The New York Times2017-12-18T11:26:16+00:00
https://www.nytimes.com/2017/12/16/business/google-thinks-im-dead.html
jmgoogle data correctness bugs errors data-cleanliness accuracyhttps://pinboard.in/https://pinboard.in/u:jm/b:49de605fb96c/Handling GDPR: How to make Kafka Forget2017-12-05T23:05:40+00:00
http://www.benstopford.com/2017/12/04/handling-gdpr-make-kafka-forget/
jmHow do you delete (or redact) data from Kafka? The simplest way to remove messages from Kafka is to simply let them expire. By default Kafka will keep data for two weeks and you can tune this as required. There is also an Admin API that lets you delete messages explicitly if they are older than some specified time or offset. But what if we are keeping data in the log for a longer period of time, say for Event Sourcing use cases or as a source of truth? For this you can make use of Compacted Topics, which allow messages to be explicitly deleted or replaced by key.
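A rough sketch of what "deleted or replaced by key" means for a compacted topic, assuming the log is a plain in-memory list of (key, value) pairs and that a value of None plays the role of a tombstone. The `compact` function is a hypothetical illustration, not Kafka's API; real compaction runs in the background and keeps tombstones around for a configurable period before dropping them.

```python
def compact(log):
    """Keep only the latest record per key; drop keys whose latest value is None."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later records replace earlier ones
    survivors = [
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None  # a None value marks the key as deleted
    ]
    # Preserve the original log order of the surviving records.
    return [(key, value) for offset, key, value in sorted(survivors)]


log = [
    ("user-1", "alice@example.com"),
    ("user-2", "bob@example.com"),
    ("user-1", "alice@new.example.com"),  # replaces the earlier user-1 record
    ("user-2", None),                     # tombstone: asks for user-2 to be forgotten
]
compacted = compact(log)
```

After compaction only the newest `user-1` record remains, and nothing keyed by `user-2` survives.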
The same would apply to Kinesis, I would think.]]>kafka kinesis gdpr expiry deleting data privacyhttps://pinboard.in/https://pinboard.in/u:jm/b:7e063febc6e6/S3 Inventory Adds Apache ORC output format and Amazon Athena Integration2017-11-20T10:22:59+00:00
https://aws.amazon.com/about-aws/whats-new/2017/11/s3-inventory-adds-apache-orc-output-format-and-amazon-athena-integration/
jmorc formats data interchange s3 athena outputhttps://pinboard.in/https://pinboard.in/u:jm/b:3ffe967cccc0/A history of the neural net/tank legend in AI, and other examples of reward hacking2017-10-16T15:17:44+00:00
https://twitter.com/gwern/status/919922510073946112
jmgwern history ai machine-learning ml genetic-algorithms neural-networks perceptron learning training data reward-hackinghttps://pinboard.in/https://pinboard.in/u:jm/b:9be04c3b6b9b/Sickness absence associated with shared and open-plan offices--a national cross sectional questionnaire survey. - PubMed - NCBI2017-09-27T13:43:28+00:00
https://www.ncbi.nlm.nih.gov/pubmed/21528171
jmoccupants in open-plan offices (>6 persons) had 62% more days of sickness absence (RR 1.62, 95% CI 1.30-2.02).
]]>health office workplace data sickness open-plan work officeshttps://pinboard.in/https://pinboard.in/u:jm/b:ceef8d564d99/The data for the Irish theory driving test is stored in the US2017-08-28T10:22:48+00:00
https://pbs.twimg.com/media/DIGrjZXWsAEzCX2.jpg
jmprometric data privacy data-protection driving-test ireland theory-testhttps://pinboard.in/https://pinboard.in/u:jm/b:20c9efda5024/The Guardian view on patient data: we need a better approach | Editorial | Opinion | The Guardian2017-07-06T09:54:02+00:00
https://www.theguardian.com/commentisfree/2017/jul/05/the-guardian-view-on-patient-data-we-need-a-better-approach
jm
The use of privacy law to curb the tech giants in this instance, or of competition law in the case of the EU’s dispute with Google, both feel slightly maladapted. They do not address the real worry. It is not enough to say that the algorithms DeepMind develops will benefit patients and save lives. What matters is that they will belong to a private monopoly which developed them using public resources. If software promises to save lives on the scale that drugs now can, big data may be expected to behave as big pharma has done. We are still at the beginning of this revolution and small choices now may turn out to have gigantic consequences later. A long struggle will be needed to avoid a future of digital feudalism. Dame Elizabeth’s report is a welcome start.
Hear hear.
]]>privacy law uk nhs data google deepmind healthcare tech open-sourcehttps://pinboard.in/https://pinboard.in/u:jm/b:ac7e968296a3/GDPR Advisors and Consultants - Data Compliance Europe2017-05-31T15:20:42+00:00
http://www.datacomplianceeurope.eu/
jmOur consultancy helps our clients understand how EU privacy law applies to their organisations; delivers the practical and concrete steps needed to achieve legal compliance; and helps them manage their continuing obligations after GDPR comes into force. Our structured approach to GDPR provides a long-term data compliance framework to minimise the ongoing risk of potential fines for data protection breaches. Our continuing partnership provides regulator liaison, advisory consultancy, and external Data Protection Officer services.
]]>gdpr simon-mcgarr law privacy eu europe data-protection regulation datahttps://pinboard.in/https://pinboard.in/u:jm/b:a6348c94c2dc/GraphQL2017-05-26T12:05:55+00:00
http://graphql.org/
jma query language for APIs and a runtime for fulfilling those queries with your existing data. GraphQL provides a complete and understandable description of the data in your API, gives clients the power to ask for exactly what they need and nothing more, makes it easier to evolve APIs over time, and enables powerful developer tools.
Now being used by Facebook and GitHub -- looks quite interesting.]]>apis data github facebook graphql languages typeshttps://pinboard.in/https://pinboard.in/u:jm/b:815c00eb23ac/Seeking medical abortions online is safe and effective, study finds | World news | The Guardian2017-05-17T10:45:39+00:00
https://www.theguardian.com/world/2017/may/17/seeking-medical-abortions-online-is-safe-and-effective-study-finds
jmOf the 1,636 women who were sent the drugs between the start of 2010 and the end of 2012, the team were able to analyse self-reported data from 1,000 individuals who confirmed taking the pills. All were less than 10 weeks pregnant.
The results reveal that almost 95% of the women successfully ended their pregnancy without the need for surgical intervention. None of the women died, although seven women required a blood transfusion and 26 needed antibiotics.
Of the 93 women who experienced symptoms for which the advice was to seek medical attention, 95% did so, going to a hospital or clinic.
“When we talk about self-sought, self-induced abortion, people think about coat hangers or they think about tables in back alleys,” said Aiken. “But I think this research really shows that in 2017 self-sourced abortion is a network of people helping and supporting each other through what’s really a safe and effective process in the comfort of their own homes, and I think is a huge step forward in public health.”
]]>health medicine abortion pro-choice data women-on-web ireland law repealthe8thhttps://pinboard.in/https://pinboard.in/u:jm/b:c52b564461e0/The great British Brexit robbery: how our democracy was hijacked | Technology | The Guardian2017-05-08T11:13:29+00:00
https://www.theguardian.com/technology/2017/may/07/the-great-british-brexit-robbery-hijacked-democracy?CMP=share_btn_tw
jm
A map shown to the Observer showing the many places in the world where SCL and Cambridge Analytica have worked includes Russia, Lithuania, Latvia, Ukraine, Iran and Moldova. Multiple Cambridge Analytica sources have revealed other links to Russia, including trips to the country, meetings with executives from Russian state-owned companies, and references by SCL employees to working for Russian entities.
Article 50 has been triggered. AggregateIQ is outside British jurisdiction. The Electoral Commission is powerless. And another election, with these same rules, is just a month away. It is not that the authorities don’t know there is cause for concern. The Observer has learned that the Crown Prosecution Service did appoint a special prosecutor to assess whether there was a case for a criminal investigation into whether campaign finance laws were broken. The CPS referred it back to the electoral commission. Someone close to the intelligence select committee tells me that “work is being done” on potential Russian interference in the referendum.
Gavin Millar, a QC and expert in electoral law, described the situation as “highly disturbing”. He believes the only way to find the truth would be to hold a public inquiry. But a government would need to call it. A government that has just triggered an election specifically to shore up its power base. An election designed to set us into permanent alignment with Trump’s America. [....]
This isn’t about Remain or Leave. It goes far beyond party politics. It’s about the first step into a brave, new, increasingly undemocratic world.]]>elections brexit trump cambridge-analytica aggregateiq scary analytics data targeting scl ukip democracy grim-meathook-futurehttps://pinboard.in/https://pinboard.in/u:jm/b:b8b70fae2685/'Mathwashing,' Facebook and the zeitgeist of data worship2017-04-20T20:31:18+00:00
https://technical.ly/brooklyn/2016/06/08/fred-benenson-mathwashing-facebook-data-worship/
jmFred Benenson: Mathwashing can be thought of as using math terms (algorithm, model, etc.) to paper over a more subjective reality. For example, a lot of people believed Facebook was using an unbiased algorithm to determine its trending topics, even if Facebook had previously admitted that humans were involved in the process.
]]>maths math mathwashing data big-data algorithms machine-learning bias facebook fred-benensonhttps://pinboard.in/https://pinboard.in/u:jm/b:023846bef835/Tad2017-04-05T20:10:04+00:00
http://tadviewer.com/
jmdataviz osx csv data pivot-tables analysis desktophttps://pinboard.in/https://pinboard.in/u:jm/b:c94b3ec6b1d6/pachyderm2017-02-20T10:48:24+00:00
https://github.com/pachyderm/pachyderm
jmThere are two bold new ideas in Pachyderm:
Containers as the core processing primitive
Version Control for data
These ideas lead directly to a system that's much more powerful, flexible and easy to use.
To process data, you simply create a containerized program which reads and writes to the local filesystem. You can use any tools you want because it's all just going in a container! Pachyderm will take your container and inject data into it. We'll then automatically replicate your container, showing each copy a different chunk of data. With this technique, Pachyderm can scale any code you write to process up to petabytes of data (Example: distributed grep).
Pachyderm also version controls all data using a commit-based distributed filesystem (PFS), similar to what git does with code. Version control for data has far reaching consequences in a distributed filesystem. You get the full history of your data, can track changes and diffs, collaborate with teammates, and if anything goes wrong you can revert the entire cluster with one click!
Version control is also very synergistic with our containerized processing engine. Pachyderm understands how your data changes and thus, as new data is ingested, can run your workload on the diff of the data rather than the whole thing. This means that there's no difference between a batched job and a streaming job, the same code will work for both!
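The idea of running a workload on the diff between commits can be sketched as follows, assuming each commit is modelled as a mapping from file path to content hash. The functions `changed_paths` and `run_incremental` are hypothetical illustrations and not part of Pachyderm.

```python
def changed_paths(parent, child):
    """Paths that are new or modified in `child` relative to `parent`."""
    return sorted(path for path, digest in child.items() if parent.get(path) != digest)


def run_incremental(parent, child, process):
    """Run `process` only on the files that differ between the two commits."""
    return {path: process(path) for path in changed_paths(parent, child)}


commit_1 = {"logs/a.txt": "h1", "logs/b.txt": "h2"}
commit_2 = {"logs/a.txt": "h1", "logs/b.txt": "h2-edited", "logs/c.txt": "h3"}

# "Streaming" step: only the edited and the newly added file are processed.
processed = run_incremental(commit_1, commit_2, process=str.upper)

# "Batch" run: the same code with an empty parent commit processes everything.
full_run = run_incremental({}, commit_2, process=str.upper)
```

Treating a batch job as a diff against an empty commit is one way to read the claim that the same code serves both batch and streaming workloads.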
]]>analytics data containers golang pachyderm tools data-science docker version-controlhttps://pinboard.in/https://pinboard.in/u:jm/b:29d3b1dc41d5/Data from pacemaker used to arrest man for arson, insurance fraud2017-02-05T22:26:20+00:00
http://www.zdnet.com/article/data-from-pacemaker-used-to-arrest-man-for-arson-insurance-fraud/
jmCompton has medical conditions which include an artificial heart linked to an external pump. According to court documents, a cardiologist said that "it is highly improbable Mr. Compton would have been able to collect, pack and remove the number of items from the house, exit his bedroom window and carry numerous large and heavy items to the front of his residence during the short period of time he has indicated due to his medical conditions."
After US law enforcement caught wind of this peculiar element to the story, police were able to secure a search warrant and collect the pacemaker's electronic records to scrutinize his heart rate, the demand on the pacemaker and heart rhythms prior to and at the time of the incident.
]]>pacemakers health medicine privacy data arson insurance fraud hearthttps://pinboard.in/https://pinboard.in/u:jm/b:ed5294e50e3a/The Rise of the Data Engineer2017-01-25T10:17:54+00:00
https://medium.com/@maximebeauchemin/the-rise-of-the-data-engineer-91be18f1e603#.lt9nm6ls7
jmdata-engineering engineering coding data big-data airbnb maxime-beauchemin data-warehousehttps://pinboard.in/https://pinboard.in/u:jm/b:881a74c1c0c7/Sankey diagram - Wikipedia2017-01-24T10:23:04+00:00
https://en.wikipedia.org/wiki/Sankey_diagram
jmsankey diagrams dataviz data vizhttps://pinboard.in/https://pinboard.in/u:jm/b:8047b58aa24e/