Pinboard (arnicas)
https://pinboard.in/u:arnicas/public/
recent bookmarks from arnicas

Data exploration and filtering with Nomic Atlas
2024-03-23T08:51:21+00:00
https://huggingface.co/blog/visheratin/nomic-data-cleaning
arnicas | tags: umap data cleaning clustering articles
https://pinboard.in/u:arnicas/b:2daf23302708/

NousResearch/Genstruct-7B · Hugging Face
2024-03-08T07:53:34+00:00
https://huggingface.co/NousResearch/Genstruct-7B
arnicas | tags: models data augmentation generation text questions
https://pinboard.in/u:arnicas/b:3589c4fa350c/

guidelines1995.pdf
2024-02-21T13:07:14+00:00
https://cidoc.mini.icom.museum/wp-content/uploads/sites/6/2020/03/guidelines1995.pdf
arnicas | tags: museums data metadata
https://pinboard.in/u:arnicas/b:c3a3e1c90ef9/

[2402.13064] Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
2024-02-21T08:36:39+00:00
https://arxiv.org/abs/2402.13064
arnicas | tags: training llm data augmentation
https://pinboard.in/u:arnicas/b:6086fe1e1db7/

GitHub - moj-analytical-services/splink: Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
2024-02-16T13:54:42+00:00
https://github.com/moj-analytical-services/splink
arnicas | tags: data deduplication
https://pinboard.in/u:arnicas/b:c63c98bdd1a3/

Welcome to RDM101 — RDM101 Course
2024-02-16T09:29:38+00:00
https://tu-delft-library.github.io/rdm101-book/intro.html
arnicas | tags: data pipelines courses research
https://pinboard.in/u:arnicas/b:a70e61632e1f/

[2402.05121] Large Language Model for Table Processing: A Survey
2024-02-09T07:56:46+00:00
https://arxiv.org/abs/2402.05121
arnicas | tags: data tables llm reference ai
https://pinboard.in/u:arnicas/b:4df7551d0b6d/

Home | CIDOC CRM
2024-02-08T13:30:59+00:00
https://www.cidoc-crm.org/
arnicas | tags: museums collections data
https://pinboard.in/u:arnicas/b:1641041cde51/

Always Already Computational • Always Already Computational - Collections as Data
2024-01-27T08:12:22+00:00
https://collectionsasdata.github.io/
arnicas | tags: libraries museums data
https://pinboard.in/u:arnicas/b:104eba4b51f4/

Paper page - Genie: Achieving Human Parity in Content-Grounded Datasets Generation
2024-01-26T07:25:16+00:00
https://huggingface.co/papers/2401.14367
arnicas | tags: data augmentation text
https://pinboard.in/u:arnicas/b:f631d8f08aa9/

GitHub - allenai/dolma: Data and tools for generating and inspecting OLMo pre-training data.
2024-01-22T08:12:17+00:00
https://github.com/allenai/dolma
arnicas | tags: data tools awesome text deduplication
https://pinboard.in/u:arnicas/b:27699e6a3406/

Nomic Atlas
2024-01-21T08:48:05+00:00
https://atlas.nomic.ai/
arnicas | tags: data tools clustering topics embeddings text
https://pinboard.in/u:arnicas/b:288392a39879/

GitHub - huggingface/datatrove: Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
2024-01-21T08:45:12+00:00
https://github.com/huggingface/datatrove
arnicas | tags: data tools nlp deduplication text awesome training llm
https://pinboard.in/u:arnicas/b:33142dc0d422/

Budgeting with ChatGPT | Jon Callahan
2024-01-11T07:58:05+00:00
https://www.joncallahan.com/blog/ai-txns/
arnicas | tags: data json articles chatgpt
https://pinboard.in/u:arnicas/b:c5ea139a87da/

skrub-data/skrub: Prepping tables for machine learning
2023-12-06T14:57:17+00:00
https://github.com/skrub-data/skrub/
arnicas | tags: data cleaning tables deduplication tools
https://pinboard.in/u:arnicas/b:08eadd110fa1/

GitHub - cleanlab/cleanlab: The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
2023-11-23T08:07:48+00:00
https://github.com/cleanlab/cleanlab/
arnicas | tags: data tools cleaning ml
https://pinboard.in/u:arnicas/b:8f8e0550d9e4/

[2310.19019] TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise
2023-10-31T07:43:32+00:00
https://arxiv.org/abs/2310.19019
arnicas | tags: augmentation nlp training data
https://pinboard.in/u:arnicas/b:5a0bbfaf5bab/

richardbrath | Visual encodings for data visualization.
2023-08-29T07:54:57+00:00
https://richardbrath.wordpress.com/
arnicas | tags: books text ai nlp infovis training data
https://pinboard.in/u:arnicas/b:a1a748db9dc0/

[2308.04076v1] DataTales: Investigating the use of Large Language Models for Authoring Data-Driven Articles
2023-08-16T09:43:20+00:00
https://arxiv.org/abs/2308.04076v1
arnicas | tags: writing generation data infovis
https://pinboard.in/u:arnicas/b:309cb5dc3fb8/

Serra-Technologies/serra: Python-based dbt alternative
2023-08-16T09:30:17+00:00
https://github.com/Serra-Technologies/serra
arnicas | tags: python data tools dbt
https://pinboard.in/u:arnicas/b:5c4d9a527931/

gforsyth/ibis-tutorial
2023-08-16T08:47:21+00:00
https://github.com/gforsyth/ibis-tutorial
arnicas | tags: tools tutorials python data
https://pinboard.in/u:arnicas/b:7cfb878d869c/

GitHub - wandb/weave: Weave, developed by the team at Weights and Biases, is a new open-source toolkit designed for performant, interactive data exploration. Our mission is to equip Machine Learning practitioners with the best tools to turn data into insi
2023-06-29T07:27:04+00:00
https://github.com/wandb/weave
arnicas | tags: jupyter analysis data tools infovis awesome
https://pinboard.in/u:arnicas/b:1be3905abaf8/

a-pretrainers-guide/A Pretrainer's Guide To Training Data.pdf at main · shayne-longpre/a-pretrainers-guide · GitHub
2023-05-28T08:29:01+00:00
https://github.com/shayne-longpre/a-pretrainers-guide/blob/main/A%20Pretrainer's%20Guide%20To%20Training%20Data.pdf
arnicas | tags: training nlp toxic data reference
https://pinboard.in/u:arnicas/b:78fb25acc5a9/

GitHub - Collection-Space-Navigator/CSN: Interactive Visualization Interface for Multidimensional Datasets
2023-05-17T07:33:49+00:00
https://github.com/Collection-Space-Navigator/CSN
arnicas | tags: multimodal data tools infovis clustering awesome
https://pinboard.in/u:arnicas/b:6029cbad96b6/

GitHub - cleanlab/cleanvision: Automatically find issues in image datasets and practice data-centric computer vision.
2023-05-07T08:05:07+00:00
https://github.com/cleanlab/cleanvision
arnicas | tags: tools images data cleaning
https://pinboard.in/u:arnicas/b:c5fd6088742d/

fair use and foundation models
2023-05-05T08:03:03+00:00
https://arxiv.org/pdf/2303.15715.pdf
arnicas | tags: data copyright nlp ai awesome
https://pinboard.in/u:arnicas/b:18cb4dd8d8ee/

GitHub - 1rgs/jsonformer
2023-05-02T07:25:44+00:00
https://github.com/1rgs/jsonformer
arnicas | tags: json generation tools data
https://pinboard.in/u:arnicas/b:44e4fc712ea8/

Class Imbalance Strategies — A Visual Guide with Code | by Travis Tang | Apr, 2023 | Towards Data Science
2023-04-27T07:03:44+00:00
https://towardsdatascience.com/class-imbalance-strategies-a-visual-guide-with-code-8bc8fae71e1a
arnicas | tags: ml data tips
https://pinboard.in/u:arnicas/b:da012ea654e0/

project-baize/baize-chatbot: Let ChatGPT teach your own chatbot in hours with a single GPU!
2023-04-14T13:37:22+00:00
https://github.com/project-baize/baize-chatbot
arnicas | tags: chatgpt data training models
https://pinboard.in/u:arnicas/b:ea14316b9806/

[2304.03022] TagGPT: Large Language Models are Zero-shot Multimodal Taggers
2023-04-07T06:35:09+00:00
https://arxiv.org/abs/2304.03022
arnicas | tags: data labeling nlp images
https://pinboard.in/u:arnicas/b:5e858c76d312/

Instruction Tuning with GPT-4
2023-04-07T06:33:56+00:00
https://instruction-tuning-with-gpt-4.github.io/
arnicas | tags: gpt4 instruction training data generation
https://pinboard.in/u:arnicas/b:5a6e2597cc1f/

comet-ml/kangas: 🦘 Explore multimedia datasets at scale
2023-04-05T14:18:03+00:00
https://github.com/comet-ml/kangas
arnicas | tags: images data tools pandas awesome
https://pinboard.in/u:arnicas/b:988155d82d2d/

johnkerl/miller: Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
2023-03-16T12:59:59+00:00
https://github.com/johnkerl/miller
arnicas | tags: json data csv tools unix
https://pinboard.in/u:arnicas/b:f15b6f2f0403/

GitHub - eto-ai/lance: Alternative to Parquet. 100x faster for random access, automatic versioning, optimized for computer vision, bioinformatics, spatial and ML data. Apache Arrow and DuckDB compatible.
2023-02-08T07:10:58+00:00
https://github.com/eto-ai/lance
arnicas | tags: duckdb parquet performance data tools
https://pinboard.in/u:arnicas/b:543ef0004d58/

Tutorial: ChatGPT Over Your Data
2023-02-08T07:06:19+00:00
https://blog.langchain.dev/tutorial-chatgpt-over-your-data/
arnicas | tags: chatgpt questions data
https://pinboard.in/u:arnicas/b:310a04dea4bc/

GitHub - webdataset/webdataset: A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.
2023-01-23T07:39:32+00:00
https://github.com/webdataset/webdataset
arnicas | tags: data bigdata ai training
https://pinboard.in/u:arnicas/b:32b3d79de4b5/

onekey-sec/unblob: Extract files from any kind of container formats
2023-01-19T08:52:09+00:00
https://github.com/onekey-sec/unblob
arnicas | tags: UNIX data tools
https://pinboard.in/u:arnicas/b:45297f06f1e2/

GPT and Pouring Language Through Shape
2023-01-14T07:30:19+00:00
https://blog.humphd.org/pouring-language-through-shape/
arnicas | tags: gpt3 chatgpt json data generation awesome
https://pinboard.in/u:arnicas/b:da2fe8369f5b/

RandomFractals/vscode-data-preview: Data Preview 🈸 extension for importing 📤 viewing 🔎 slicing 🔪 dicing 🎲 charting 📊 & exporting 📥 large JSON array/config, YAML, Apache Arrow, Avro, Parquet & Excel data files
2022-12-23T10:09:36+00:00
https://github.com/RandomFractals/vscode-data-preview#configuration
arnicas | tags: vscode data tools
https://pinboard.in/u:arnicas/b:cbf975910f42/

ropeladder/record-linkage-resources: Resources for tackling record linkage / deduplication / data matching problems
2022-12-19T10:37:02+00:00
https://github.com/ropeladder/record-linkage-resources#name-parsers
arnicas | tags: data text cleaning linking knowledge entities names reference
https://pinboard.in/u:arnicas/b:7b90b2534bfd/

The 'New York Times' Best Seller Lists Theories Explained
2022-12-10T08:09:03+00:00
https://www.esquire.com/entertainment/books/a42189320/the-new-york-times-best-seller-lists-explained/
arnicas | tags: books data
https://pinboard.in/u:arnicas/b:4ccf90fe7735/

data-preparation/preprocessing/training/01b_oscar_cleaning_and_filtering at main · bigscience-workshop/data-preparation · GitHub
2022-12-06T09:52:40+00:00
https://github.com/bigscience-workshop/data-preparation/tree/main/preprocessing/training/01b_oscar_cleaning_and_filtering
arnicas | tags: data text cleaning
https://pinboard.in/u:arnicas/b:b55350197384/

Cell editing | OpenRefine
2022-12-05T13:33:20+00:00
https://openrefine.org/docs/manual/cellediting#cluster-and-edit
arnicas | tags: clustering data cleaning tools deduplication
https://pinboard.in/u:arnicas/b:95743b21d652/

koaning/human-learn: Natural Intelligence is still a pretty good idea.
2022-11-26T09:14:19+00:00
https://github.com/koaning/human-learn/
arnicas | tags: labeling data umap classification tools
https://pinboard.in/u:arnicas/b:ec4affed4e29/

Crosswalker
2022-11-23T06:39:28+00:00
https://crosswalker.washingtonpost.com/
arnicas | tags: data text tools cleaning duplicates
https://pinboard.in/u:arnicas/b:6cc8ab3eb98f/

Grammatical Error Correction with Machine Learning — Overview and Implementation | by Farzad Mahmoodinobar | Nov, 2022 | Towards Data Science
2022-11-17T07:53:06+00:00
https://towardsdatascience.com/grammatical-error-correction-with-machine-learning-overview-and-implementation-ccd0b50a1700
arnicas | tags: data cleaning text grammar nlp
https://pinboard.in/u:arnicas/b:37db3b95de71/

Datasette
2022-11-16T07:05:40+00:00
https://lite.datasette.io/
arnicas | tags: wasm data databases web
https://pinboard.in/u:arnicas/b:9098cb123030/

Planning to leave Twitter? / Observable / Observable
2022-11-08T17:40:30+00:00
https://observablehq.com/@observablehq/save-and-analyze-your-twitter-archive
arnicas | tags: twitter data infovis observable
https://pinboard.in/u:arnicas/b:334fe7b17a75/

Convert JSON to Swift, C#, TypeScript, Objective-C, Go, Java, C++ and more • quicktype
2022-11-07T14:48:44+00:00
https://quicktype.io/
arnicas | tags: json data tools awesome
https://pinboard.in/u:arnicas/b:7a0f98bbbc93/

GitHub - edsu/wikipediarevs: A commandline utility for downloading the revision history for one or more Wikipedi articles.
2022-10-23T07:23:42+00:00
https://github.com/edsu/wikipediarevs#readme
arnicas | tags: wikipedia data tools
https://pinboard.in/u:arnicas/b:723754acc65d/

prodigy-recipes/contrib/dedupe at master · explosion/prodigy-recipes
2022-10-17T13:02:10+00:00
https://github.com/explosion/prodigy-recipes/tree/master/contrib/dedupe
arnicas | tags: deduplication data cleaning
https://pinboard.in/u:arnicas/b:55730d94ca05/

Distance 距離
2022-10-17T07:08:36+00:00
https://kyndinfo.notion.site/Distance-7b1350a2b6374478a177c2ea275cc651
arnicas | tags: infovis data awesome
https://pinboard.in/u:arnicas/b:8b4c3ead8d92/

About — Python Record Linkage Toolkit 0.15 documentation
2022-10-13T15:28:16+00:00
https://recordlinkage.readthedocs.io/en/latest/about.html#introduction
arnicas | tags: data cleaning deduplication
https://pinboard.in/u:arnicas/b:5f545bdd2ca6/

Where Is All the Book Data? - Public Books
2022-10-12T07:58:09+00:00
https://www.publicbooks.org/where-is-all-the-book-data/
arnicas | tags: books data
https://pinboard.in/u:arnicas/b:f6e92ee8f7d4/

GitHub - awesome-panel/examples: A repository of awesome panel examples. The apps are running entirely in the browser as webassembly apps. NO SERVER REQUIRED.
2022-10-07T06:38:04+00:00
https://github.com/awesome-panel/examples
arnicas | tags: infovis data python dashboards tools
https://pinboard.in/u:arnicas/b:dfa1f4b13f1f/

Getting tabular data from unstructured text with GPT-3: an ongoing experiment – Roberto Rocha
2022-10-06T07:05:26+00:00
https://robertorocha.info/getting-tabular-data-from-unstructured-text-with-gpt-3-an-ongoing-experiment/
arnicas | tags: gpt2 data tables nlp
https://pinboard.in/u:arnicas/b:983ddd537ca8/

AI Data Laundering: How Academic and Nonprofit Researchers Shield Tech Companies from Accountability - Waxy.org
2022-10-02T07:43:59+00:00
https://waxy.org/2022/09/ai-data-laundering-how-academic-and-nonprofit-researchers-shield-tech-companies-from-accountability/
arnicas | tags: data ethics ai legal
https://pinboard.in/u:arnicas/b:a63d0780b2b3/

Stream Processing and Data Analysis with ksqlDB | by João Pedro | Sep, 2022 | Towards Data Science
2022-09-26T11:31:01+00:00
https://towardsdatascience.com/stream-processing-and-data-analysis-with-ksqldb-97f1ca4fcf6a
arnicas | tags: streaming data kafka databases
https://pinboard.in/u:arnicas/b:f6cf0d761f45/

GitHub - florencesn/Elden-Ring-Survey-2022: Data from an archaeological survey of Elden Ring
2022-09-09T07:01:21+00:00
https://github.com/florencesn/Elden-Ring-Survey-2022
arnicas | tags: games data archaeology awesome
https://pinboard.in/u:arnicas/b:574bf30b6266/

nichtich/wikidata-taxonomy: command-line tool to extract taxonomies from Wikidata
2022-09-07T15:05:11+00:00
https://github.com/nichtich/wikidata-taxonomy
arnicas | tags: wikipedia tools data
https://pinboard.in/u:arnicas/b:a3b93a72dddf/

OpenLink Virtuoso SPARQL Query Editor
2022-08-31T14:33:59+00:00
https://wikidata.demo.openlinksw.com/sparql
arnicas | tags: wikipedia data
https://pinboard.in/u:arnicas/b:9e325229c12a/

Bergvca/string_grouper: Super Fast String Matching in Python
2022-08-17T12:48:19+00:00
https://github.com/Bergvca/string_grouper#find-all-matches-within-a-single-dataset
arnicas | tags: data text cleaning clustering duplicates
https://pinboard.in/u:arnicas/b:40b6b184b070/

data-preparation/preprocessing/filtering/deduplicate at main · bigscience-workshop/data-preparation
2022-08-17T11:20:35+00:00
https://github.com/bigscience-workshop/data-preparation/tree/main/preprocessing/filtering/deduplicate
arnicas | tags: text data cleaning tools
https://pinboard.in/u:arnicas/b:f7e3b2d27142/

Group thousands of similar spreadsheet text cells in seconds | by Luke Whyte | Towards Data Science
2022-08-17T09:49:38+00:00
https://towardsdatascience.com/group-thousands-of-similar-spreadsheet-text-cells-in-seconds-2493b3ce6d8d
arnicas | tags: data nlp cleaning text duplicates clustering
https://pinboard.in/u:arnicas/b:78b441b89a03/

datasketch: Big Data Looks Small — datasketch 1.0.0 documentation
2022-08-17T09:39:51+00:00
http://ekzhu.com/datasketch/index.html
arnicas | tags: data cleaning duplicates text
https://pinboard.in/u:arnicas/b:f7d966c94fe2/

google-research/deduplicate-text-datasets
2022-08-17T09:36:14+00:00
https://github.com/google-research/deduplicate-text-datasets
arnicas | tags: data tools nlp cleaning duplicates
https://pinboard.in/u:arnicas/b:5c354c4de635/

An Exhausting Attempt of Reviewing Perec’s An Attempt at Exhausting a Place in Paris | HTMLGIANT
2022-06-21T15:26:23+00:00
https://htmlgiant.com/reviews/an-exhausting-attempt-of-reviewing-perec%E2%80%99s-an-attempt-at-exhausting-a-place-in-paris/
arnicas | tags: writing data awesome
https://pinboard.in/u:arnicas/b:ac8370bb46e8/

Quickly Explore & Analyze Your Data For Faster Insights / Observable / Observable
2022-06-17T07:18:21+00:00
https://observablehq.com/@observablehq/introducing-data-table-cell?collection=%40observablehq%2Fobservable-blog
arnicas | tags: excel observable tables data tools awesome
https://pinboard.in/u:arnicas/b:581ea14ac5ec/

The Not Tale (Funeral) by Caroline Bergvall | Poetry Magazine
2022-06-14T14:21:11+00:00
https://www.poetryfoundation.org/poetrymagazine/poems/52690/the-not-tale-funeral
arnicas | tags: poetry data
https://pinboard.in/u:arnicas/b:695e2ca2e87f/

GitHub - timkpaine/tributary: Streaming reactive and dataflow graphs in Python
2022-06-09T11:39:19+00:00
https://github.com/timkpaine/tributary
arnicas | tags: airflow data
https://pinboard.in/u:arnicas/b:b2eea98674d6/