Pinboard (jm)
https://pinboard.in/u:jm/public/
recent bookmarks from jm
Pete Hunt's contrarian RDBMS tips
2023-12-18T15:11:17+00:00
https://twitter.com/floydophone/status/1708567151953743903
jm
1. It's often better to add tables than alter existing ones. This is especially true in a larger company. Making changes to core tables that other teams depend on is very risky and can be subject to many approvals. This reduces your team's agility a lot.
Instead, try adding a new table that is wholly owned by your team. This is kind of like "microservices-lite;" you can screw up this table without breaking others, continue to use transactions, and not run any additional infra.
(yes, this violates database normalization principles, but in the real world where you need to consider performance we violate those principles all the time)
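A minimal sketch of point 1, using Python's built-in sqlite3 module as a stand-in for a production RDBMS. The table and column names (`users`, `growth_user_prefs`, `digest_opt_in`) are hypothetical; the point is only that the new data lives in a side table keyed by the core table's ID, and that one transaction still spans both tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical core table owned by another team; its schema is left alone.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")

# New side table wholly owned by our team, keyed by the core table's id.
conn.execute("""
    CREATE TABLE growth_user_prefs (
        user_id INTEGER PRIMARY KEY,
        digest_opt_in INTEGER NOT NULL DEFAULT 0
    )
""")

# Both tables live in the same database, so one transaction covers both writes.
with conn:
    conn.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")
    conn.execute(
        "INSERT INTO growth_user_prefs (user_id, digest_opt_in) VALUES (1, 1)"
    )

# Reading the new attribute is a join instead of an extra column on users.
row = conn.execute("""
    SELECT u.email, p.digest_opt_in
    FROM users u JOIN growth_user_prefs p ON p.user_id = u.id
    WHERE u.id = 1
""").fetchone()
```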
2. Think in terms of indexes first. Every single time you write a query, you should first think: "which index should I use?" If no usable index exists, create it (or create a separate table with that index, see point 1). When writing the query, add a comment naming the index.
Before you commit any queries to the codebase, write a script to fill up your local development DB with 100k+ rows, and run EXPLAIN on your query. If it doesn't use that index, it's not ready to be committed. Baking this into an automated test would be better, but is hard to do.
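One way the "fill the dev DB and check the plan" step in point 2 could be automated, sketched with Python's sqlite3 module. This is an assumption-laden stand-in: SQLite's `EXPLAIN QUERY PLAN` replaces the `EXPLAIN` of MySQL or PostgreSQL, and the table, index and helper names (`orders`, `idx_orders_customer`, `plan_uses_index`) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# Fill the table with 100k rows so the planner sees a realistic size.
conn.executemany(
    "INSERT INTO orders (id, customer_id, total) VALUES (?, ?, ?)",
    ((i, i % 1000, float(i)) for i in range(100_000)),
)

# The query names the index it expects to use in a comment, as the tip suggests.
QUERY = (
    "SELECT id, total FROM orders WHERE customer_id = ? "
    "/* index: idx_orders_customer */"
)


def plan_uses_index(conn, query, params, index_name):
    """Return True if SQLite's query plan mentions the named index."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
    # The last column of each plan row is the human-readable detail string.
    return any(index_name in row[-1] for row in rows)


uses_index = plan_uses_index(conn, QUERY, (42,), "idx_orders_customer")
unindexed = plan_uses_index(
    conn, "SELECT id FROM orders WHERE total = ?", (1.0,), "idx_orders_customer"
)
```

A check like `assert uses_index` could then sit in a test suite, though plan output formats differ between engines and versions, so the string match would need adapting.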
3. Consider moving non-COUNT(*) aggregations out of the DB. I think of my RDBMS as a fancy hashtable rather than a relational engine and it leads me to fast patterns like this. Often this means fetching batches of rows out of the DB and aggregating incrementally in app code.
(if you have really gnarly and slow aggregations that would be hard or impossible to move to app code, you might be better off using an OLAP store / data warehouse instead)
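A rough sketch of point 3: rows are fetched in batches by primary key and summed in application code instead of using `SUM ... GROUP BY` in the database. The `payments` table, its columns and the tiny batch size are illustrative assumptions, again using sqlite3 as a stand-in.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id INTEGER PRIMARY KEY, account_id INTEGER, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO payments (id, account_id, amount_cents) VALUES (?, ?, ?)",
    [(i, i % 3, 100) for i in range(1, 10)],
)


def sum_by_account(conn, batch_size=4):
    """Aggregate amounts per account by paging through the primary key."""
    totals = defaultdict(int)
    last_id = 0
    while True:
        # Keyset pagination: each batch is a cheap range scan on the primary key.
        batch = conn.execute(
            "SELECT id, account_id, amount_cents FROM payments "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not batch:
            return dict(totals)
        for _row_id, account_id, amount_cents in batch:
            totals[account_id] += amount_cents
        last_id = batch[-1][0]


totals = sum_by_account(conn)
```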
4. Thinking in terms of "node" and "edge" tables can be useful. Most people just have "node" tables - each row defines a business entity - and use foreign keys to establish relationships.
Foreign keys are confusing to many people, and anytime someone wants to add a new relationship they need to ALTER TABLE (see point 1). Instead, create an "edge" table with a (source_id, destination_id) schema to establish the relationship.
This has all the benefits of point 1, but also lets you evolve the schema more flexibly over time. You can attach additional fields and indexing to the edge, and it makes migrating from 1-to-many to many-to-many relationships easier in the future (this happens all the time).
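A small illustration of the "node" and "edge" layout from point 4, again with sqlite3. The `users`, `teams` and `user_team_edges` tables and the `role` field are hypothetical; only the `(source_id, destination_id)` shape comes from the tip.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Node tables: one row per business entity, no relationship columns.
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Edge table: the relationship lives here, with room for extra fields.
    CREATE TABLE user_team_edges (
        source_id INTEGER NOT NULL,       -- users.id
        destination_id INTEGER NOT NULL,  -- teams.id
        role TEXT,
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (source_id, destination_id)
    );
    -- Second index so the edge can be walked in the reverse direction.
    CREATE INDEX idx_edges_destination
        ON user_team_edges (destination_id, source_id);

    INSERT INTO users VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO teams VALUES (10, 'storage'), (11, 'search');
    INSERT INTO user_team_edges (source_id, destination_id, role) VALUES
        (1, 10, 'lead'), (1, 11, 'member'), (2, 10, 'member');
""")

# Many-to-many works without altering either node table.
teams_for_ada = [r[0] for r in conn.execute(
    "SELECT t.name FROM user_team_edges e JOIN teams t ON t.id = e.destination_id "
    "WHERE e.source_id = ? ORDER BY t.name", (1,))]
members_of_storage = [r[0] for r in conn.execute(
    "SELECT u.name FROM user_team_edges e JOIN users u ON u.id = e.source_id "
    "WHERE e.destination_id = ? ORDER BY u.name", (10,))]
```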
5. Usually every table needs "created_at" and/or "updated_at" columns. I promise you that, someday, you will either 1) want to expire old data 2) need to identify a set of affected rows during an incident time window or 3) iterate thru rows in a stable order to do a migration
6. Choosing how IDs are structured is super important. Never use autoincrement. Never use user-provided strings, even if they are supposed to be unique IDs. Always use at least 64 bits. Snowflake IDs (https://en.wikipedia.org/wiki/Snowflake_ID) or ULIDs (https://github.com/ulid/spec) are a great choice.
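For a feel of what a ULID looks like, here is a simplified generator following the layout in the linked spec: a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters. The function name `new_ulid` is made up, and this sketch skips the spec's monotonic handling for IDs created within the same millisecond, so treat it as an illustration, not a production library.

```python
import os
import time

# Crockford base32 alphabet (no I, L, O or U).
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(timestamp_ms=None):
    """Return a 26-character ULID string: 48-bit time plus 80 random bits."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if not 0 <= timestamp_ms < (1 << 48):
        raise ValueError("timestamp does not fit in 48 bits")
    randomness = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = (timestamp_ms << 80) | randomness           # 128 bits in total
    chars = []
    for _ in range(26):                                 # 26 * 5 = 130 bits
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


example = new_ulid()
```

Because the timestamp occupies the leading characters, IDs from different milliseconds sort by creation time as plain strings, which also helps with the stable iteration order mentioned in point 5.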
7. Comment your queries so debugging prod issues is easier. Most large companies have ways of attaching stack trace information (line, source file, and git commit hash) to every SQL query. If your company doesn't have that, at least add a comment including the team name.
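Where no such tooling exists, point 7 can be approximated with a small helper that prefixes each query with a comment. The helper name `tag_query` and the `team=`/`source=` fields are assumptions for illustration; whether the comment survives into server-side logs depends on the driver and database in use.

```python
import sqlite3


def tag_query(sql, team, source):
    """Prefix a SQL string with a comment identifying its owner and origin."""
    def clean(value):
        # Strip comment delimiters so the tag cannot break out of the comment.
        return value.replace("*/", "").replace("/*", "")
    return f"/* team={clean(team)} source={clean(source)} */ {sql}"


tagged = tag_query("SELECT 1", "payments", "billing/report.py:42")

# The tagged query still runs unchanged; the comment is ignored by the engine.
result = sqlite3.connect(":memory:").execute(tagged).fetchone()
```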
Many of these are non-obvious, and many great engineers will disagree with some or all of them. And, of course, there are situations when you should not follow them. YMMV!
Number 5 is absolutely, ALWAYS true, in my experience. And I love the idea of commenting queries... must follow more of these.
rdbms databases oltp data querying storage architecture
https://pinboard.in/u:jm/b:171bde124c55/

Vector Embeddings
2023-10-03T10:24:40+00:00
https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
jm
Text [vector] embeddings measure the relatedness of text strings. Embeddings are commonly used for:
Search (where results are ranked by relevance to a query string);
Clustering (where text strings are grouped by similarity);
Recommendations (where items with related text strings are recommended);
Anomaly detection (where outliers with little relatedness are identified);
Diversity measurement (where similarity distributions are analyzed);
Classification (where text strings are classified by their most similar label);
An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
Commonly used as a storage format in vector databases (cf. https://vercel.com/guides/vector-databases). Search using text embeddings is therefore implemented using cosine similarity or k-nearest neighbour to find vector similarity.
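A toy version of that search step, in plain Python. The three-dimensional vectors below are hand-made placeholders, not real model output (real embeddings have hundreds or thousands of dimensions), and `nearest` is a brute-force k-nearest-neighbour scan of the kind a vector database would replace with an index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, corpus, k=2):
    """Return the labels of the k corpus items most similar to the query."""
    ranked = sorted(
        corpus, key=lambda item: cosine_similarity(query, item[1]), reverse=True
    )
    return [label for label, _vector in ranked[:k]]


# Made-up vectors standing in for embeddings of three bookmark titles.
corpus = [
    ("sqlite wal", [0.9, 0.1, 0.0]),
    ("mysql undo logs", [0.8, 0.3, 0.1]),
    ("brexit mapping", [0.0, 0.2, 0.9]),
]
top = nearest([1.0, 0.0, 0.0], corpus)
```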
Looks like https://www.trychroma.com/ is the current open source vector DB of choice, at the moment.
(via Simon Willison)
ai openai via:simonw vector-embeddings text-embeddings text storage databases search similarity clustering recommendations anomaly-detection classification vector-databases
https://pinboard.in/u:jm/b:95c03aa49119/

SQLite has Write-Ahead Logging
2023-06-23T09:28:05+00:00
https://sqlite.org/wal.html
jm
databases performance unix sqlite wordpress django wal concurrency
https://pinboard.in/u:jm/b:a64af67329c6/

Exploring performance differences between Amazon Aurora and vanilla MySQL | Plaid
2023-04-12T08:28:02+00:00
https://plaid.com/blog/exploring-performance-differences-between-amazon-aurora-and-vanilla-mysql/
jm
because Aurora MySQL primary and replica instances share a storage layer, they share a set of undo logs. This means that, for a REPEATABLE READ isolation level, the storage instance must maintain undo logs at least as far back as could be required to satisfy transactional guarantees for the primary or any read replica instance. Long-running replica transactions can negatively impact writer performance in Aurora MySQL - finally, an explanation for the incident that spawned this investigation.
The same scenario plays out differently in vanilla MySQL because of its different model for undo logs.
Vanilla MySQL: there are two undo logs – one on the writer, and one on the reader. The performance impact of an operation that prevents the garbage collection of undo log records will be isolated to either the writer or the reader.
Aurora MySQL: there is a single undo log that is shared between the writer and reader. The performance impact of an operation that prevents the garbage collection of undo log records will affect the entire cluster.
aurora aws mysql performance databases isolation-levels
https://pinboard.in/u:jm/b:06b1342ee8c4/

MariaDB.com is dead, long live MariaDB.org
2023-04-07T08:50:26+00:00
https://medium.com/@imashadowphantom/mariadb-com-is-dead-long-live-mariadb-org-b8a0ca50a637
jm
Monty, the creator of MySQL and MariaDB founder, hasn’t been at a company meeting for over a year and a half. The relationship between Monty and the CEO, Michael Howard, is extremely rocky. At a company all-hands meeting Monty and Michael Howard were shouting at each other while up on stage in the auditorium in front of the entire staff. Monty made his position perfectly clear as he shouted his last words before he walked out:
“You’re killing my fu@$! company!!!”
Monty was subsequently voted off the board in July of 2022 solidifying the hostile takeover by Michael Howard. Buyer beware, Monty and his group of founders and database experts are no longer at the company.
At least the open-source product is still trustworthy, though.
databases storage mariadb software open-source companies
https://pinboard.in/u:jm/b:1612abb9e7f0/

Every Cloud Architecture
2023-01-10T16:54:37+00:00
https://www.goodtechthings.com/every-cloud-architecture/
jm
architecture cloud comics containers event-bus funny databases
https://pinboard.in/u:jm/b:0e59f109b7f3/

cachegrand
2022-09-15T11:07:09+00:00
https://github.com/danielealbano/cachegrand
jm
cache databases caching storage key-value-stores
https://pinboard.in/u:jm/b:5323fba149ee/

Tailscale moving _back_ to SQLite
2022-04-04T13:24:34+00:00
https://tailscale.com/blog/database-for-2022/
jm
databases sqlite sql storage litestream wal s3 state
https://pinboard.in/u:jm/b:b30c08ee0f3d/

RangeBitmap
2022-03-10T15:04:14+00:00
https://richardstartin.github.io/posts/range-bitmap-index
jm
bitmaps algorithms coding ranges indexes indexing databases storage pinot richard-startin
https://pinboard.in/u:jm/b:74bce9fe122d/

Databases to keep an eye on
2022-01-06T10:22:25+00:00
https://pradeepchhetri.xyz/databasestokeepaneyeon/
jm
storage databases open-source rust
https://pinboard.in/u:jm/b:37f8cb90dd9c/

What's New in ClickHouse 21.12
2021-12-20T10:12:28+00:00
https://news.ycombinator.com/item?id=29577794
jm
clickhouse time-series yandex storage databases via:hn
https://pinboard.in/u:jm/b:372b4269aa03/

OpenStreetMap looks to relocate to EU due to Brexit limitations
2021-06-30T10:24:23+00:00
https://www.theguardian.com/politics/2021/jun/30/openstreetmap-looks-to-relocate-to-eu-due-to-brexit-limitations
jm
One “important reason”, Rischard said, was the failure of the UK and EU to agree on mutual recognition of database rights. While both have an agreement to recognise copyright protections, that only covers work which is creative in nature.
Maps, as a simple factual representation of the world, are not covered by copyright in the same way, but until Brexit were covered by an EU-wide agreement that protected databases where there had been “a substantial investment in obtaining, verifying or presenting the data”. But since Brexit, any database made on or after 1 January 2021 in the UK will not be protected in the EU, and vice versa.
Other concerns Rischard listed include the increasing complexity and cost of “banking, finance and using PayPal in the UK”, the inability for the organisation to secure charitable status, and the loss of .eu domains.
The increased importance of the EU in matters of tech regulation also played a role: “We could more effectively lobby the EU [and] EU governments and have more of an impact, especially in countries where there is no local chapter,” Rischard wrote.
mapping brexit uk osm openstreetmap eu copyright databases ip
https://pinboard.in/u:jm/b:c66761c45e4c/

inside the LAPD/LASD usage of Palantir
2020-09-30T09:33:08+00:00
https://www.buzzfeednews.com/article/carolinehaskins1/training-documents-palantir-lapd
jm
Much of the LAPD data consists of the names of people arrested for, convicted of, or even suspected of committing crimes, but that’s just where it starts. Palantir also ingests the bycatch of daily law enforcement activity. Maybe a police officer was told a person knew a suspected gang member. Maybe an officer spoke to a person who lived near a crime “hot spot,” or was in the area when a crime happened. Maybe a police officer simply had a hunch. The context is immaterial. Once the LAPD adds a name to Palantir’s database, that person becomes a data point in a massive police surveillance system. [...] At great taxpayer expense, and without public oversight or regulation, Palantir helped the LAPD construct a vast database that indiscriminately lists the names, addresses, phone numbers, license plates, friendships, romances, jobs of Angelenos - the guilty, innocent, and those in between.
This is absolute garbage -- total bias built-in. No evidence required to get a person in the firing line:
“The focus of a data-driven surveillance system is to put a lot of innocent people in the system,” Ferguson said. “And that means that many folks who end up in the Palantir system are predominantly poor people of color, and who have already been identified by the gaze of police.”
palantir databases privacy law lapd lasd los-angeles surveillance big-brother police crime gangs
https://pinboard.in/u:jm/b:414f0c4a714e/

Star-Tree Index: Powering Fast Aggregations on Pinot | LinkedIn Engineering
2020-01-22T23:05:56+00:00
https://engineering.linkedin.com/blog/2019/06/star-tree-index--powering-fast-aggregations-on-pinot
jm
With such huge improvements for both latency and throughput, the Star-Tree index only costs about 12% extra storage space compared to data without indexing techniques and 6% extra compared to data with inverted index.
star-tree sql querying search pinot linkedin algorithms databases indexing indexes
https://pinboard.in/u:jm/b:476e7d283e21/

Serving 100µs reads with 100% availability · Segment Blog
2020-01-09T17:20:25+00:00
https://segment.com/blog/separating-our-data-and-control-planes-with-ctlstore/
jm
architecture databases performance sqlite segment ops docker
https://pinboard.in/u:jm/b:4a4fc0c28540/

ankane/strong_migrations: Catch unsafe Rails migrations at dev time
2019-04-08T09:39:48+00:00
https://github.com/ankane/strong_migrations
jm
Strong Migrations detects potentially dangerous operations in [Rails database] migrations, prevents them from running by default, and provides instructions on safer ways to do what you want.
database migrations rails releases ops databases mysql ruby gems
https://pinboard.in/u:jm/b:11703c588753/

Attack of the week: searchable encryption and the ever-expanding leakage function
2019-02-13T14:20:18+00:00
https://blog.cryptographyengineering.com/2019/02/11/attack-of-the-week-searchable-encryption-and-the-ever-expanding-leakage-function/
jm
In all seriousness: database encryption has been a controversial subject in our field. I wish I could say that there’s been an actual debate, but it’s more that different researchers have fallen into different camps, and nobody has really had the data to make their position in a compelling way. There have actually been some very personal arguments made about it. The schools of thought are as follows:
The first holds that any kind of database encryption is better than storing records in plaintext and we should stop demanding things be perfect, when the alternative is a world of constant data breaches and sadness.
To me this is a supportable position, given that the current attack model for plaintext databases is something like “copy the database files, or just run a local SELECT * query”, and the threat model for an encrypted database is “gain persistence on the server and run sophisticated statistical attacks.” Most attackers are pretty lazy, so even a weak system is probably better than nothing.
The countervailing school of thought has two points: sometimes the good is much worse than the perfect, particularly if it gives application developers an outsized degree of confidence of the security that their encryption system is going to provide them.
If even the best encryption protocol is only throwing a tiny roadblock in the attacker’s way, why risk this at all? Just let the database community come up with some kind of ROT13 encryption that everyone knows to be crap and stop throwing good research time into a problem that has no good solution.
I don’t really know who is right in this debate. I’m just glad to see we’re getting closer to having it.
(via Jerry Connolly)
cryptography attacks encryption database crypto security storage ppi gdpr search databases via:ecksor
https://pinboard.in/u:jm/b:1935af4cab15/

Trek10 | From relational DB to single DynamoDB table: a step-by-step exploration
2019-01-08T12:33:01+00:00
https://www.trek10.com/blog/dynamodb-single-table-relational-modeling/
jm
Is modeling my relational database in a single DynamoDB table really a good idea?
About a year ago, I wrote a fairly popular article called “Why DynamoDB isn’t for everyone”. Many of the technical criticisms of DynamoDB I put forth at that time (lack of operational controls such as backup/restore; a persistent problem with hot keys) have since been partially or fully resolved due to a truly awe-inspiring run of feature releases from the DynamoDB team.
However, the central argument of that article remains valid: DynamoDB is a powerful tool when used properly, but if you don’t know what you’re doing it’s a deceptively user-friendly guide into madness. And the further you stray into esoteric applications like relational modeling, the more sure you’d better be that you know what you’re getting into. Especially with SQL-friendly “serverless” databases like Amazon Aurora hitting their stride, you have a lot of fully-managed options with a smaller learning curve.
dynamodb databases storage nosql sql relational aws relations
https://pinboard.in/u:jm/b:3cda4573d696/

UIDAI’s Aadhaar Software Hacked, ID Database Compromised, Experts Confirm
2018-09-11T10:20:23+00:00
https://www.huffingtonpost.in/2018/09/11/uidai-s-aadhaar-software-hacked-id-database-compromised-experts-confirm_a_23522472/
jm
The authenticity of the data stored in India's controversial Aadhaar identity database, which contains the biometrics and personal information of over 1 billion Indians, has been compromised by a software patch that disables critical security features of the software used to enrol new Aadhaar users, a three month-long investigation by HuffPost India reveals.
The patch—freely available for as little as Rs 2,500 (around $35)— allows unauthorised persons, based anywhere in the world, to generate Aadhaar numbers at will, and is still in widespread use.
This has significant implications for national security at a time when the Indian government has sought to make Aadhaar numbers the gold standard for citizen identification, and mandatory for everything from using a mobile phone to accessing a bank account.
security aadhaar identity india privacy databases data-privacy
https://pinboard.in/u:jm/b:9c715a8c2ce5/

Best Practices for DynamoDB
2018-04-17T09:51:56+00:00
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/best-practices.html
jm
dynamodb nosql aws storage databases design coding
https://pinboard.in/u:jm/b:cb830b6ff7c4/

Amazon DynamoDB Adds Support for Continuous Backups and Point-In-Time Recovery (PITR)
2018-03-27T08:37:26+00:00
https://aws.amazon.com/about-aws/whats-new/2018/03/amazon-dynamodb-adds-support-for-continuous-backups-and-point-in-time-recovery/
jm
dynamodb storage databases aws ops architecture recovery
https://pinboard.in/u:jm/b:3c9e5c94bad2/

High Volume Ingest
2017-12-19T15:03:48+00:00
https://www.circonus.com/2017/12/high-volume-ingest/
jm
circonus time-series irondb databases storage architecture coding
https://pinboard.in/u:jm/b:23ecf8defae2/

How to ensure Presto scalability in multi user case
2017-11-17T14:45:09+00:00
https://www.slideshare.net/lewuathe/how-to-ensure-presto-scalability-in-multi-use-case-70950664
jm
presto presentations slides storage databases
https://pinboard.in/u:jm/b:c9625935de47/

MaxMind DB File Format Specification
2017-10-24T14:31:08+00:00
http://maxmind.github.io/MaxMind-DB/
jm
maxmind databases storage ipv4 ipv6 addresses bst binary-search-trees trees data-structures
https://pinboard.in/u:jm/b:a3c83535331a/

A Decade of Dynamo: Powering the next wave of high-performance, internet-scale applications - All Things Distributed
2017-10-09T15:29:38+00:00
http://www.allthingsdistributed.com/2017/10/a-decade-of-dynamo.html?__s=gf36pf8g1gjugcqh6ppo
jm
A deep dive on how we were using our existing databases revealed that they were frequently not used for their relational capabilities. About 70 percent of operations were of the key-value kind, where only a primary key was used and a single row would be returned. About 20 percent would return a set of rows, but still operate on only a single table.
With these requirements in mind, and a willingness to question the status quo, a small group of distributed systems experts came together and designed a horizontally scalable distributed database that would scale out for both reads and writes to meet the long-term needs of our business. This was the genesis of the Amazon Dynamo database.
The success of our early results with the Dynamo database encouraged us to write Amazon's Dynamo whitepaper and share it at the 2007 ACM Symposium on Operating Systems Principles (SOSP conference), so that others in the industry could benefit. The Dynamo paper was well-received and served as a catalyst to create the category of distributed database technologies commonly known today as "NoSQL."
That's not an exaggeration. Nice one Werner et al!
dynamo history nosql storage databases distcomp amazon papers acm data-stores
https://pinboard.in/u:jm/b:34ce49c8b5d2/

"Why We Built Our Own Distributed Column Store" (video)
2017-10-09T15:07:32+00:00
https://www.youtube.com/watch?time_continue=168&v=tr2KcekX2kk
jm
scuba retriever storage data-stores columnar-storage honeycomb.io databases via:charitymajors
https://pinboard.in/u:jm/b:3f714f19d469/

The Israeli Digital Rights Movement's campaign for privacy | Internet Policy Review
2017-09-29T09:41:58+00:00
https://policyreview.info/articles/analysis/israeli-digital-rights-movements-campaign-privacy
jm
This study explores the persuasion techniques used by the Israeli Digital Rights Movement in its campaign against Israel’s biometric database. The research was based on analysing the movement's official publications and announcements and the journalistic discourse that surrounded their campaign within the political, judicial, and public arenas in 2009-2017. The results demonstrate how the organisation navigated three persuasion frames to achieve its goals: the unnecessity of a biometric database in democracy; the database’s ineffectiveness; and governmental incompetence in securing it. I conclude by discussing how analysing civil society privacy campaigns can shed light over different regimes of privacy governance. [....]
1. Why the database should be abolished: because it's not necessary - As the organisation highlighted repeatedly throughout the campaign with the backing of cyber experts, there is a significant difference between issuing smart documents and creating a database. Issuing smart documents effectively solves the problem of stealing and forging official documents, but does it necessarily entail the creation of a database? The activists’ answer is no: they declared that while they do support the transition to smart documents (passports and ID cards) for Israeli citizens, they object to the creation of a database due to its violation of citizens' privacy.
2. Why the database should be abolished: because it's ineffective; [...]
3. Why the database should be abolished: because it will be breached - The final argument was that the database should be abolished because the government would not be able to guarantee protection against security breaches, and hence possible identity theft.
digital-rights privacy databases id-cards israel psc drm identity-theft security
https://pinboard.in/u:jm/b:eaa5bdc6ab84/

Firms involved in biometric database in India contracted by Irish government
2017-09-09T09:20:29+00:00
https://www.irishtimes.com/business/technology/firms-involved-in-biometric-database-in-india-contracted-by-irish-government-1.3214640?mode=amp
jm
Two tech firms – one owned by businessman Dermot Desmond – involved in the creation of a controversial biometric database in India, are providing services for the Government’s public services card and passports. Known as the Aadhaar project, the Indian scheme is the world’s largest ever biometric database involving 1.2 billion citizens. Initially voluntary, it became mandatory for obtaining state services, for paying taxes and for opening a bank account.
[...]
Dermot Casey, a former chief technology officer of Storyful, said that if the Daon system was used to store the data and carry out the facial matching then the Government “appears to have purchased a biometric database system which can be extended to include voice, fingerprint and iris identification at a moment’s notice”.
Katherine O’Keefe, a data protection consultant with Castlebridge, said if the departments were using images of people’s faces to single out or identify an individual, they were “by legal definition processing biometric data”.
biometrics databases aadhar id-cards ireland psc daon morpho
https://pinboard.in/u:jm/b:ddf30f99c8bb/

Will the last person at Basho please turn out the lights? • The Register
2017-07-17T10:37:58+00:00
https://www.theregister.co.uk/2017/07/13/will_the_last_person_at_basho_get_the_lights_oh_too_late/?mt=1499966719888
jm
Basho, once a rising star of the NoSQL database world, has faded away to almost nothing [...] According to sources, the company, which developed the Riak distributed database, has been shedding engineers for months, and is now operating as a shadow of its former self, as at least one buy-out has fallen through.
basho riak nosql databases storage startups funding
https://pinboard.in/u:jm/b:91de9bb76c3c/

Don't Settle For Eventual Consistency
2017-06-30T13:01:55+00:00
https://yokota.blog/2017/02/17/dont-settle-for-eventual-consistency/
jm
With an AP system, you are giving up consistency, and not really gaining anything in terms of effective availability, the type of availability you really care about. Some might think you can regain strong consistency in an AP system by using strict quorums (where the number of nodes written + number of nodes read > number of replicas). Cassandra calls this “tunable consistency”. However, Kleppmann has shown that even with strict quorums, inconsistencies can result. So when choosing (algorithmic) availability over consistency, you are giving up consistency for not much in return, as well as gaining complexity in your clients when they have to deal with inconsistencies.
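The strict-quorum condition quoted above is simple arithmetic, sketched here in Python. The function names are made up for illustration. Note that the overlap only guarantees that every read set shares at least one node with every write set; as the quoted passage says, that alone does not rule out inconsistencies.

```python
def is_strict_quorum(n_replicas, write_nodes, read_nodes):
    """True when nodes written + nodes read > number of replicas."""
    return write_nodes + read_nodes > n_replicas


def min_overlap(n_replicas, write_nodes, read_nodes):
    """Smallest number of nodes any read set must share with any write set."""
    return max(0, write_nodes + read_nodes - n_replicas)


# Three replicas: quorum writes and quorum reads (2 + 2 > 3) always overlap.
quorum_ok = is_strict_quorum(3, 2, 2)
# Writing to one node and reading from one node (1 + 1 <= 3) may miss each other.
one_one_ok = is_strict_quorum(3, 1, 1)
```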
cap-theorem databases storage cap consistency cp ap eventual-consistency
https://pinboard.in/u:jm/b:f658ec7b1871/

_Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases_
2017-05-15T13:26:17+00:00
http://allthingsdistributed.com/files/p1041-verbitski.pdf
jm
via:rbranson aurora aws amazon databases storage papers architecture
https://pinboard.in/u:jm/b:fdc89aeea120/

Amazon DynamoDB Accelerator (DAX)
2017-04-20T10:36:09+00:00
https://aws.amazon.com/dynamodb/dax/
jm
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB that delivers up to a 10x performance improvement – from milliseconds to microseconds – even at millions of requests per second. DAX does all the heavy lifting required to add in-memory acceleration to your DynamoDB tables, without requiring developers to manage cache invalidation, data population, or cluster management.
No latency percentile figures, unfortunately. Also still in preview.
amazon dynamodb aws dax performance storage databases latency low-latency
https://pinboard.in/u:jm/b:cff8a162539c/

[untitled]
2017-03-24T09:54:31+00:00
http://www.bailis.org/papers/acidrain-sigmod2017.pdf
jm
databases transactions vulnerability security acidrain peter-bailis storage isolation acid
https://pinboard.in/u:jm/b:cc8f1e52e5ab/

Instapaper Outage Cause & Recovery
2017-02-14T10:42:03+00:00
https://medium.com/making-instapaper/instapaper-outage-cause-recovery-3c32a7e9cc5f#.39bn4xjas
jm
Without knowledge of the pre-April 2014 file size limit, it was difficult to foresee and prevent this issue. As far as we can tell, there’s no information in the RDS console in the form of monitoring, alerts or logging that would have let us know we were approaching the 2TB file size limit, or that we were subject to it in the first place. Even now, there’s nothing to indicate that our hosted database has a critical issue.
limits aws rds databases mysql filesystems ops instapaper risks
https://pinboard.in/u:jm/b:70ad5bf26582/

square/shift
2017-02-03T11:00:23+00:00
https://github.com/square/shift
jm
databases mysql sql migrations ops square ddl percona
https://pinboard.in/u:jm/b:fe923a6ce9c1/

Booking.com, MySQL and UTF-8
2016-12-12T11:21:30+00:00
https://www.percona.com/live/europe-amsterdam-2015/sites/default/files/slides/PL_AMS_Unicode_Booking169_v3.pdf
jm
utf-8 utf8mb4 mysql storage databases slides booking.com character-sets
https://pinboard.in/u:jm/b:12c8fe17e0d5/

MemC3: Compact and concurrent Memcache with dumber caching and smarter hashing
2016-11-02T17:53:21+00:00
https://blog.acolyer.org/2016/11/02/memc3-compact-and-concurrent-memcache-with-dumber-caching-and-smarter-hashing/
jm
An improved hashing algorithm called optimistic cuckoo hashing, and a CLOCK-based eviction algorithm that works in tandem with it. They are evaluated in the context of Memcached, where combined they give up to a 30% memory usage reduction and up to a 3x improvement in queries per second as compared to the default Memcached implementation on read-heavy workloads with small objects (as is typified by Facebook workloads).
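For readers unfamiliar with CLOCK eviction, here is a textbook-style sketch in Python. It is a generic single-threaded illustration of the idea (one reference bit per slot and a sweeping hand), not MemC3's actual implementation, and the `ClockCache` class is made up for this example.

```python
class ClockCache:
    """Fixed-size cache that approximates LRU with one reference bit per slot."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []   # each slot is [key, value, referenced_bit]
        self.index = {}   # key -> position in self.slots
        self.hand = 0     # the clock hand: next slot to consider for eviction

    def get(self, key):
        pos = self.index.get(key)
        if pos is None:
            return None
        self.slots[pos][2] = 1  # mark as recently used
        return self.slots[pos][1]

    def put(self, key, value):
        pos = self.index.get(key)
        if pos is not None:     # overwrite an existing entry in place
            self.slots[pos][1] = value
            self.slots[pos][2] = 1
            return
        if len(self.slots) < self.capacity:  # still room: append a new slot
            self.index[key] = len(self.slots)
            self.slots.append([key, value, 0])
            return
        # Sweep: clear reference bits until an unreferenced slot is found.
        while self.slots[self.hand][2]:
            self.slots[self.hand][2] = 0
            self.hand = (self.hand + 1) % self.capacity
        victim_key = self.slots[self.hand][0]
        del self.index[victim_key]
        self.slots[self.hand] = [key, value, 0]
        self.index[key] = self.hand
        self.hand = (self.hand + 1) % self.capacity


cache = ClockCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # sets the reference bit on "a"
cache.put("c", 3)  # the hand skips "a" and evicts "b"
```

The appeal over a linked-list LRU is that a cache hit only flips one bit instead of relinking list nodes, which is cheaper in memory and friendlier to concurrent readers.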
memcached performance key-value-stores storage databases cuckoo-hashing algorithms concurrency caching cache-eviction memory throughput
https://pinboard.in/u:jm/b:db93b3a89ad7/

Amazon ElastiCache for Redis Update – Sharded Clusters, Engine Improvements, and More | AWS Blog
2016-11-01T11:31:38+00:00
https://aws.amazon.com/blogs/aws/amazon-elasticache-for-redis-update-sharded-clusters-engine-improvements-and-more/?sc_channel=em&sc_campaign=global_LA_elasticache_2016_cluster_1_20161020_ampat.launch_la_ot-elasticache_2016_cluster_1_send&sc_publisher=aws&sc_medium=em_23618&sc_content=launch_la_ot&sc_country=global&sc_geo=global&sc_category=elasticache&sc_outcome=launch&trk=em_23618&mkt_tok=eyJpIjoiT1dSaU56UmhOR013WXpsaiIsInQiOiIxUnY0cWRSaHlQSFhpc0RzUVdhbGxRY0c5a09iekpuZ3lGVmlyNHMrRkwyM0NVeVhDQllZeDZlT1N1dVBZZlIxeVd1aVpoRmo5YmhsZWR4VDlrc0tWQk1LZlwvdkhmek9QRVJaU1hMQXBjVHc9In0%3D
jm
elasticache sharding storage aws databases redis ops
https://pinboard.in/u:jm/b:10191f450066/

Individual children's details passed to Home Office for immigration purposes | UK news | The Guardian
2016-10-13T09:06:25+00:00
https://www.theguardian.com/uk-news/2016/oct/12/individual-childrens-details-passed-to-home-office-for-immigration-purposes
jm
parents databases data pod uk home-office education schools
https://pinboard.in/u:jm/b:9bdaac33ddd0/

Charity Majors responds to the CleverTap Mongo outage war story
2016-10-04T11:26:44+00:00
https://charity.wtf/2016/10/02/the-accidental-dba/
jm
You can’t just go “dudes it’s faster” and jump off a cliff. This shit is basic. Test real production workloads. Have a rollback plan. (Not for *10 days* … try a month or two.)
The only thing I'd nitpick on is that it's all very well to say "buy my book" or "come see me talk at Blahcon", but a good blog post or webpage would be thousands of times more useful.
databases stateful-services services ops mongodb charity-majors rollback state storage testing dba
https://pinboard.in/u:jm/b:15609640fcf7/

Cross-Region Read Replicas for Amazon Aurora
2016-06-14T09:19:08+00:00
https://aws.amazon.com/blogs/aws/new-cross-region-read-replicas-for-amazon-aurora/
jm
Creating a read replica in another region also creates an Aurora cluster in the region. This cluster can contain up to 15 more read replicas, with very low replication lag (typically less than 20 ms) within the region (between regions, latency will vary based on the distance between the source and target). You can use this model to duplicate your cluster and read replica setup across regions for disaster recovery. In the event of a regional disruption, you can promote the cross-region replica to be the master. This will allow you to minimize downtime for your cross-region application. This feature applies to unencrypted Aurora clusters.
aws mysql databases storage replication cross-region failover reliability aurorahttps://pinboard.in/https://pinboard.in/u:jm/b:4bc9688d386b/_DataEngConf: Parquet at Datadog_2016-05-18T10:07:35+00:00
http://www.slideshare.net/g33ktalk/dataengconf-parquet-at-datadog-fast-efficient-portable-storage-for-big-data
jmdatadog parquet storage s3 databases hadoop map-reduce big-datahttps://pinboard.in/https://pinboard.in/u:jm/b:280970445c33/Counting with domain specific databases — The Smyte Blog — Medium2016-04-05T16:22:06+00:00
https://medium.com/the-smyte-blog/counting-with-domain-specific-databases-73c660472da#.5ax7b2kqo
jmkafka rocksdb kubernetes counting databases storage opshttps://pinboard.in/https://pinboard.in/u:jm/b:df965fcb6bab/These unlucky people have names that break computers2016-03-29T11:55:39+00:00
http://www.bbc.com/future/story/20160325-the-names-that-break-computer-systems
jmdatabases design programming names coding japan schemashttps://pinboard.in/https://pinboard.in/u:jm/b:005f384de103/Jepsen: RethinkDB 2.1.52016-01-04T14:05:03+00:00
https://aphyr.com/posts/329-jepsen-rethinkdb-2-1-5
jmI’ve run hundreds of tests against RethinkDB at majority/majority, at various timescales, request rates, concurrencies, and with different types of failures. Consistent with the documentation, I have never found a linearization failure with these settings. If you use hard durability, majority writes, and majority reads, single-document ops in RethinkDB appear safe.
rethinkdb databases stores storage ops availability cap jepsen tests replicationhttps://pinboard.in/https://pinboard.in/u:jm/b:2efee6cf2e5a/Open-sourcing PalDB, a lightweight companion for storing side data2015-10-28T15:35:31+00:00
https://engineering.linkedin.com/blog/2015/10/open-sourcing-paldb--a-lightweight-companion-for-storing-side-da
jmlinkedin open-source storage side-data data config paldb java apache databaseshttps://pinboard.in/https://pinboard.in/u:jm/b:5f9ff3f038de/Your Relative's DNA Could Turn You Into A Suspect2015-10-16T21:51:35+00:00
http://www.wired.com/2015/10/familial-dna-evidence-turns-innocent-people-into-crime-suspects/
jmThe bewildered Usry soon learned that he was a suspect in the 1996 murder of an Idaho Falls teenager named Angie Dodge. Though a man had been convicted of that crime after giving an iffy confession, his DNA didn’t match what was found at the crime scene. Detectives had focused on Usry after running a familial DNA search, a technique that allows investigators to identify suspects who don’t have DNA in a law enforcement database but whose close relatives have had their genetic profiles cataloged. In Usry’s case the crime scene DNA bore numerous similarities to that of Usry’s father, who years earlier had donated a DNA sample to a genealogy project through his Mormon church in Mississippi. That project’s database was later purchased by Ancestry, which made it publicly searchable—a decision that didn’t take into account the possibility that cops might someday use it to hunt for genetic leads.
Usry, whose story was first reported in The New Orleans Advocate, was finally cleared after a nerve-racking 33-day wait — the DNA extracted from his cheek cells didn’t match that of Dodge’s killer, whom detectives still seek. But the fact that he fell under suspicion in the first place is the latest sign that it’s time to set ground rules for familial DNA searching, before misuse of the imperfect technology starts ruining lives.
dna familial-dna false-positives law crime idaho murder mormon genealogy ancestry.com databases biometrics privacy geneshttps://pinboard.in/https://pinboard.in/u:jm/b:bff8b2e55332/Cluster benchmark: Scylla vs Cassandra2015-10-15T10:49:07+00:00
http://www.scylladb.com/2015/10/13/cluster-benchmark/
jmscylla databases storage cassandra nosqlhttps://pinboard.in/https://pinboard.in/u:jm/b:aa43216c9e0e/After Bara: All your (Data)base are belong to us2015-10-12T14:50:00+00:00
http://www.mcgarrsolicitors.ie/2015/10/03/all-your-database-are-belong-to-us/
jmArticles 10, 11 and 13 of Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995, on the protection of individuals with regard to the processing of personal data and on the free movement of such data, must be interpreted as precluding national measures, such as those at issue in the main proceedings, which allow a public administrative body of a Member State to transfer personal data to another public administrative body and their subsequent processing, without the data subjects having been informed of that transfer or processing.
data databases bara cjeu eu law privacy data-protectionhttps://pinboard.in/https://pinboard.in/u:jm/b:84a29835512b/Outage postmortem (2015-10-08 UTC) : Stripe: Help & Support2015-10-10T20:07:25+00:00
https://support.stripe.com/questions/outage-postmortem-2015-10-08-utc
jmThere was a breakdown in communication between the developer who requested the index migration and the database operator who deleted the old index. Instead of working on the migration together, they communicated in an implicit way through flawed tooling. The dashboard that surfaced the migration request was missing important context: the reason for the requested deletion, the dependency on another index’s creation, and the criticality of the index for API traffic. Indeed, the database operator didn’t have a way to check whether the index had recently been used for a query.
Good demo of how the Etsy-style chatops deployment approach would have helped avoid this risk.
stripe postmortem outages databases indexes deployment chatops deploy opshttps://pinboard.in/https://pinboard.in/u:jm/b:464d897d2cd9/SQL on Kafka using PipelineDB2015-09-30T09:07:46+00:00
https://www.pipelinedb.com/blog/sql-on-kafka
jmlogging sql kafka pipelinedb streaming sliding-window databases search queryinghttps://pinboard.in/https://pinboard.in/u:jm/b:9e64c188cfbc/Is there a CAP theorem for Durability?2015-09-28T11:47:51+00:00
http://brooker.co.za/blog/2015/09/26/cap-durability.html
jmdatabases storage marc-brooker cap-theorem cap durability pacelc nosqlhttps://pinboard.in/https://pinboard.in/u:jm/b:9a4e9ff129bc/Scaling Analytics at Amplitude2015-08-31T14:06:02+00:00
https://amplitude.com/blog/2015/08/25/scaling-analytics-at-amplitude/
jmlambda-architecture analytics via:hn redis set-storage storage databases architecture s3 realtimehttps://pinboard.in/https://pinboard.in/u:jm/b:cb0a1f26939e/Mikhail Panchenko's thoughts on the July 2015 CircleCI outage2015-07-21T10:20:42+00:00
http://blog.mihasya.com/2015/07/19/thoughts-evoked-by-circleci-outage.html
jmdatabase-is-not-a-queue mysql sql databases ops outages postmortemshttps://pinboard.in/https://pinboard.in/u:jm/b:689da69282e3/Elements of Scale: Composing and Scaling Data Platforms2015-05-25T15:58:46+00:00
http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/
jmarchitecture storage databases data big-data scaling scalability ben-stopford cqrs druid parquet columnar-stores lambda-architecturehttps://pinboard.in/https://pinboard.in/u:jm/b:50954c7dd941/Please stop calling databases CP or AP2015-05-14T23:29:48+00:00
https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html
jmIn his excellent blog post [...] Jeff Hodges recommends that you use the CAP theorem to critique systems. A lot of people have taken that advice to heart, describing their systems as “CP” (consistent but not available under network partitions), “AP” (available but not consistent under network partitions), or sometimes “CA” (meaning “I still haven’t read Coda’s post from almost 5 years ago”).
I agree with all of Jeff’s other points, but with regard to the CAP theorem, I must disagree. The CAP theorem is too simplistic and too widely misunderstood to be of much use for characterizing systems. Therefore I ask that we retire all references to the CAP theorem, stop talking about the CAP theorem, and put the poor thing to rest. Instead, we should use more precise terminology to reason about our trade-offs.
cap databases storage distcomp ca ap cp zookeeper consistency reliability networkinghttps://pinboard.in/https://pinboard.in/u:jm/b:36f8f38a1e8c/Call me maybe: Aerospike2015-05-05T22:17:45+00:00
https://aphyr.com/posts/324-call-me-maybe-aerospike
jmaerospike outages cap testing jepsen aphyr databases storage reliabilityhttps://pinboard.in/https://pinboard.in/u:jm/b:4d9d3218df60/Making Pinterest — Learn to stop using shiny new things and love MySQL2015-04-16T15:19:35+00:00
http://engineering.pinterest.com/post/116038532184/learn-to-stop-using-shiny-new-things-and-love
jmmysql storage databases reliability pinterest architecturehttps://pinboard.in/https://pinboard.in/u:jm/b:ee79bcd2c695/devbook/README.md at master · barsoom/devbook2015-03-25T15:17:01+00:00
https://github.com/barsoom/devbook/blob/master/deploy_without_downtime/README.md
jmactiverecord fail rails mysql sql migrations databases schemas releasinghttps://pinboard.in/https://pinboard.in/u:jm/b:be719470c073/Goodbye MongoDB, Hello PostgreSQL2015-03-11T10:15:28+00:00
http://developer.olery.com/blog/goodbye-mongodb-hello-postgresql/
jmAnother core problem we’ve faced is one of the fundamental features of MongoDB (or any other schemaless storage engine): the lack of a schema. The lack of a schema may sound interesting, and in some cases it can certainly have its benefits. However, for many the usage of a schemaless storage engine leads to the problem of implicit schemas. These schemas aren’t defined by your storage engine but instead are defined based on application behaviour and expectations.
Well, don't say we didn't warn you ;)
mongodb mysql postgresql databases storage schemas war-storieshttps://pinboard.in/https://pinboard.in/u:jm/b:da312f02b590/0x74696d | Falling In And Out Of Love with DynamoDB, Part II2015-02-10T23:29:11+00:00
http://0x74696d.com/posts/falling-in-and-out-of-love-with-dynamodb-part-ii/
jmaws dynamodb storage databases architecture opshttps://pinboard.in/https://pinboard.in/u:jm/b:59faf2935f46/Registering children: Ireland’s Primary Online Database2015-01-11T22:23:38+00:00
https://medium.com/@davemolloy/registering-children-irelands-primary-online-database-f11254444cca
jmIf you haven’t heard about it, it is a compulsory database of the personal information of children, including PPS numbers, ethnicity, race and language skills, to be held for decades and shared across State agencies.
privacy ppsn databases pod ireland children kids primary-schoolshttps://pinboard.in/https://pinboard.in/u:jm/b:9aa5f2642386/Good advice on running large-scale database stress tests2014-12-11T14:19:12+00:00
https://twitter.com/aphyr/statuses/542779934851072000
jmbiebermark benchmarks testing performance stress-tests databases storage mongodb innodb foundationdb aphyr measurement distributions keys zipfianhttps://pinboard.in/https://pinboard.in/u:jm/b:9fb30019be5a/If Eventual Consistency Seems Hard, Wait Till You Try MVCC2014-12-09T16:42:44+00:00
http://www.xaprb.com/blog/2014/12/08/eventual-consistency-simpler-than-mvcc/
jmSince I am not ready to assert that there’s a distributed system I know to be better and simpler than eventually consistent datastores, and since I certainly know that InnoDB’s MVCC implementation is full of complexities, for right now I am probably in the same position most of my readers are: the two viable choices seem to be single-node MVCC and multi-node eventual consistency. And I don’t think MVCC is the simpler paradigm of the two.
nosql concurrency databases mysql riak voldemort eventual-consistency reliability storage baron-schwartz mvcc innodb postgresqlhttps://pinboard.in/https://pinboard.in/u:jm/b:0dddeb567400/Aurora for MySQL is coming2014-12-09T16:40:38+00:00
http://smalldatum.blogspot.ie/2014/11/aurora-for-mysql-is-coming.html?showComment=1416563079921
jmmysql databases aurora aws ec2 sql storage transactions replicationhttps://pinboard.in/https://pinboard.in/u:jm/b:b6f304df35af/Hermitage: Testing the "I" in ACID2014-11-28T16:53:54+00:00
http://martin.kleppmann.com/2014/11/25/hermitage-testing-the-i-in-acid.html
jm[Hermitage is] a test suite for databases which probes for a variety of concurrency issues, and thus allows a fair and accurate comparison of isolation levels. Each test case simulates a particular kind of race condition that can happen when two or more transactions concurrently access the same data. Each test can pass (if the database’s implementation of isolation prevents the race condition from occurring) or fail (if the race condition does occur).
acid architecture concurrency databases nosqlhttps://pinboard.in/https://pinboard.in/u:jm/b:edfb2ab71a2b/"Macaroons" for fine-grained secure database access2014-11-28T16:48:51+00:00
http://hackingdistributed.com/2014/11/23/macaroons-in-hyperdex/
jmMacaroons are an excellent fit for NoSQL data storage for several reasons. First, they enable an application developer to enforce security policies at very fine granularity, per object. Gone are the clunky security policies based on the IP address of the client, or the per-table access controls of RDBMSs that force you to split up your data across many tables. Second, macaroons ensure that a client compromise does not lead to loss of the entire database. Third, macaroons are very flexible and expressive, able to incorporate information from external systems and third-party databases into authorization decisions. Finally, macaroons scale well and are incredibly efficient, because they avoid public-key cryptography and instead rely solely on fast hash functions.
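The "rely solely on fast hash functions" point can be illustrated with a minimal sketch of the HMAC chaining idea behind macaroons. This is not HyperDex's API; the function names (`mint`, `attenuate`, `verify`), the dict layout, and the caveat strings are all hypothetical, and real macaroons also support third-party caveats, which are omitted here.

```python
import hashlib
import hmac


def _chain(key: bytes, message: bytes) -> bytes:
    """One link of the chain: HMAC-SHA256 of message under key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def mint(root_key: bytes, identifier: bytes) -> dict:
    """Server side: create a token whose signature is derived from the root key."""
    return {
        "identifier": identifier,
        "caveats": [],
        "signature": _chain(root_key, identifier),
    }


def attenuate(macaroon: dict, caveat: bytes) -> dict:
    """Anyone holding the token can add a restriction without knowing the root key:
    the old signature becomes the HMAC key for the new caveat."""
    return {
        "identifier": macaroon["identifier"],
        "caveats": macaroon["caveats"] + [caveat],
        "signature": _chain(macaroon["signature"], caveat),
    }


def verify(root_key: bytes, macaroon: dict, satisfied: set) -> bool:
    """Server side: every caveat must hold, and the recomputed chain must match."""
    sig = _chain(root_key, macaroon["identifier"])
    for caveat in macaroon["caveats"]:
        if caveat not in satisfied:
            return False
        sig = _chain(sig, caveat)
    return hmac.compare_digest(sig, macaroon["signature"])


# Hypothetical usage: restrict a token to a single object.
root = b"server-side-secret"
token = attenuate(mint(root, b"user-42"), b"object = photo/7")
allowed = verify(root, token, {b"object = photo/7"})
```

Because each signature keys the next HMAC, a holder can narrow a token but cannot strip a caveat without the root key, which is what makes per-object policies cheap to check.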
security macaroons cookies databases nosql case-studies storage authorization hyperdexhttps://pinboard.in/https://pinboard.in/u:jm/b:a0f88a624765/Mnesia and CAP2014-10-06T15:19:54+00:00
https://medium.com/@jlouis666/mnesia-and-cap-d2673a92850
jmA common “trick” is to claim:
'We assume network partitions can’t happen. Therefore, our system is CA according to the CAP theorem.'
This is a nice little twist. By asserting network partitions cannot happen, you just made your system into one which is not distributed. Hence the CAP theorem doesn’t even apply to your case and anything can happen. Your system may be linearizable. Your system might have good availability. But the CAP theorem doesn’t apply. [...]
In fact, any well-behaved system will be “CA” as long as there are no partitions. This makes the statement of a system being “CA” very weak, because it doesn’t put honesty first. It tries to avoid the hard question, which is how the system operates under failure. By assuming no network partitions, you assume perfect information knowledge in a distributed system. This isn’t the physical reality.
cap erlang mnesia databases storage distcomp reliability ca postgres partitionshttps://pinboard.in/https://pinboard.in/u:jm/b:cb712d5066e6/Understanding weak isolation is a serious problem2014-09-17T10:11:40+00:00
http://www.bailis.org/blog/understanding-weak-isolation-is-a-serious-problem/
jmacid consistency databases peter-bailis transactional corruption serializability isolation reliabilityhttps://pinboard.in/https://pinboard.in/u:jm/b:5c9cae29becb/Aerospike's CA boast gets a thumbs-down from @aphyr2014-09-09T09:43:24+00:00
https://twitter.com/aphyr/statuses/509104476976717825
jmSpecifically, @aerospikedb cannot offer cursor stability, repeatable read, snapshot isolation, or any flavor of serializability.
@nasav @aerospikedb At *best* you can offer Read Committed, which is not, I assert, what most people would expect from an "ACID" database.
aphyr aerospike availability consistency acid transactions distcomp databases storagehttps://pinboard.in/https://pinboard.in/u:jm/b:874bcb9ed3ec/