Pinboard (jm)
https://pinboard.in/u:jm/public/
recent bookmarks from jmRangeBitmap2022-03-10T15:04:14+00:00
https://richardstartin.github.io/posts/range-bitmap-index
jmbitmaps algorithms coding ranges indexes indexing databases storage pinot richard-startinhttps://pinboard.in/https://pinboard.in/u:jm/b:74bce9fe122d/lengstrom/falcon2021-07-26T18:52:50+00:00
https://github.com/lengstrom/falcon
jmChrome extension for flexible full text browsing history search. Press f, then space or tab, in the omnibar to start searching your previously visited websites!
Every time you visit a website in Chrome, Falcon indexes all the text on the page so that the site can be easily found later. Then, for example, if you type f mugwort, Falcon will show the websites you visited containing the text "mugwort"! Install from the Chrome store here or get the CRX file!
]]>extension chrome search falcon indexinghttps://pinboard.in/https://pinboard.in/u:jm/b:c7d72de4cde6/Star-Tree Index: Powering Fast Aggregations on Pinot | LinkedIn Engineering2020-01-22T23:05:56+00:00
https://engineering.linkedin.com/blog/2019/06/star-tree-index--powering-fast-aggregations-on-pinot
jmWith such huge improvements for both latency and throughput, the Star-Tree index only costs about 12% extra storage space compared to data without indexing techniques and 6% extra compared to data with inverted index.
]]>star-tree sql querying search pinot linkedin algorithms databases indexing indexeshttps://pinboard.in/https://pinboard.in/u:jm/b:476e7d283e21/glibc changed their UTF-8 character collation ordering across versions, breaking postgres2019-01-11T11:19:17+00:00
https://www.postgresql.org/message-id/flat/BA6132ED-1F6B-4A0B-AC22-81278F5AB81E%40tripadvisor.com
jmStreaming replicas—and by extension, base backups—can become dangerously broken when the source and target machines run slightly different versions of glibc. Particularly, differences in strcoll and strcoll_l leave "corrupt" indexes on the slave. These indexes are sorted out of order with respect to the strcoll running on the slave. Because postgres is unaware of the discrepancy is uses these "corrupt" indexes to perform merge joins; merges rely heavily on the assumption that the indexes are sorted and this causes all the results of the join past the first poison pill entry to not be returned. Additionally, if the slave becomes master, the "corrupt" indexes will in cases be unable to enforce uniqueness, but quietly allow duplicate values.
Moral of the story -- keep your libc versions in sync across storage replication sets!]]>postgresql scary ops glibc collation utf-8 characters indexing sorting replicas postgreshttps://pinboard.in/https://pinboard.in/u:jm/b:7a5ed209308e/[LUCENE-6917] Deprecate and rename NumericField/RangeQuery to LegacyNumeric - ASF JIRA2015-12-14T11:30:30+00:00
https://issues.apache.org/jira/browse/LUCENE-6917
jmlucene performance algorithms patches bkd-trees geodata numeric indexinghttps://pinboard.in/https://pinboard.in/u:jm/b:e2894da4d414/"A New Data Structure For Cumulative Frequency Tables"2014-05-02T13:46:18+00:00
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=5E9495D2FDC3BFF4FBA425AA6A4B3E0F?doi=10.1.1.14.8917&rep=rep1&type=pdf
jmnetty frequency-tables data-structures algorithms coding binary-tree indexing compression symbol-alphabetshttps://pinboard.in/https://pinboard.in/u:jm/b:f5fcb61a33cc/FM-index2014-03-18T22:08:39+00:00
http://en.wikipedia.org/wiki/FM-index
jma compressed full-text substring index based on the Burrows-Wheeler transform, with some similarities to the suffix array. It was created by Paolo Ferragina and Giovanni Manzini,[1] who describe it as an opportunistic data structure as it allows compression of the input text while still permitting fast substring queries. The name stands for 'Full-text index in Minute space'. It can be used to efficiently find the number of occurrences of a pattern within the compressed text, as well as locate the position of each occurrence. Both the query time and storage space requirements are sublinear with respect to the size of the input data.
kragen notes 'gene sequencing is using [them] in production'.]]>sequencing bioinformatics algorithms bowtie fm-index indexing compression search burrows-wheeler bwt full-text-searchhttps://pinboard.in/https://pinboard.in/u:jm/b:007b83895a74/FastBit: An Efficient Compressed Bitmap Index Technology2013-04-09T23:16:02+00:00
http://crd-legacy.lbl.gov/~kewu/fastbit/
jman [LGPL] open-source data processing library following the spirit of NoSQL movement. It offers a set of searching functions supported by compressed bitmap indexes. It treats user data in the column-oriented manner similar to well-known database management systems such as Sybase IQ, MonetDB, and Vertica. It is designed to accelerate user's data selection tasks without imposing undue requirements. In particular, the user data is NOT required to be under the control of FastBit software, which allows the user to continue to use their existing data analysis tools.
The key technology underlying the FastBit software is a set of compressed bitmap indexes. In database systems, an index is a data structure to accelerate data accesses and reduce the query response time. Most of the commonly used indexes are variants of the B-tree, such as B+-tree and B*-tree. FastBit implements a set of alternative indexes called compressed bitmap indexes. Compared with B-tree variants, these indexes provide very efficient searching and retrieval operations, but are somewhat slower to update after a modification of an individual record.
A key innovation in FastBit is the Word-Aligned Hybrid compression (WAH) for the bitmaps.[...] Another innovation in FastBit is the multi-level bitmap encoding methods.
]]>fastbit nosql algorithms indexing search compressed-bitmaps indexes wah bitmaps compressionhttps://pinboard.in/https://pinboard.in/u:jm/b:16e9a9a6e9a4/Why I'm Walking Away From CouchDB2013-04-09T09:41:26+00:00
http://donpark.org/blog/2012/06/01/why-i-m-walking-away-from-couchdb
jmIn practice there are two gotchas that are so painful I am looking for a replacement with a different featureset than couchdb provides. The location tracking project icecondor.com uses couchdb to store 20,000 new records per day. It has more write traffic than read traffic and runs on modest hardware. Those two gotchas are:
1. View Index updates.
While I have a vague understanding of why view index updates are slow and bulky and important, in practice it is unworkable. Every write sets up a trap for the first reader to come along after the write. The more writes there are, the bigger the trap for the first reader which has to wait on the couchdb process that refreshes the view index on an as-needed basis. I believe this trade-off was made to keep writes fast. No need to update the view index until all writes are actually complete, right? Write traffic is heavier than read traffic and the time needed for that index refresh causes the webapp to crash because its not setup to handle timeouts from a database query. The workaround is as hackish as one can imagine - cron jobs to hit every map/reduce query to keep indexes fresh.
2. Append only database file
Append only is in theory a great way to ensure on-disk reliability. A system crash during an append should only affect that append. Its a crash during an update to existing parts of the file that risks the integrity of more than whats being updated. With so many layers of caching and optimizations in the kernel and the filesystem and now in the workings of SSD drives, I'm not sure append-only gives extra protection anymore.
What it does do is a create a huge operational headache. The on-disk file can never grow beyond half the available storage space. Record deletion uses new disk space and if the half-full mark approaches, vacuuming must be done. The entire database is rewritten to the filesystem, leaving out no longer needed records. If the data file should happen to grow beyond half the partition, the system has esentially crashed because there is no way to compact the file and soon the partition will be full. This is a likely scenario when there is a lot of record deletion activity.
The system in question does a lot of writes of temporary data that is followed up by deletes a few days later. There is also a lot of permanent storage that hardly gets used. Rewriting every byte of the records that are long-lived due to compaction is an enormous amount of wasted I/O - doubly so given SSD drives have a short write-cycle lifespan.
]]>nosql couchdb consistency checkpointing databases data-stores indexinghttps://pinboard.in/https://pinboard.in/u:jm/b:75d8509da75f/The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases [PDF]2013-01-08T10:44:22+00:00
http://www-db.in.tum.de/~leis/papers/ART.pdf
jmvia:fanf data-structures trees indexing cache-aware trieshttps://pinboard.in/https://pinboard.in/u:jm/b:9ea453ccb976/A fast, fuzzy, full-text index using Redis2010-05-05T17:08:53+00:00
http://playnice.ly/blog/2010/05/05/a-fast-fuzzy-full-text-index-using-redis/
jmmetaphone sounds-like indexing python redis search full-text fuzzyhttps://pinboard.in/u:jm/b:1a3b4637d2f5/Damn Cool Algorithms: Spatial indexing2009-11-09T19:27:40+00:00
http://blog.notdot.net/2009/11/Damn-Cool-Algorithms-Spatial-indexing-with-Quadtrees-and-Hilbert-Curves
jmalgorithms mapping gis indexing quadtree datastructures spatial geometryhttps://pinboard.in/u:jm/b:7e78365bbdf8/