Imagine a simple big-data issue: I have billions of entries and need to access them by a predefined key.

In the use case I have in mind, search is not important. What matters is that I can ask the DB "does this entry already exist?" and receive a yes/no answer really fast. And the number of entries is going to be enormous.

I figured out that key/value databases are optimized for exactly this type of use case. However, when I search for the scaling behaviour of the underlying algorithm, I get a huge number of hits on scaling NoSQL DBs by adding new computer nodes, and nothing on how access time scales with the number of entries.

At first I thought that the lookup time should scale as O(log(n)), as with a tree or sorted index, but apparently it can be better than that: according to some results it could even be O(1), presumably via hashing. Then again, I don't quite believe in O(1) scaling for billions of entries. Something is bound to break, sooner or later.
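To make the access pattern concrete, here is a minimal sketch of what I mean by a pure existence check, using an in-memory hash set (the key format is just a made-up example). Hash-based structures like this are exactly the reason O(1) average-case lookup is claimed: the lookup cost depends on hashing the key, not on how many keys are stored.

```python
# Sketch of the access pattern: insert keys once, then answer
# "does this key exist?" queries. A hash set gives average-case O(1)
# membership checks, independent of the number of stored entries.
entries = set()

# Populate with some example keys (hypothetical key format).
for i in range(1_000_000):
    entries.add(f"key-{i}")

# Existence queries: yes/no answers, no search needed.
print("key-500000" in entries)   # True
print("key-9999999" in entries)  # False
```

Of course, an in-memory `set` sidesteps the real question: at billions of entries the table no longer fits in RAM, and the constant factors (hash collisions, resizing, disk or network round trips) are where the O(1) claim starts to strain.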
