How to find the most frequent items in count-min sketch?

08 August 2014

Count-min sketch is a very influential and wonderful invention by Graham Cormode and Muthukrishnan. It has many applications, one of which is finding the most frequent items in data streams.

I don't understand how the identities of the items are recovered from the sketch, since the technique uses hash functions, which are one-way. This means that given an item X and a hash function H, we have H(X) = y, but there is no way to recover X if you only have H and y.

Count-min sketch uses hash functions to map items to the sketch. So after processing many items, say millions, given only the sketch and the hash functions, how are the frequent items found if the items themselves are not known? The answer is trivial if you know the items.

So I really don't understand how they manage to do this. I have been reading different papers but can't understand it. I would be thankful if you could explain it.

Article An improved data stream summary: The Count-Min Sketch and it...

Fabrice Clerot

You just do it online: while maintaining your sketch, you keep a list of the heavy hitters (HH) seen so far.

Every time you update your CMS (that is, every time your stream gives you an event (key, value)):

- you maintain the total stream count so far

- you check whether the new (estimated) aggregate value for the key in the CMS makes it part of the HH list; if "key" is already a HH, just update its (estimated) aggregate count; if "key" is not already a HH, insert it in the HH list with its (estimated) aggregate value; finally, clean the HH list of the elements which are not HH anymore (this cleaning need not be done at every event; the other steps must be done on every event)

- at any point in time (and therefore also at the end of the stream), your HH list contains all the true heavy hitters; moreover, with probability (1-d), it contains only keys with true aggregate counts greater than (f-e)*A, where A is the total stream count, f is the target level for the HH and (d, e) are the parameters of the CMS
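The steps above can be sketched in Python roughly as follows. This is only a minimal illustration, not a reference implementation: the class names, the use of salted `hashlib.blake2b` as the row hash functions, and the parameter names `phi`, `eps`, `delta` (standing for f, e, d) are all assumptions made for the example, and values are assumed to be positive.

```python
import hashlib
import math


class CountMinSketch:
    """Plain count-min sketch; stores counters only, never the keys."""

    def __init__(self, eps, delta):
        self.width = math.ceil(math.e / eps)                   # counters per row
        self.depth = max(1, math.ceil(math.log(1.0 / delta)))  # number of rows
        self.rows = [[0] * self.width for _ in range(self.depth)]

    def _index(self, row, key):
        # Assumption: one salted blake2b hash per row stands in for the
        # pairwise-independent hash family of the original construction.
        salt = row.to_bytes(8, "little")
        digest = hashlib.blake2b(key.encode(), digest_size=8, salt=salt).digest()
        return int.from_bytes(digest, "little") % self.width

    def estimate(self, key):
        # The minimum over rows never underestimates the true count.
        return min(self.rows[r][self._index(r, key)] for r in range(self.depth))

    def update(self, key, value=1):
        for r in range(self.depth):
            self.rows[r][self._index(r, key)] += value
        return self.estimate(key)


class HeavyHitterTracker:
    """Keeps the candidate keys next to the sketch, as described above."""

    def __init__(self, phi, eps, delta):
        self.phi = phi            # target level f for a heavy hitter
        self.cms = CountMinSketch(eps, delta)
        self.total = 0            # total stream count A so far
        self.candidates = {}      # key -> estimated count at its last arrival

    def update(self, key, value=1):
        self.total += value
        est = self.cms.update(key, value)
        # The key is in hand right now, so it can be stored explicitly.
        if key in self.candidates or est >= self.phi * self.total:
            self.candidates[key] = est

    def prune(self):
        # Cleaning step; it does not have to run on every event.
        threshold = self.phi * self.total
        self.candidates = {k: v for k, v in self.candidates.items() if v >= threshold}

    def heavy_hitters(self):
        self.prune()
        return dict(self.candidates)
```

The point is that the keys are never recovered from the hashes: each key is seen in the stream at the moment of the update, and that is when it is put in the list. Calling `update(key)` for every event and `heavy_hitters()` at the end returns the candidate keys with their estimated counts.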

Zubair Shah

Thanks Fabrice Clerot for your cooperation.

Zubair Shah

Thanks Christian Sohler for your explanation. I have also read about this in the paper "Finding frequent items in data streams" (http://dl.acm.org/citation.cfm?id=1454225), but could not understand the technique. First of all, in a data stream the universe is usually not known. And if, let's say, a guess or some probabilistic approach is used to approximate the value of U, then the question is how this representation works to recover the key.

I really don't understand the mechanism of using these additional counters to retrieve the keys for the HH.

I would be very thankful if you could explain this (if possible with some toy example).

Fabrice Clerot

If you are in a setting which does not allow the online procedure (change detection on a stream by comparing sketches associated with jumping windows, for instance), you might be interested in reading the attached paper:

Reversible Sketches for Efficient and Accurate Change Detection over Network Data Streams

Robert Schweller, Ashish Gupta, Elliot Parsons, Yan Chen

ABSTRACT

Traffic anomalies such as failures and attacks are increasing in frequency and severity, and thus identifying them rapidly and accurately is critical for large network operators. The detection typically treats the traffic as a collection of flows and looks for heavy changes in traffic patterns (e.g., volume, number of connections). However, as link speeds and the number of flows increase, keeping per-flow state is not scalable. The recently proposed sketch-based schemes [14] are among the very few that can detect heavy changes and anomalies over massive data streams at network traffic speeds. However, sketches do not preserve the key (e.g., source IP address) of the flows. Hence, even if anomalies are detected, it is difficult to infer the culprit flows, making it a big practical hurdle for online deployment. Meanwhile, the number of keys is too large to record.

To address this challenge, we propose efficient reversible hashing algorithms to infer the keys of culprit flows from sketches without storing any explicit key information. No extra memory or memory accesses are needed for recording the streaming data. Meanwhile, the heavy change detection daemon runs in the background with space complexity and computational time sublinear to the key space size. This short paper describes the conceptual framework of the reversible sketches, as well as some initial approaches for implementation. See [23] for the optimized algorithms in details. Evaluated with netflow traffic traces of a large edge router, we demonstrate that the reverse hashing can quickly infer the keys of culprit flows even for many changes with high accuracy.

Zubair Shah

Thanks Fabrice Clerot, your replies are very helpful. I will go through the paper in order to get the idea.
