Count-Min sketch is a simple technique for summarizing large amounts of frequency data. It is widely used wherever large data streams have to be processed.
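To make the idea concrete, here is a minimal Python sketch of the data structure. It is not the code linked in this description. The class name, the default `width` and `depth`, and the use of salted `blake2b` as the per-row hash are all assumptions chosen for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter: `depth` rows of `width` counters each."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, obtained by salting the hash with the row number.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over the rows
        # is the tightest estimate and never undercounts.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


sketch = CountMinSketch(width=1000, depth=5)
for word in ["a", "b", "a", "c", "a"]:
    sketch.add(word)

estimate_a = sketch.estimate("a")  # at least 3, the true count of "a"
```

Memory use is fixed at `width * depth` counters no matter how many distinct items arrive. The price is that estimates may be too high, never too low.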
Donate/Patreon: [ Link ]
CODE:
----------------------------------------------------------------------------
By Varun Vats: [ Link ]
Applications of Count-Min sketch:
----------------------------------------------------------------------------
[ Link ]
[ Link ]
[ Link ]
Applications using Count Tracking
There are dozens of applications of count tracking and, in particular, of the Count-Min sketch data structure that go beyond the task of approximating data distributions. We give three examples.
1. A more general query is to identify the Heavy-Hitters, that is, the query HH(k) returns the set of items which have large frequency (say 1/k of the overall frequency). Count tracking can be used to directly answer this query, by considering the frequency of each item. When there are very many possible items, answering the query in this way can be quite slow. The process can be sped up immensely by keeping additional information about the frequencies of groups of items [6], at the expense of storing additional sketches. As well as being of interest in mining applications, finding heavy-hitters is also of interest in the context of signal processing. Here, viewing the signal as defining a data distribution, recovering the heavy-hitters is key to building the best approximation of the signal. As a result, the Count-Min sketch can be used in compressed sensing, a signal acquisition paradigm that has recently revolutionized signal processing [7].
2. One application where very large data sets arise is in Natural Language Processing (NLP). Here, it is important to keep statistics on the frequency of word combinations, such as pairs or triplets of words that occur in sequence. In one experiment, researchers compacted a large 90GB corpus down to a (memory friendly) 8GB Count-Min sketch [8]. This proved to be just as effective for their word similarity tasks as using the exact data.
3. A third example is in designing a mechanism to help users pick a safe password. To make password guessing difficult, we can track the frequency of passwords online and disallow currently popular ones. This is precisely the count tracking problem. Recently, this was put into practice using the Count-Min data structure to do count tracking (see [ Link ]). A nice feature of this solution is that the impact of a false positive (erroneously declaring a rare password choice to be too popular and so disallowing it) is only a mild inconvenience to the user.
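The first and third examples can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the cited work. The helper names `heavy_hitters` and `is_too_popular`, the `max_share` cutoff of 1%, and the sketch dimensions are hypothetical. The heavy-hitter query here uses the direct approach of checking every candidate item, not the faster group-based method from [6].

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.total = 0  # overall number of updates, needed for the 1/k threshold

    def _cells(self, item):
        # Yield one (row, column) pair per row, using a row-salted hash.
        for row in range(self.depth):
            h = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                salt=row.to_bytes(8, "little")).digest()
            yield row, int.from_bytes(h, "little") % self.width

    def add(self, item, count=1):
        self.total += count
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Minimum over rows: may overcount, never undercounts.
        return min(self.table[row][col] for row, col in self._cells(item))


def heavy_hitters(sketch, candidates, k):
    """HH(k): candidates whose estimated count is at least total / k."""
    threshold = sketch.total / k
    return {c for c in candidates if sketch.estimate(c) >= threshold}


def is_too_popular(sketch, password, max_share=0.01):
    """True if the password's estimated share of all choices exceeds max_share."""
    return sketch.total > 0 and sketch.estimate(password) > max_share * sketch.total


# Toy stream of password choices (made-up data for illustration only).
stream = ["123456"] * 50 + ["password"] * 30 + [f"unique-{i}" for i in range(20)]
sketch = CountMinSketch()
for pw in stream:
    sketch.add(pw)

popular = heavy_hitters(sketch, set(stream), k=5)  # threshold is 100 / 5 = 20
reject = is_too_popular(sketch, "123456")          # 50 of 100 choices, far above 1%
```

Because the sketch never undercounts, a truly popular password is always caught. The only possible error is rejecting a rare password whose counters happen to collide with popular ones, which is the mild inconvenience described in the third example.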