Binary coding in stream

M Ghashami, A Abdullah - arXiv preprint arXiv:1503.06271, 2015 - arxiv.org
Big data is becoming ever more ubiquitous, ranging over massive video repositories, document corpora, image sets and Internet routing history. Proximity search and clustering are two algorithmic primitives fundamental to data analysis, but both suffer from the "curse of dimensionality" on these gigantic datasets. A popular approach to this problem is to convert object representations into short binary codewords while approximately preserving near-neighbor structure. However, there has been limited research on constructing codewords in the "streaming" or "online" settings often applicable to data of this scale, where one may make only a single pass over data too massive to fit in local memory. In this paper, we apply recent advances in matrix sketching techniques to construct binary codewords in both the streaming and online settings. In our experiments, the resulting codes compete with or outperform several of the most widely used algorithms, and we prove theoretical guarantees on performance in the streaming setting under mild assumptions on the data and the randomness of the training set.
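To make the idea of building binary codewords from a one-pass matrix sketch concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's algorithm: the abstract does not say which sketch is used, so this assumes the Frequent Directions sketch, and it assumes codes are formed by taking the sign of each point's projection onto the top directions of the sketch (a PCA-hashing-style rule). The names `frequent_directions` and `binary_codes`, and the parameters `ell` and `k`, are hypothetical, and the data is assumed to be roughly mean-centered.

```python
import numpy as np


def frequent_directions(rows, ell):
    """One-pass Frequent Directions sketch (assumed technique, see lead-in).

    Keeps a buffer of 2*ell rows. When the buffer is full, it is shrunk via
    an SVD so that at most ell rows remain non-zero.
    """
    B = None
    next_free = 0
    for x in rows:
        x = np.asarray(x, dtype=float)
        if B is None:
            B = np.zeros((2 * ell, x.shape[0]))
        if next_free == B.shape[0]:
            # Buffer full: shrink all squared singular values by the
            # (ell+1)-th largest one, which zeroes out rows ell and beyond.
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell] ** 2 if len(s) > ell else 0.0
            s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = np.zeros_like(B)
            B[: len(s)] = s_shrunk[:, None] * Vt
            next_free = ell
        B[next_free] = x
        next_free += 1
    return B


def binary_codes(X, B, k):
    """Map each row of X to a k-bit code using the sketch B.

    Assumed rule: bit j is 1 when the projection onto the j-th top right
    singular vector of B is positive. Requires k <= min(B.shape).
    """
    _, _, Vt = np.linalg.svd(B, full_matrices=False)
    proj = np.atleast_2d(X) @ Vt[:k].T
    return (proj > 0).astype(np.uint8)
```

Under these assumptions, the sketch is built from a single pass over the rows, and a point can be encoded at any time from the current sketch; Hamming distance between codes then serves as a cheap proxy for proximity.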