Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Parallelization of Massive Textstream Compression Based on Compressed Sensing

Published: 21 August 2017 Publication History

Abstract

Compressing textstreams generated by social networks can both reduce storage consumption and improve efficiency such as fast searching. However, the compression process is a challenge due to the large scale of textstreams. In this article, we propose a textstream compression framework based on compressed sensing theory and design a series of matching parallel procedures. The new approach uses a linear projection technique in the textstream compression process, achieving fast compression speed and low compression ratio. Two processes are executed by designing elaborated parallel procedures for efficient compressing and decompressing of large-scale textstreams. The decompression process is implemented for approximate solutions of underdetermined linear systems. Experimental results show that the new method can efficiently achieve the compression and decompression tasks on a large amount of text generated by social networks.

References

[1]
Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, and Gilad Mishne. 2008. Finding high-quality content in social media. In Proceedings of the International Conference on Web Search and Data Mining (WSDM’08). ACM Press, New York, NY, 183--193
[2]
Michal Aharon, Michael Elad, and Alfred M. Bruckstein. 2006. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Sign. Process. 54, 11 (2006), 4311--4322.
[3]
Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael Wakin. 2008. A simple proof of the restricted isometry property for random matrices. Construct. Approx. 28, 3 (2008), 253--263.
[4]
Ella Bingham and Heikki Mannila. 2001. Random projection in dimensionality reduction: Applications to image and text data. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’01). ACM Press, New York, NY, 245--250.
[5]
N. R. Brisaboa and A. Faria. 2010. Dynamic lightweight text compressionACM Trans. Inf. Syst. 28, 3 (2010), 1--32.
[6]
N. R. Brisaboa, A. Faria, G. Navarro, and J. R. Parama. 2008. New adaptive compressors for natural language text. Software Prac. Exper. 38, 13 (2008), 1429--1450.
[7]
Nieves R. Brisaboa, Antonio Faria, Gonzalo Navarro, and Jos R. Param. 2007. Lightweight natural language text compression. Inf. Retriev. 10, 1 (2007), 1--33.
[8]
Emmanuel Cands and Justin Romberg. 2007. Sparsity and incoherence in compressive sampling. Inverse Problems 23, 3 (2007), 969--985.
[9]
Scott Shaobing Chen, D. L. Donoho, and Michael A. Saunders. 1999. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20, 1 (1999), 33--61.
[10]
Chih Yi Chiu, Tsung Han Tsai, Guei Wun Han, Cheng Yu Hsieh, and Sheng Yang Li. 2013. Efficient video stream monitoring for near-duplicate detection and localization in a large-scale repository. ACM Trans. Inf. Syst. 31, 4 (2013), 84--87.
[11]
Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, and Ricardo Baeza-Yates. 2000. Fast and flexible word searching on compressed text. ACM Trans. Inf. Syst. 18, 2 (2000), 113--139.
[12]
Jeffrey Dean and Sanjay Ghemawat. 2008. Mapreduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113.
[13]
David L. Donoho. 2006. Compressed sensing. IEEE Trans. Inf. Theory 52, 4 (2006), 1289--1306.
[14]
Jiajia Huang, Min Peng, Hua Wang, Jinli Cao, Wang Gao, and Xiuzhen Zhang. 2017. A probabilistic method for emerging topic tracking in microblog stream. World Wide Web: Internet Web Inf. Syst. 20, 2 (2017), 325--350.
[15]
Christian S. Jensen, Dan Lin, Beng Chin Ooi, and Rui Zhang. 2006. Effective density queries on continuously moving objectsIn Proceedings of the 22nd International Conference on Data Engineering (ICDE’06). IEEE, Los Alamitos, CA, 71--81.
[16]
S. P. Kasiviswanathan, G. Cong, P. Melville, and R. D. Lawrence. 2013. Novel document detection for massive data streams using distributed dictionary learning. IBM J. Res. Dev. 57, 3--4 (2013), 1--15.
[17]
Shiva P. Kasiviswanathan, Huahua Wang, Arindam Banerjee, and Prem Melville. 2012. Online l1-dictionary learning with application to novel document detection. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS’12). MIT Press, Cambridge, MA, 2267--2275.
[18]
Daniel Kifer, Shai Ben-David, and Johannes Gehrke. 2004. Detecting change in data streams. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB’04). Morgan Kaufmann, San Mateo, CA, 180--191.
[19]
Nick Koudas, Beng Chin Ooi, Kian Lee Tan, and Rui Zhang. 2004. Approximate NN queries on streams with guaranteed error/performance bounds. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB’04). Morgan Kaufmann, San Mateo, CA, 804--815.
[20]
S. G. Mallat and Zhifeng Zhang. 1993. Matching pursuits with time-frequency dictionaries. IEEE Trans. Sign. Process. 41, 12 (1993), 3397--3415.
[21]
Zhongchen Miao, Kai Chen, Yi Fang, Jianhua He, Yi Zhou, Wenjun Zhang, and Hongyuan Zha. 2016. Cost-effective online trending topic detection and popularity prediction in microblogging. ACM Trans. Inf. Syst. 35, 3 (2016), 18:1--18:36.
[22]
Alistair Moffat. 1989. Word-based text compression. Software Prac. Exper. 19, 2 (1989), 185--198.
[23]
Ramesh Nallapati, Bowen Zhou, Cicero Nogueira Dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL’16). ACL, Stroudsburg, PA, 280--290.
[24]
Y. C. Pati, R. Rezaiifar, and P. S. Krishnaparsad. 1993. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals Systems and Computers (ACSSC’93). IEEE, Los Alamitos, CA, 40--44.
[25]
Min Peng, Jiajia Huang, and Hui Fu. 2013. High quality microblog extraction based on multiple features fusion and time-frequency transformation. In Proceedings of the 14th International Conference on Web Information System Engineering (WISE’13). Springer, New York, NY, 188--201.
[26]
Arno Siebes and Jilles Vreeken. 2006. Item sets that compress. In Proceedings of the 6th SIAM International Conference on Data Mining (SDM’06). SIAM, Philadelphia, PA, 393--404.
[27]
Xiaoxun Sun, Hua Wang, Jiuyong Li, and Jian Pei. 2011. Publishing anonymous survey rating data. Data Min. Knowl. Discov. 23, 3 (2011), 379--406.
[28]
Zhongyuan Wang, Fang Wang, Haixun Wang, Zhirui Hu, Jun Yan, Fangtao Li, Ji-Rong Wen, and Zhoujun Li. 2016. Unsupervised head--modifier detection in search queries. ACM Trans. Knowl. Discov. Data 11, 2 (2016), 19:1--19:28.
[29]
J. Wen, J. Y. Nie, and H. Zhang. 2002. Query clustering using user logs. ACM Trans. Inf. Syst. 20, 1 (2002), 59--81.
[30]
Yang Xiang, Ruoming Jin, David Fuhry, and Feodor F. Dragan. 2008. Succinct summarization of transactional databases: An overlapped hyper rectangle scheme. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08). ACM Press, New York, NY, 758--766.
[31]
Dong Xin, Jiawei Han, Xifeng Yan, and Hong Cheng. 2005. Mining compressed frequent-pattern sets. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB’05). ACM Press, New York, NY, 709--720.
[32]
Guandong Xu, Yanchun Zhang, and Lin Li. 2010. Web Mining and Social Networking: Techniques and Applications (1st ed.). Springer-Verlag New York, Inc., New York, NY.
[33]
Xintian Yang, Amol Ghoting, Yiye Ruan, and Srinivasan Parthasarathy. 2012. A framework for summarizing and analyzing Twitter feeds. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’12). ACM Press, New York, NY, 370--378.
[34]
Xintian Yang, Yiye Ruan, Srinivasan Parthasarathy, and Amol Ghoting. 2013. Summarization via pattern utility and ranking: A novel framework for social media data analytics. Bull. IEEE Comput. Soc. Techn. Comm. Data Eng. 36, 3 (2013), 67--76.
[35]
Rui Zhang, Nick Koudas, Beng Chin, and Ooi Divesh Srivastava. 2005. Multiple aggregations over data stream. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’05). ACM Press, New York, NY, 299--310.
[36]
Rui Zhang, Nick Koudas, Beng Chin Ooi, Divesh Srivastava, and Pu Zhou. 2010. Streaming multiple aggregations using phantoms. VLDB J. 19, 4 (2010), 57--583.
[37]
Rui Zhang, Dan Lin, Kotagiri Ramamohanarao, and Elisa Bertino. 2008. Continuous intersection joins over moving objects. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering (ICDE’08). IEEE, Los Alamitos, CA, 863--872.
[38]
Jacob Ziv and Abraham Lempel. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 3 (1977), 337--343.

Cited By

View all
  • (2023)Hierarchical Planning and Control for Box Loco-ManipulationProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/36069316:3(1-18)Online publication date: 24-Aug-2023
  • (2023)Physics-based Motion Retargeting from Sparse InputsProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/36069286:3(1-19)Online publication date: 24-Aug-2023
  • (2023)MAAIPProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/36069266:3(1-20)Online publication date: 24-Aug-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Information Systems
ACM Transactions on Information Systems  Volume 36, Issue 2
April 2018
371 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/3133943
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 August 2017
Accepted: 01 April 2017
Revised: 01 January 2017
Received: 01 September 2016
Published in TOIS Volume 36, Issue 2

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Text stream compression
  2. compressed sensing
  3. parallelization

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Natural Science Foundation of China

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)10
  • Downloads (Last 6 weeks)2
Reflects downloads up to 22 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Hierarchical Planning and Control for Box Loco-ManipulationProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/36069316:3(1-18)Online publication date: 24-Aug-2023
  • (2023)Physics-based Motion Retargeting from Sparse InputsProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/36069286:3(1-19)Online publication date: 24-Aug-2023
  • (2023)MAAIPProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/36069266:3(1-20)Online publication date: 24-Aug-2023
  • (2023)Motion In-Betweening with Phase ManifoldsProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/36069216:3(1-17)Online publication date: 24-Aug-2023
  • (2022)Learning Soccer Juggling Skills with Layer-wise Mixture-of-ExpertsACM SIGGRAPH 2022 Conference Proceedings10.1145/3528233.3530735(1-9)Online publication date: 27-Jul-2022
  • (2022)Generative GaitNetACM SIGGRAPH 2022 Conference Proceedings10.1145/3528233.3530717(1-9)Online publication date: 27-Jul-2022
  • (2022)DeepPhaseACM Transactions on Graphics10.1145/3528223.353017841:4(1-13)Online publication date: 22-Jul-2022
  • (2022)GANimatorACM Transactions on Graphics10.1145/3528223.353015741:4(1-12)Online publication date: 22-Jul-2022
  • (2021)Flexible Motion Optimization with Modulated Assistive ForcesProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/34801444:3(1-25)Online publication date: 27-Sep-2021
  • (2021)Efficient Hyperparameter Optimization for Physics-based Character AnimationProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/34512544:1(1-19)Online publication date: 28-Apr-2021
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media