DOI:10.1145/2245276.2245408
research-article

RetriBlog: a framework for creating blog crawlers

Published: 26 March 2012

Abstract

Blogs are becoming an important social tool. Through blogs, bloggers share their likes and dislikes, express their opinions, report news, and form groups around particular subjects. The information available in the blogosphere can therefore help in creating interesting applications in various domains, such as e-learning, e-commerce, and e-government. However, due to the increasing number of blogs posted every day on the Web and the dynamic nature of the blogosphere, collecting and extracting relevant information from blogs has become hard and time-consuming. In this paper, we use techniques from both the information retrieval and information extraction fields to deal with this problem. Since blogs have many points of variability, applications that can be easily adapted are needed. We present the RetriBlog system, a framework for developing blog crawlers that deals with the variations in blogs. This paper presents the details of RetriBlog and an evaluation of the proposed algorithms.


Published In

SAC '12: Proceedings of the 27th Annual ACM Symposium on Applied Computing
March 2012
2179 pages
ISBN:9781450308571
DOI:10.1145/2245276
  • Conference Chairs: Sascha Ossowski, Paola Lecca
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. blog crawler
  2. blogosphere
  3. information retrieval

Qualifiers

  • Research-article

Conference

SAC 2012: ACM Symposium on Applied Computing
March 26 - 30, 2012
Trento, Italy

Acceptance Rates

SAC '12 Paper Acceptance Rate 270 of 1,056 submissions, 26%;
Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

Cited By

  • (2021) inTIME: A Machine Learning-Based Framework for Gathering and Leveraging Web Data to Cyber-Threat Intelligence. Electronics 10:7 (818). DOI:10.3390/electronics10070818. Online publication date: 30-Mar-2021
  • (2019) A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence. 2019 IEEE World Congress on Services (SERVICES), pages 3-8. DOI:10.1109/SERVICES.2019.00016. Online publication date: Jul-2019
  • (2019) Using blog-like documents to investigate software practice: Benefits, challenges, and research directions. Journal of Software: Evolution and Process. DOI:10.1002/smr.2197. Online publication date: 29-Aug-2019
  • (2013) Framework for Blog Software in Web Application. Advances in Computing, Communication, and Control, pages 187-198. DOI:10.1007/978-3-642-36321-4_17. Online publication date: 2013
