
RetriBlog: a framework for creating blog crawlers

Authors:

Rafael Ferreira,

Henrique Pacca

SAC '12: Proceedings of the 27th Annual ACM Symposium on Applied Computing

Pages 696 - 701

https://doi.org/10.1145/2245276.2245408

Published: 26 March 2012

Abstract

Blogs are becoming an important social tool. Through blogs, bloggers share their likes and dislikes, express their opinions, report news, and form groups around particular subjects. The information available in the blogosphere can therefore help in the creation of interesting applications in various domains, such as e-learning, e-commerce, and e-government. However, due to the increasing number of blogs posted every day on the Web and the dynamic nature of the blogosphere, collecting and extracting relevant information from blogs has become hard and time-consuming. In this paper, we use techniques from both the information retrieval and information extraction fields to deal with this problem. Since blogs have many points of variability, it is necessary to provide applications that can be easily adapted. We present the RetriBlog system, a framework for developing blog crawlers that deals with the variations in blogs. This paper presents the details of RetriBlog and an evaluation of the proposed algorithms.






Information & Contributors

Information

Published In


SAC '12: Proceedings of the 27th Annual ACM Symposium on Applied Computing

March 2012

2179 pages

ISBN: 9781450308571

DOI: 10.1145/2245276

Conference Chairs:
Sascha Ossowski, University Rey Juan Carlos, Spain
Paola Lecca, The Microsoft Research - University of Trento COSBI, Italy

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGAPP: ACM Special Interest Group on Applied Computing

Publisher

Association for Computing Machinery

New York, NY, United States



Qualifiers

Research-article

Conference

SAC 2012

Sponsor:

SIGAPP

SAC 2012: ACM Symposium on Applied Computing

March 26 - 30, 2012

Trento, Italy

Acceptance Rates

SAC '12 Paper Acceptance Rate 270 of 1,056 submissions, 26%;

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%


Bibliometrics & Citations

Article Metrics

Total Citations: 4
Total Downloads: 130
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 24 Sep 2024

Citations

Cited By

Koloveas P, Chantzios T, Alevizopoulou S, Skiadopoulos S, Tryfonopoulos C (2021). inTIME: A Machine Learning-Based Framework for Gathering and Leveraging Web Data to Cyber-Threat Intelligence. Electronics, 10:7 (818). Online publication date: 30-Mar-2021.
https://doi.org/10.3390/electronics10070818
Koloveas P, Chantzios T, Tryfonopoulos C, Skiadopoulos S (2019). A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence. 2019 IEEE World Congress on Services (SERVICES), (3-8). Online publication date: Jul-2019.
https://doi.org/10.1109/SERVICES.2019.00016
Rainer A, Williams A (2019). Using blog‐like documents to investigate software practice: Benefits, challenges, and research directions. Journal of Software: Evolution and Process. Online publication date: 29-Aug-2019.
https://doi.org/10.1002/smr.2197
Gupta K, Goel A (2013). Framework for Blog Software in Web Application. Advances in Computing, Communication, and Control, (187-198). Online publication date: 2013.
https://doi.org/10.1007/978-3-642-36321-4_17
