DOI: 10.1145/2539150.2539161
Research article

An extensive study of Web robots traffic

Published: 02 December 2013

Abstract

The traffic produced by the periodic crawling activities of Web robots often represents a sizable fraction of the overall traffic of a website and can therefore have a non-negligible effect on its performance. Our study focuses on the traffic generated on the SPEC website by many different Web robots, including, among others, the robots employed by some popular search engines. This extensive investigation shows that the behavior and crawling patterns of the robots vary significantly in terms of the requests, resources and clients involved in their crawling activities. Some robots tend to concentrate their requests in short periods of time and follow somewhat deterministic patterns characterized by multiple peaks. The requests of other robots exhibit time-dependent behavior and repeated patterns with some periodicity. We represent the traffic as a time series modelled in the frequency domain. The identified models, consisting of trigonometric polynomials and Auto Regressive Moving Average (ARMA) components, accurately summarize the behavior of the overall traffic as well as the traffic of individual robots. These models can easily be used as a basis for forecasting.
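
As a rough illustration of the kind of model the abstract describes (a trigonometric polynomial for the periodic part plus an autoregressive component for what remains), the sketch below fits a trigonometric polynomial by least squares to a synthetic hourly request series and then fits a first-order autoregressive term to the residuals. Everything in it is an assumption made for illustration: the synthetic data, the 24-hour and 168-hour periods, the simplification of ARMA to AR(1), and the function names `trig_design`, `fit_trig_ar1` and `forecast`. It is not the authors' procedure and uses none of their data.

```python
import numpy as np


def trig_design(t, periods):
    """Design matrix: an intercept plus one cos/sin pair per period."""
    cols = [np.ones_like(t, dtype=float)]
    for p in periods:
        w = 2.0 * np.pi / p
        cols.append(np.cos(w * t))
        cols.append(np.sin(w * t))
    return np.column_stack(cols)


def fit_trig_ar1(y, periods):
    """Least-squares trigonometric fit, then an AR(1) fit on the residuals."""
    t = np.arange(len(y), dtype=float)
    X = trig_design(t, periods)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # AR(1) coefficient by least squares: resid[k] ~ phi * resid[k - 1]
    phi = float(resid[1:] @ resid[:-1] / (resid[:-1] @ resid[:-1]))
    return beta, phi, resid


def forecast(beta, phi, last_resid, start, steps, periods):
    """Periodic component plus the geometrically decaying AR(1) correction."""
    t = np.arange(start, start + steps, dtype=float)
    seasonal = trig_design(t, periods) @ beta
    ar_part = last_resid * phi ** np.arange(1, steps + 1)
    return seasonal + ar_part


# Synthetic hourly request counts (assumed, not measured): four weeks with a
# daily and a weekly cycle, plus AR(1) noise with coefficient 0.6.
rng = np.random.default_rng(0)
n = 24 * 28
t = np.arange(n)
eps = rng.normal(0.0, 5.0, n)
noise = np.zeros(n)
for k in range(1, n):
    noise[k] = 0.6 * noise[k - 1] + eps[k]
y = (200.0
     + 50.0 * np.cos(2.0 * np.pi * t / 24.0)
     + 20.0 * np.sin(2.0 * np.pi * t / 168.0)
     + noise)

PERIODS = (24.0, 168.0)  # assumed daily and weekly periods, in hours
beta, phi, resid = fit_trig_ar1(y, PERIODS)
pred = forecast(beta, phi, resid[-1], n, 24, PERIODS)

print("mean level:", round(float(beta[0]), 2))
print("daily cosine amplitude:", round(float(beta[1]), 2))
print("AR(1) coefficient:", round(phi, 3))
print("first forecast values:", np.round(pred[:3], 1))
```

In practice the candidate periods would be read off a periodogram of the observed series rather than fixed in advance, and a full ARMA fit (for example with a dedicated time series library) would replace the AR(1) shortcut used here.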



      Published In

      IIWAS '13: Proceedings of International Conference on Information Integration and Web-based Applications & Services
      December 2013
      753 pages
      ISBN: 9781450321136
      DOI: 10.1145/2539150
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      In-Cooperation

      • @WAS: International Organization of Information Integration and Web-based Applications and Services

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Web log analysis
      2. Web mining
      3. Web robot
      4. Web traffic characterization
      5. crawling pattern
      6. time series analysis

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      IIWAS '13

      Cited By

      • (2021) Good Bot, Bad Bot: Characterizing Automated Browsing Activity. 2021 IEEE Symposium on Security and Privacy (SP), pages 1589-1605. DOI: 10.1109/SP40001.2021.00079. Online publication date: May 2021.
      • (2018) Some (Non-)universal features of Web robot traffic. 2018 52nd Annual Conference on Information Sciences and Systems (CISS), pages 1-6. DOI: 10.1109/CISS.2018.8362266. Online publication date: March 2018.
      • (2016) Workload Characterization. ACM Computing Surveys, 48(3), pages 1-43. DOI: 10.1145/2856127. Online publication date: 8 February 2016.
      • (2015) Smart Crawler. Proceedings of the 21st Brazilian Symposium on Multimedia and the Web, pages 125-132. DOI: 10.1145/2820426.2820437. Online publication date: 27 October 2015.
      • (2014) Analysis of Header Usage Patterns of HTTP Request Messages. Proceedings of the 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC, CSS, ICESS), pages 847-853. DOI: 10.1109/HPCC.2014.146. Online publication date: 20 August 2014.
