Abstract
Web content grows every day. For a particular company or individual, however, some kinds of information matter more than others. For example, among the vast amount of information on the internet, web pages containing academic papers are far more attractive to a researcher. The problem lies in how to find that kind of data. We therefore design a spider that targets only online academic papers. While retaining the three major parts of a traditional spider, we modify the Filter and the Parser so that the spider can accomplish this task. The core mechanism for recognizing and extracting the expected pages relies primarily on keyword matching and finite state machine theory. After crawling two web sites, the spider successfully collects the desired information. The results suggest that, with further optimization and modification, this theme-based spider may work more efficiently or even be extended to other fields of interest.
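The abstract names keyword matching and a finite state machine as the recognition and extraction mechanism but gives no details, so the following is only a minimal sketch of how the two ideas could fit together, not the authors' implementation. The keyword list, the threshold `min_hits`, the state names, and the functions `looks_like_paper` and `extract_fields` are all hypothetical, and the page is assumed to be already reduced to plain text lines.

```python
import re
from enum import Enum, auto

# Hypothetical marker words; the paper does not publish its keyword list.
PAPER_KEYWORDS = ("abstract", "keywords", "introduction", "references")


def looks_like_paper(text, min_hits=3):
    """Keyword-matching filter: accept a page if enough marker words occur."""
    lowered = text.lower()
    hits = sum(
        1 for kw in PAPER_KEYWORDS
        if re.search(r"\b" + re.escape(kw) + r"\b", lowered)
    )
    return hits >= min_hits


class State(Enum):
    START = auto()       # waiting for the title line
    FRONT = auto()       # authors and affiliations, skipped
    ABSTRACT = auto()    # collecting abstract text
    BODY = auto()        # main text, skipped
    REFERENCES = auto()  # collecting reference entries


def extract_fields(lines):
    """Finite state machine over lines; section headings trigger transitions."""
    state = State.START
    title, abstract, references = "", [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        head = line.lower()
        if state is State.START:
            title = line  # assumption: first non-empty line is the title
            state = State.FRONT
        elif state is State.FRONT:
            if head == "abstract":
                state = State.ABSTRACT
        elif state is State.ABSTRACT:
            if head.startswith("keywords") or re.match(r"^\d*\.?\s*introduction\b", head):
                state = State.BODY
            else:
                abstract.append(line)
        elif state is State.BODY:
            if head == "references":
                state = State.REFERENCES
        else:  # State.REFERENCES
            references.append(line)
    return {
        "title": title,
        "abstract": " ".join(abstract),
        "references": references,
    }


if __name__ == "__main__":
    sample = [
        "Theme-Based Spider for Academic Paper",
        "Abstract",
        "We design a spider.",
        "Keywords: spider",
        "1 Introduction",
        "References",
        "Cheong, F.C.: Internet Agents.",
    ]
    text = "\n".join(sample)
    if looks_like_paper(text):
        print(extract_fields(sample))
```

In this sketch the filter decides cheaply whether a page is worth parsing, and the state machine then walks the page once, so each line is handled according to the section it falls in. A real spider would need a richer keyword set and transitions tolerant of heading variations.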
This work is supported by the Natural Science Foundation of China (Nos. 61472381, 61472382, 61572454 and 61174144), the MOE-Microsoft Key Laboratory of Multimedia Computing and Communication Foundation, and the Anhui Province Key Laboratory of Software in Computing and Communication.
Appendix: URLs of Search Engines
1. Google: http://www.google.cn
2. Baidu: http://www.baidu.com
3. Science Paper Online: http://www.paper.edu.cn
4.
5.
Copyright information
© 2017 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Yin, P., Shao, Q., Wang, X., Wang, W., Miao, F., Shao, C. (2017). Theme-Based Spider for Academic Paper. In: Yue, D., Peng, C., Du, D., Zhang, T., Zheng, M., Han, Q. (eds) Intelligent Computing, Networked Control, and Their Engineering Applications. ICSEE 2017, LSMS 2017. Communications in Computer and Information Science, vol 762. Springer, Singapore. https://doi.org/10.1007/978-981-10-6373-2_23
DOI: https://doi.org/10.1007/978-981-10-6373-2_23
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6372-5
Online ISBN: 978-981-10-6373-2
eBook Packages: Computer Science, Computer Science (R0)