Abstract
The emergence of academic search engines (mainly Google Scholar and Microsoft Academic Search) that aspire to index the entirety of current academic knowledge has revived and increased interest in the size of the academic web. The main objective of this paper is to propose various methods for estimating the current size (number of indexed documents) of Google Scholar (May 2014) and to assess the validity, precision and reliability of those methods. To do this, we present, apply and discuss three empirical methods: an external estimate based on empirical studies of Google Scholar coverage, and two internal estimate methods based on direct, empty and absurd queries, respectively. The results, despite providing disparate values, place the estimated size of Google Scholar at around 160–165 million documents. However, all the methods show considerable limitations and uncertainties due to inconsistencies in the Google Scholar search functionalities.
Notes
The Custom range option appears after a query is submitted in the Google Scholar search box. Users can also set the year range through the advanced search option, or execute the query directly in the browser via HTTP. Once the first results are obtained via hit count estimates, new queries can be generated without entering any keyword in the search box, selecting only the required time span. This is the procedure followed in this study.
Additional information about the biases of WoS towards English and article document type is available in the supplementary material (Appendix V).
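The procedure in the first note (an empty query restricted to a time span, issued directly over HTTP) can be sketched as follows. This is a minimal illustration, not the authors' script: the parameter names `as_ylo` and `as_yhi` are the year-range parameters that appear in Google Scholar result URLs and are an assumption here, since the notes do not name them, and the helper names `year_range_url` and `yearly_urls` are hypothetical. The sketch only builds the URLs and sends no requests.

```python
from urllib.parse import urlencode

# Base address of the Google Scholar results page.
BASE_URL = "https://scholar.google.com/scholar"


def year_range_url(start_year, end_year, query=""):
    """Build a results URL limited to [start_year, end_year].

    An empty `query` mirrors the keyword-free queries described in the note.
    `as_ylo` / `as_yhi` are assumed parameter names (not given in the source).
    """
    if start_year > end_year:
        raise ValueError("start_year must not exceed end_year")
    params = {"q": query, "as_ylo": start_year, "as_yhi": end_year}
    return f"{BASE_URL}?{urlencode(params)}"


def yearly_urls(first_year, last_year):
    """Return one single-year URL per year, keyed by year."""
    return {
        year: year_range_url(year, year)
        for year in range(first_year, last_year + 1)
    }


if __name__ == "__main__":
    # Print the URLs for a small, arbitrary span of years.
    for year, url in yearly_urls(2010, 2012).items():
        print(year, url)
```

Each URL can be opened in a browser to read the hit count estimate for that year; summing such estimates over a chosen period gives one size figure, subject to the hit count inconsistencies the abstract warns about.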
Cite this article
Orduna-Malea, E., Ayllón, J.M., Martín-Martín, A. et al. Methods for estimating the size of Google Scholar. Scientometrics 104, 931–949 (2015). https://doi.org/10.1007/s11192-015-1614-6
DOI: https://doi.org/10.1007/s11192-015-1614-6