Abstract
Scientists, governments, and companies increasingly publish datasets on the Web. Google’s Dataset Search extracts dataset metadata—expressed using schema.org and similar vocabularies—from Web pages in order to make datasets discoverable. Since we started the work on Dataset Search in 2016, the number of datasets described in schema.org has grown from 500K to almost 30M. Thus, this corpus has become a valuable snapshot of data on the Web. To the best of our knowledge, this corpus is the largest and most diverse of its kind. We analyze this corpus and discuss where the datasets originate from, what topics they cover, which form they take, and what people searching for datasets are interested in. Based on this analysis, we identify gaps and possible future work to help make data more discoverable.
Notes
- 2. Here and elsewhere, “domain” refers to “internet domain.”
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Benjelloun, O., Chen, S., Noy, N. (2020). Google Dataset Search by the Numbers. In: Pan, J.Z., et al. The Semantic Web – ISWC 2020. ISWC 2020. Lecture Notes in Computer Science, vol 12507. Springer, Cham. https://doi.org/10.1007/978-3-030-62466-8_41
DOI: https://doi.org/10.1007/978-3-030-62466-8_41
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62465-1
Online ISBN: 978-3-030-62466-8
eBook Packages: Computer Science, Computer Science (R0)