Google Dataset Search by the Numbers

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12507)

Included in the following conference series:

International Semantic Web Conference

Abstract

Scientists, governments, and companies increasingly publish datasets on the Web. Google’s Dataset Search extracts dataset metadata—expressed using schema.org and similar vocabularies—from Web pages in order to make datasets discoverable. Since we started the work on Dataset Search in 2016, the number of datasets described in schema.org has grown from 500K to almost 30M. Thus, this corpus has become a valuable snapshot of data on the Web. To the best of our knowledge, this corpus is the largest and most diverse of its kind. We analyze this corpus and discuss where the datasets originate from, what topics they cover, which form they take, and what people searching for datasets are interested in. Based on this analysis, we identify gaps and possible future work to help make data more discoverable.
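
To make the kind of metadata described in the abstract concrete, the following is a minimal sketch of pulling schema.org `Dataset` descriptions out of a Web page's JSON-LD markup. It is an illustration only and not the pipeline used by Dataset Search: the names `JsonLdCollector`, `extract_datasets`, and `SAMPLE_PAGE` are hypothetical, the sample page and its field values are invented for the example, and real pages may also carry markup in other syntaxes (for example Microdata or RDFa) that this sketch ignores.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_datasets(html):
    """Return the top-level JSON-LD objects whose @type includes "Dataset"."""
    collector = JsonLdCollector()
    collector.feed(html)
    datasets = []
    for raw in collector.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed markup instead of failing the whole page
        items = doc if isinstance(doc, list) else [doc]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "Dataset" in types:
                datasets.append(item)
    return datasets


# Hypothetical page used only for this sketch.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example rainfall measurements",
 "description": "Hypothetical dataset used only for this sketch.",
 "license": "https://creativecommons.org/licenses/by/4.0/"}
</script>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "name": "Not a dataset"}
</script>
</head><body></body></html>
"""

if __name__ == "__main__":
    for dataset in extract_datasets(SAMPLE_PAGE):
        print(dataset["name"], "|", dataset.get("license"))
```

The sketch only inspects top-level objects; descriptions nested under `@graph` or inside other entities would need additional traversal.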

Notes

1. http://datasetsearch.research.google.com.
2. Here and elsewhere, “domain” refers to “internet domain.”
3. http://datasearch.elsevier.com/.

References

Ben Ellefi, M., et al.: RDF dataset profiling–a survey of features, methods, vocabularies and applications. Semant. Web 9(5), 677–705 (2018)
Carbon, S., Champieux, R., McMurry, J.A., Winfree, L., Wyatt, L.R., Haendel, M.A.: An analysis and metric of reusable data licensing practices for biomedical resources. PLOS ONE 14(3) (2019). https://doi.org/10.1371/journal.pone.0213090
Chapman, A., et al.: Dataset search: a survey. VLDB J. 29(1), 251–272 (2019). https://doi.org/10.1007/s00778-019-00564-x
Fenner, M., Crosas, M., Grethe, J., et al.: A data citation roadmap for scholarly data repositories. bioRxiv (2017). https://doi.org/10.1101/097196
Datasets: Search for developers. https://developers.google.com/search/docs/data-types/dataset
Gray, A.J., Goble, C.A., Jimenez, R.: Bioschemas: from potato salad to protein annotation. In: International Semantic Web Conference (Posters, Demos & Industry Tracks) (2017)
Gregory, K., Groth, P., Scharnhorst, A., Wyatt, S.: Lost or found? Discovering data needed for research. Harvard Data Sci. Rev. (2020). https://doi.org/10.1162/99608f92.e38165eb
Guha, R.V., Brickley, D., Macbeth, S.: Schema.org: evolution of structured data on the web. Commun. ACM 59(2), 44–51 (2016)
Halevy, A., et al.: Goods: organizing Google’s datasets. In: ACM SIGMOD (2016)
Hendler, J., Holm, J., Musialek, C., Thomas, G.: US government linked open data: Semantic.data.gov. IEEE Intell. Syst. 27(3), 25–31 (2012). https://doi.org/10.1109/MIS.2012.27
Herschel, M., Diestelkämper, R., Lahmar, H.B.: A survey on provenance: what for? what form? what from? VLDB J. 26(6), 881–906 (2017)
Kindling, M., et al.: The landscape of research data repositories in 2015: a re3data analysis. D-Lib Mag. 23(3/4) (2017). https://doi.org/10.1045/march2017-kindling
Meusel, R., Bizer, C., Paulheim, H.: A web-scale study of the adoption and evolution of the schema.org vocabulary over time. In: International Conference on Web Intelligence, Mining and Semantics. ACM, New York (2015). https://doi.org/10.1145/2797115.2797124
Nargesian, F., Zhu, E., Pu, K.Q., Miller, R.J.: Table union search on open data. Proc. VLDB Endow. 11(7) (2018). https://doi.org/10.14778/3192965.3192973
Nature scientific data (2018). https://www.nature.com/sdata
Noy, N., Burgess, M., Brickley, D.: Google dataset search: building a search engine for datasets in an open web ecosystem. In: The Web Conference, pp. 1365–1375. ACM (2019). https://doi.org/10.1145/3308558.3313685
Noy, N., Gao, Y., Jain, A., Narayanan, A., Patterson, A., Taylor, J.: Industry-scale knowledge graphs: lessons and challenges. Commun. ACM 62(8), 36–43 (2019). https://doi.org/10.1145/3331166
RDF 1.1 Concepts and Abstract Syntax. https://www.w3.org/TR/rdf11-concepts/
Rueda, L., Fenner, M., Cruse, P.: Datacite: lessons learned on persistent identifiers for research data. IJDC 11(2), 39–47 (2016). https://doi.org/10.2218/ijdc.v11i2.421
Sansone, S.A., et al.: DATS, the data tag suite to enable discoverability of datasets. Sci. Data 4, 170059 (2017)
Schmachtenberg, M., Bizer, C., Paulheim, H.: Adoption of the linked data best practices in different topical domains. In: Mika, P., et al. (eds.) ISWC 2014. LNCS, vol. 8796, pp. 245–260. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11964-9_16
Stall, S., et al.: Make scientific data FAIR (2019)
Vrandečić, D.: Describing datasets in Wikidata. In: Advanced Knowledge Technologies for Science in a FAIR World, IEEE eScience Conference (2019)
Wang, J., Aryani, A., Wyborn, L., Evans, B.: Providing research graph data in JSON-LD Using Schema.org. In: 26th International Conference on World Wide Web Companion, pp. 1213–1218 (2017). https://doi.org/10.1145/3041021.3053052
Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., et al.: The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3, 1–9 (2016)
Wimalaratne, S.M., Juty, N., Kunze, J., Janée, G., et al.: Uniform resolution of compact identifiers for biomedical data. Sci. Data 5, 180029 (2018)

Author information

Authors and Affiliations

Google Research, New York, USA
Omar Benjelloun, Shiyu Chen & Natasha Noy

Corresponding author

Correspondence to Natasha Noy.

Editor information

Editors and Affiliations

University of Edinburgh, Edinburgh, UK
Jeff Z. Pan
University of Liverpool, Liverpool, UK
Valentina Tamma
University of Bari, Bari, Italy
Claudia d’Amato
University of California, Santa Barbara, Santa Barbara, CA, USA
Krzysztof Janowicz
California State University, Long Beach, Long Beach, CA, USA
Bo Fu
Vienna University of Economics and Business, Vienna, Austria
Axel Polleres
Rensselaer Polytechnic Institute, Troy, NY, USA
Oshani Seneviratne
Massachusetts Institute of Technology, Cambridge, MA, USA
Lalana Kagal

About this paper

Cite this paper

Benjelloun, O., Chen, S., Noy, N. (2020). Google Dataset Search by the Numbers. In: Pan, J.Z., et al. The Semantic Web – ISWC 2020. ISWC 2020. Lecture Notes in Computer Science, vol 12507. Springer, Cham. https://doi.org/10.1007/978-3-030-62466-8_41

DOI: https://doi.org/10.1007/978-3-030-62466-8_41
Published: 01 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62465-1
Online ISBN: 978-3-030-62466-8
eBook Packages: Computer Science, Computer Science (R0)
