Google Dataset Search by the Numbers

  • Conference paper
The Semantic Web – ISWC 2020 (ISWC 2020)

Abstract

Scientists, governments, and companies increasingly publish datasets on the Web. Google’s Dataset Search extracts dataset metadata—expressed using schema.org and similar vocabularies—from Web pages in order to make datasets discoverable. Since we started the work on Dataset Search in 2016, the number of datasets described in schema.org has grown from 500K to almost 30M. Thus, this corpus has become a valuable snapshot of data on the Web. To the best of our knowledge, this corpus is the largest and most diverse of its kind. We analyze this corpus and discuss where the datasets originate from, what topics they cover, which form they take, and what people searching for datasets are interested in. Based on this analysis, we identify gaps and possible future work to help make data more discoverable.
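As a rough illustration of what extracting schema.org dataset metadata from a Web page can involve, the sketch below pulls JSON-LD blocks typed as `Dataset` out of an HTML string. It is not the pipeline used by Dataset Search: the class `JsonLdCollector`, the function `extract_datasets`, and the sample page are hypothetical names and data introduced only for this example. It assumes the metadata is embedded as JSON-LD in `<script type="application/ld+json">` tags and ignores Microdata, RDFa, and nested `@graph` structures.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_datasets(html):
    """Return top-level JSON-LD objects whose @type includes "Dataset"."""
    collector = JsonLdCollector()
    collector.feed(html)
    datasets = []
    for raw in collector.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed markup instead of failing the whole page
        items = doc if isinstance(doc, list) else [doc]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "Dataset" in types:
                datasets.append(item)
    return datasets


# Hypothetical page used only for illustration.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset",
 "name": "Example rainfall measurements",
 "description": "Hypothetical dataset used only for illustration."}
</script>
</head><body></body></html>
"""

found = extract_datasets(SAMPLE_PAGE)
print([d["name"] for d in found])
```

The sketch relies only on the Python standard library so that it stays self-contained; a production crawler would need a more forgiving HTML parser and full JSON-LD processing.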

Notes

  1. http://datasetsearch.research.google.com.

  2. Here and elsewhere, “domain” refers to “internet domain.”

  3. http://datasearch.elsevier.com/.

Corresponding author

Correspondence to Natasha Noy.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Benjelloun, O., Chen, S., Noy, N. (2020). Google Dataset Search by the Numbers. In: Pan, J.Z., et al. The Semantic Web – ISWC 2020. ISWC 2020. Lecture Notes in Computer Science, vol 12507. Springer, Cham. https://doi.org/10.1007/978-3-030-62466-8_41

  • DOI: https://doi.org/10.1007/978-3-030-62466-8_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-62465-1

  • Online ISBN: 978-3-030-62466-8

  • eBook Packages: Computer Science, Computer Science (R0)
