A Quantitative Evaluation of Dissemination-Time Preservation Metadata

Joan A. Smith¹ &
Michael L. Nelson¹

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5173)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

One of many challenges facing web preservation efforts is the lack of metadata available for web resources. In prior work, we proposed a model that takes advantage of a site’s own web server to prepare its resources for preservation. When responding to a request from an archiving repository, the server applies a series of metadata utilities, such as Jhove and Exif, to the requested resource. The output from each utility is included in the HTTP response along with the resource itself. This paper addresses the question of feasibility: Is it in fact practical to use the site’s web server as a just-in-time metadata generator, or does the extra processing create an unacceptable deterioration in server responsiveness to quotidian events? Our tests indicate that (a) this approach can work effectively for both the crawler and the server; and that (b) utility selection is an important factor in overall performance.
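
The model summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two utilities below (`byte_count`, `sha256_digest`) are hypothetical in-process stand-ins for external tools such as Jhove and Exif, and `build_response` and the dictionary layout are assumed names chosen for illustration. The sketch runs each utility over the requested resource, bundles every output with the resource itself, and records per-utility timing, since the abstract identifies utility selection as a factor in overall performance.

```python
import hashlib
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for external metadata utilities (e.g. Jhove, Exif).
# A real deployment would invoke the actual tools; these only mimic the shape.
def byte_count(data: bytes) -> str:
    return str(len(data))

def sha256_digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

Utility = Tuple[str, Callable[[bytes], str]]

def build_response(resource: bytes, utilities: List[Utility]) -> Dict[str, object]:
    """Apply each utility to the resource and bundle the outputs with it."""
    metadata: Dict[str, str] = {}
    timings: Dict[str, float] = {}
    for name, run in utilities:
        start = time.perf_counter()
        metadata[name] = run(resource)
        # Per-utility cost, in seconds, so slow utilities can be identified.
        timings[name] = time.perf_counter() - start
    return {"resource": resource, "metadata": metadata, "timings": timings}

response = build_response(
    b"hello",
    [("byte_count", byte_count), ("sha256", sha256_digest)],
)
```

Comparing the entries in `response["timings"]` is one simple way to see which utilities dominate the added response time; the measurements reported in the paper itself are not reproduced here.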

Author information

Authors and Affiliations

C.S. Dept, Old Dominion University, Norfolk, VA 23529
Joan A. Smith & Michael L. Nelson

Editor information

Birte Christensen-Dalsgaard, Donatella Castelli, Bolette Ammitzbøll Jurik, Joan Lippincott

Cite this paper

Smith, J.A., Nelson, M.L. (2008). A Quantitative Evaluation of Dissemination-Time Preservation Metadata. In: Christensen-Dalsgaard, B., Castelli, D., Ammitzbøll Jurik, B., Lippincott, J. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2008. Lecture Notes in Computer Science, vol 5173. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87599-4_36

DOI: https://doi.org/10.1007/978-3-540-87599-4_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87598-7
Online ISBN: 978-3-540-87599-4
eBook Packages: Computer Science, Computer Science (R0)
