Abstract
One of the many challenges facing web preservation efforts is the lack of metadata available for web resources. In prior work, we proposed a model that uses a site’s own web server to prepare its resources for preservation. When responding to a request from an archiving repository, the server applies a series of metadata utilities, such as Jhove and Exif, to the requested resource and includes the output of each utility in the HTTP response along with the resource itself. This paper addresses the question of feasibility: is it practical to use the site’s web server as a just-in-time metadata generator, or does the extra processing unacceptably degrade the server’s responsiveness to routine requests? Our tests indicate that (a) the approach can work effectively for both the crawler and the server, and (b) utility selection is an important factor in overall performance.
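The dissemination-time idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`run_utility`, `build_response`), the dictionary layout of the response, and the stand-in utility are all assumptions made for illustration. A real deployment would substitute the actual command lines for tools such as Jhove or Exif, which are deliberately not guessed at here. The per-utility timing is recorded because the choice of utilities is the factor the abstract identifies as affecting performance.

```python
import base64
import subprocess
import sys
import time


def run_utility(name, argv, timeout=10.0):
    """Run one metadata utility; capture its output, exit status, and wall-clock time."""
    start = time.perf_counter()
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        output, status = proc.stdout, proc.returncode
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing or hung utility should not prevent the resource from being served.
        output, status = str(exc), -1
    return {
        "utility": name,
        "status": status,
        "output": output,
        "seconds": time.perf_counter() - start,
    }


def build_response(path, utilities):
    """Bundle the resource with the output of every configured utility.

    `utilities` is a list of (name, make_argv) pairs, where make_argv(path)
    returns the command line to run against the resource (hypothetical layout).
    """
    with open(path, "rb") as fh:
        data = fh.read()
    metadata = [run_utility(name, make_argv(path)) for name, make_argv in utilities]
    return {"resource": base64.b64encode(data).decode("ascii"), "metadata": metadata}


# Stand-in utility (an assumption): reports the file size using the Python
# interpreter itself, so the sketch runs without any external tools installed.
STAND_IN_UTILITIES = [
    (
        "byte-count",
        lambda p: [
            sys.executable,
            "-c",
            "import os, sys; print(os.path.getsize(sys.argv[1]))",
            p,
        ],
    ),
]

if __name__ == "__main__":
    response = build_response(__file__, STAND_IN_UTILITIES)
    for record in response["metadata"]:
        print(record["utility"], record["status"], f"{record['seconds']:.3f}s")
```

Because each utility runs as a separate process on every archival request, summing the `seconds` fields gives a rough view of the extra cost each utility adds to a response.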
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Smith, J.A., Nelson, M.L. (2008). A Quantitative Evaluation of Dissemination-Time Preservation Metadata. In: Christensen-Dalsgaard, B., Castelli, D., Ammitzbøll Jurik, B., Lippincott, J. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2008. Lecture Notes in Computer Science, vol 5173. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87599-4_36
DOI: https://doi.org/10.1007/978-3-540-87599-4_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87598-7
Online ISBN: 978-3-540-87599-4
eBook Packages: Computer Science (R0)