Abstract
A growing volume of valuable content, especially cultural heritage material, is being digitized and stored in digital archives, and digitization and archiving can require considerable effort. To meet the requirements of a digital archiving system, several issues must be addressed. First, each individual digital archive usually needs its own computation and storage resources to maintain its service. Second, the archived content would be more useful if it could be easily used to provide services such as searching across multiple archives. Current approaches usually adopt metadata harvesting, which builds a centralized index from separate digital libraries, and they usually suffer from metadata inconsistency. In this paper, we propose a distributed indexing approach to collaborative content-based multimedia retrieval across digital archives. To reduce the load on each archive, we dynamically distribute the tasks of crawling, indexing, and query processing depending on the response time. A distributed crawler-based approach can simplify the design of the indexing and query processing steps by keeping the data to be indexed local to the machine that crawls it. It can facilitate efficient archiving and indexing by automatically following the link structure of content published on the Web. It also enables simpler implementation and easier support for cross-archive applications such as search and copy detection. Experimental results show the potential of the proposed approach for load balancing with appropriate task distribution.
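The idea of distributing tasks depending on response time can be illustrated with a minimal sketch. This is not the paper's algorithm: the class `ArchiveNode`, the functions `pick_node` and `assign`, the exponential moving average, and the starting estimate of 100 ms are all assumptions made for illustration. The sketch simply routes each crawling, indexing, or query task to whichever archive currently has the lowest estimated response time.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveNode:
    """A participating archive (hypothetical name, not from the paper)."""
    name: str
    avg_response_ms: float = 100.0  # assumed starting estimate
    assigned: list = field(default_factory=list)

    def record_response(self, elapsed_ms: float, alpha: float = 0.5) -> None:
        # Update the estimate with an exponential moving average
        # (an assumed smoothing choice, not taken from the paper).
        self.avg_response_ms = (
            alpha * elapsed_ms + (1 - alpha) * self.avg_response_ms
        )


def pick_node(nodes):
    # Choose the node with the lowest estimated response time.
    return min(nodes, key=lambda n: n.avg_response_ms)


def assign(task, nodes):
    # Hand the task (crawl, index, or query) to the fastest node.
    node = pick_node(nodes)
    node.assigned.append(task)
    return node
```

In use, each node would call `record_response` after finishing a task, so a slow or overloaded archive gradually receives fewer tasks. A fuller design would presumably also account for queue length and task type, which this sketch ignores.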
Notes
http://www.ndap.org.tw/ is the homepage for the first phase of the digital archives project in Taiwan. The second-phase projects are available at http://www.teldap.tw/
Acknowledgment
We would like to thank the National Science Council, Taiwan, for its support under grant number NSC101-2219-E-027-005.
Cite this article
Wang, JH., Chang, HC. CoBITs: a distributed indexing approach to collaborative content-based multimedia retrieval across digital archives. Multimed Tools Appl 74, 2639–2658 (2015). https://doi.org/10.1007/s11042-013-1461-5
DOI: https://doi.org/10.1007/s11042-013-1461-5