Abstract
This paper presents data mining-based techniques for enabling data integration across deep web data sources. We target query processing across inter-dependent data sources, where the output of one source can serve as the input to another. Thus, besides input-input and output-output matching of attributes, we also need to consider input-output matching. We develop data mining techniques that discover instances for querying deep web data sources, drawing both on the information provided by the query interfaces themselves and on the output pages of related data sources, which are obtained by query probing with dynamically identified input instances. Then, using a hierarchical representation of schemas and applying clustering techniques, we generate schema matches. We demonstrate the effectiveness of our technique by integrating 24 query interfaces.
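To make the idea of instance-based matching concrete, the following is a minimal sketch and not the paper's algorithm: it assumes each input or output attribute already has a set of discovered instances, scores attribute pairs by Jaccard overlap of those sets, and groups attributes with a simple single-link clustering. The attribute names, the sample instances, the `threshold` value, and the helper names `jaccard` and `cluster_attributes` are all hypothetical, chosen only for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two instance collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_attributes(instances, threshold=0.5):
    """Group attributes whose instance sets overlap (single-link, union-find).

    `instances` maps an attribute name to its set of discovered instances.
    Two attributes are linked when their Jaccard similarity is >= threshold.
    """
    parent = {name: name for name in instances}

    def find(x):
        # Follow parent pointers to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(instances, 2):
        if jaccard(instances[a], instances[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in instances:
        groups.setdefault(find(name), set()).add(name)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical attributes: "in" marks a query-interface input,
# "out" marks an attribute extracted from an output page.
instances = {
    "srcA.in.gene": {"BRCA1", "TP53", "EGFR"},
    "srcB.out.gene_symbol": {"BRCA1", "TP53", "KRAS"},
    "srcB.in.chromosome": {"1", "2", "X"},
    "srcC.out.chr": {"1", "2", "X", "Y"},
}

groups = cluster_attributes(instances, threshold=0.5)
for group in groups:
    print(group)
```

In this toy data, each resulting group pairs an input attribute of one source with an output attribute of another, which is the kind of input-output match that inter-dependent sources require. A real system would need far richer evidence than raw set overlap.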
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, T., Wang, F., Agrawal, G. (2010). Instance Discovery and Schema Matching with Applications to Biological Deep Web Data Integration. In: Lambrix, P., Kemp, G. (eds) Data Integration in the Life Sciences. DILS 2010. Lecture Notes in Computer Science, vol 6254. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15120-0_12
DOI: https://doi.org/10.1007/978-3-642-15120-0_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15119-4
Online ISBN: 978-3-642-15120-0
eBook Packages: Computer Science (R0)