research-article

Optimal algorithms for crawling a hidden database in the web

Authors:

Xin JinAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 5, Issue 11

Pages 1112 - 1123

https://doi.org/10.14778/2350229.2350232

Published: 01 July 2012 Publication History

Abstract

A hidden database refers to a dataset that an organization makes accessible on the web by allowing users to issue queries through a search interface. In other words, data acquisition from such a source is not by following static hyper-links. Instead, data are obtained by querying the interface, and reading the result page dynamically generated. This, with other facts such as the interface may answer a query only partially, has prevented hidden databases from being crawled effectively by existing search engines.

This paper remedies the problem by giving algorithms to extract all the tuples from a hidden database. Our algorithms are provably efficient, namely, they accomplish the task by performing only a small number of queries, even in the worst case. We also establish theoretical results indicating that these algorithms are asymptotically optimal -- i.e., it is impossible to improve their efficiency by more than a constant factor. The derivation of our upper and lower bound results reveals significant insight into the characteristics of the underlying problem. Extensive experiments confirm the proposed techniques work very well on all the real datasets examined.

References

[1]

E. Agichtein, P. G. Ipeirotis, and L. Gravano. Modeling query-based access to text databases. In WebDB, pages 87--92, 2003.

[2]

M. Alvarez, J. Raposo, A. Pan, F. Cacheda, F. Bellas, and V. Carneiro. Crawling the content hidden behind web forms. In ICCSA, pages 322--333, 2007.

Digital Library

[3]

Z. Bar-Yossef and M. Gurevich. Efficient search engine measurements. In WWW, pages 401--410, 2007.

Digital Library

[4]

Z. Bar-Yossef and M. Gurevich. Mining search engine query logs via suggestion sampling. PVLDB, 1(1):54--65, 2008.

Digital Library

[5]

L. Barbosa and J. Freire. Siphoning hidden-web data through keyword-based interfaces. JIDM, 1(1):133--144, 2010.

[6]

N. Bruno, L. Gravano, and A. Marian. Evaluating top-k queries over web-accessible databases. In ICDE, pages 369--380, 2002.

Digital Library

[7]

A. Cali and D. Martinenghi. Querying the deep web. In EDBT, pages 724--727, 2010.

Digital Library

[8]

J. Callan and M. Connell. Query-based sampling of text databases. ACM TOIS, 19(2):97--130, 2001.

Digital Library

[9]

A. Dasgupta, X. Jin, B. Jewell, N. Zhang, and G. Das. Unbiased estimation of size and other aggregates over hidden web databases. In SIGMOD, pages 855--866, 2010.

Digital Library

[10]

E. Dragut, T. Kabisch, C. Yu, and U. Leser. A hierarchical approach to model web query interfaces for web source integration. PVLDB, 2(1):325--336, 2009.

Digital Library

[11]

E. Dragut, C. Yu, and W. Meng. Meaningful labeling of integrated query interfaces. In VLDB, pages 679--690, 2006.

Digital Library

[12]

B. He and K. Chang. Statistical schema matching across web query interfaces. In SIGMOD, pages 217--228, 2003.

Digital Library

[13]

B. He, K. Chang, and J. Han. Discovering complex matchings across web query interfaces: A correlation mining approach. In KDD, pages 148--157, 2004.

Digital Library

[14]

P. Ipeirotis and L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. In VLDB, pages 394--405, 2002.

Digital Library

[15]

X. Jin, N. Zhang, and G. Das. Attribute domain discovery for hidden web databases. In SIGMOD, pages 553--564, 2011.

Digital Library

[16]

S. Liddle, D. Embley, D. Scott, and S. Yau. Extracting data behind web forms. In ER (Workshops), pages 402--413, 2002.

[17]

J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A. Halevy. Google's deep-web crawl. PVLDB, 1(2):1241--1252, 2008.

Digital Library

[18]

A. Ntoulas, P. Zerfos, and J. Cho. Downloading textual hidden web content through keyword queries. In JCDL, pages 100--109, 2005.

Digital Library

[19]

S. Raghavan and H. Garcia-Molina. Crawling the hidden web. In VLDB, pages 129--138, 2001.

Digital Library

[20]

K. Vieira, L. Barbosa, J. Freire, and A. Soares da Silva. Siphon++: a hidden-webcrawler for keyword-based interfaces. In CIKM, pages 1361--1362, 2008.

Digital Library

[21]

Z. Zhang, B. He, and K. Chang. Understanding web query interfaces: best-effort parsing with hidden syntax. In SIGMOD, pages 107--118, 2004.

Digital Library

Cited By

Madan KBhatia R(2021)Ranked Deep Web Page Detection Using Reinforcement Learning and Query OptimizationInternational Journal on Semantic Web & Information Systems10.4018/IJSWIS.202110010617:4(99-121)Online publication date: 1-Oct-2021
https://dl.acm.org/doi/10.4018/IJSWIS.2021100106
Nargesian FAsudeh AJagadish H(2021)Tailoring data source distributions for fairness-aware data integrationProceedings of the VLDB Endowment10.14778/3476249.347629914:11(2519-2532)Online publication date: 27-Oct-2021
https://dl.acm.org/doi/10.14778/3476249.3476299
Hintzen SLiesy YZirpins C(2021)A third-party replication service for dynamic hidden databasesService Oriented Computing and Applications10.1007/s11761-020-00313-x15:4(323-338)Online publication date: 1-Dec-2021
https://dl.acm.org/doi/10.1007/s11761-020-00313-x
Show More Cited By

Recommendations

Topic-Sensitive hidden-web crawling
WISE'12: Proceedings of the 13th international conference on Web Information Systems Engineering

A constantly growing amount of high-quality information is stored in pages coming from the Hidden Web. Such pages are accessible only through a query interface that a Hidden-Web site provides and may span a variety of topics.

In order to provide ...
Crawling and searching the hidden web
Focused crawling for the hidden web

A constantly growing amount of high-quality information resides in databases and is guarded behind forms that users fill out and submit. The Hidden Web comprises all these information sources that conventional web crawlers are incapable of discovering. ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 5, Issue 11

July 2012

608 pages

ISSN:2150-8097

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 July 2012

Published in PVLDB Volume 5, Issue 11

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

19
Total Citations
View Citations
192
Total Downloads

Downloads (Last 12 months)8
Downloads (Last 6 weeks)2

Reflects downloads up to 18 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Madan KBhatia R(2021)Ranked Deep Web Page Detection Using Reinforcement Learning and Query OptimizationInternational Journal on Semantic Web & Information Systems10.4018/IJSWIS.202110010617:4(99-121)Online publication date: 1-Oct-2021
https://dl.acm.org/doi/10.4018/IJSWIS.2021100106
Nargesian FAsudeh AJagadish H(2021)Tailoring data source distributions for fairness-aware data integrationProceedings of the VLDB Endowment10.14778/3476249.347629914:11(2519-2532)Online publication date: 27-Oct-2021
https://dl.acm.org/doi/10.14778/3476249.3476299
Hintzen SLiesy YZirpins C(2021)A third-party replication service for dynamic hidden databasesService Oriented Computing and Applications10.1007/s11761-020-00313-x15:4(323-338)Online publication date: 1-Dec-2021
https://dl.acm.org/doi/10.1007/s11761-020-00313-x
Gao YZhang NSong H(2019)Social Security and Privacy for Social IoT Polymorphic Value SetSecurity and Communication Networks10.1155/2019/54983752019Online publication date: 28-Aug-2019
https://dl.acm.org/doi/10.1155/2019/5498375
Rekatsinas TDeshpande AParameswaran AZhu WTao DCheng XCui PRundensteiner ECarmel DHe QXu Yu J(2019)CRUXProceedings of the 28th ACM International Conference on Information and Knowledge Management10.1145/3357384.3357976(841-850)Online publication date: 3-Nov-2019
https://dl.acm.org/doi/10.1145/3357384.3357976
Kumar SGao HWang CChang KSundaram H(2019)Hierarchical multi-armed bandits for discovering hidden populationsProceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining10.1145/3341161.3342880(145-153)Online publication date: 27-Aug-2019
https://dl.acm.org/doi/10.1145/3341161.3342880
Wang PShea RWang JWu EBoncz PManegold SAilamaki ADeshpande AKraska T(2019)Progressive Deep Web Crawling Through Keyword Queries For Data EnrichmentProceedings of the 2019 International Conference on Management of Data10.1145/3299869.3319899(229-246)Online publication date: 25-Jun-2019
https://dl.acm.org/doi/10.1145/3299869.3319899
Hernández IRivero CRuiz D(2019)Deep Web crawlingWorld Wide Web10.1007/s11280-018-0602-122:4(1577-1610)Online publication date: 1-Jul-2019
https://dl.acm.org/doi/10.1007/s11280-018-0602-1
Zhou ZZhao PSheng VXu JLi ZWu JCui Z(2018)Efficient sampling methods for characterizing POIs on maps based on road networksFrontiers of Computer Science: Selected Publications from Chinese Universities10.1007/s11704-016-6146-612:3(582-592)Online publication date: 1-Jun-2018
https://dl.acm.org/doi/10.1007/s11704-016-6146-6
(2017)Result Merging for Structured Queries on the Deep Web with Active Relevance Weight EstimationInformation Systems10.1016/j.is.2016.06.00564:C(93-103)Online publication date: 1-Mar-2017
https://dl.acm.org/doi/10.1016/j.is.2016.06.005
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents