Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Optimal algorithms for crawling a hidden database in the web

Published: 01 July 2012 Publication History

Abstract

A hidden database refers to a dataset that an organization makes accessible on the web by allowing users to issue queries through a search interface. In other words, data acquisition from such a source is not by following static hyper-links. Instead, data are obtained by querying the interface, and reading the result page dynamically generated. This, with other facts such as the interface may answer a query only partially, has prevented hidden databases from being crawled effectively by existing search engines.
This paper remedies the problem by giving algorithms to extract all the tuples from a hidden database. Our algorithms are provably efficient, namely, they accomplish the task by performing only a small number of queries, even in the worst case. We also establish theoretical results indicating that these algorithms are asymptotically optimal -- i.e., it is impossible to improve their efficiency by more than a constant factor. The derivation of our upper and lower bound results reveals significant insight into the characteristics of the underlying problem. Extensive experiments confirm the proposed techniques work very well on all the real datasets examined.

References

[1]
E. Agichtein, P. G. Ipeirotis, and L. Gravano. Modeling query-based access to text databases. In WebDB, pages 87--92, 2003.
[2]
M. Alvarez, J. Raposo, A. Pan, F. Cacheda, F. Bellas, and V. Carneiro. Crawling the content hidden behind web forms. In ICCSA, pages 322--333, 2007.
[3]
Z. Bar-Yossef and M. Gurevich. Efficient search engine measurements. In WWW, pages 401--410, 2007.
[4]
Z. Bar-Yossef and M. Gurevich. Mining search engine query logs via suggestion sampling. PVLDB, 1(1):54--65, 2008.
[5]
L. Barbosa and J. Freire. Siphoning hidden-web data through keyword-based interfaces. JIDM, 1(1):133--144, 2010.
[6]
N. Bruno, L. Gravano, and A. Marian. Evaluating top-k queries over web-accessible databases. In ICDE, pages 369--380, 2002.
[7]
A. Cali and D. Martinenghi. Querying the deep web. In EDBT, pages 724--727, 2010.
[8]
J. Callan and M. Connell. Query-based sampling of text databases. ACM TOIS, 19(2):97--130, 2001.
[9]
A. Dasgupta, X. Jin, B. Jewell, N. Zhang, and G. Das. Unbiased estimation of size and other aggregates over hidden web databases. In SIGMOD, pages 855--866, 2010.
[10]
E. Dragut, T. Kabisch, C. Yu, and U. Leser. A hierarchical approach to model web query interfaces for web source integration. PVLDB, 2(1):325--336, 2009.
[11]
E. Dragut, C. Yu, and W. Meng. Meaningful labeling of integrated query interfaces. In VLDB, pages 679--690, 2006.
[12]
B. He and K. Chang. Statistical schema matching across web query interfaces. In SIGMOD, pages 217--228, 2003.
[13]
B. He, K. Chang, and J. Han. Discovering complex matchings across web query interfaces: A correlation mining approach. In KDD, pages 148--157, 2004.
[14]
P. Ipeirotis and L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. In VLDB, pages 394--405, 2002.
[15]
X. Jin, N. Zhang, and G. Das. Attribute domain discovery for hidden web databases. In SIGMOD, pages 553--564, 2011.
[16]
S. Liddle, D. Embley, D. Scott, and S. Yau. Extracting data behind web forms. In ER (Workshops), pages 402--413, 2002.
[17]
J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A. Halevy. Google's deep-web crawl. PVLDB, 1(2):1241--1252, 2008.
[18]
A. Ntoulas, P. Zerfos, and J. Cho. Downloading textual hidden web content through keyword queries. In JCDL, pages 100--109, 2005.
[19]
S. Raghavan and H. Garcia-Molina. Crawling the hidden web. In VLDB, pages 129--138, 2001.
[20]
K. Vieira, L. Barbosa, J. Freire, and A. Soares da Silva. Siphon++: a hidden-webcrawler for keyword-based interfaces. In CIKM, pages 1361--1362, 2008.
[21]
Z. Zhang, B. He, and K. Chang. Understanding web query interfaces: best-effort parsing with hidden syntax. In SIGMOD, pages 107--118, 2004.

Cited By

View all
  • (2021)Ranked Deep Web Page Detection Using Reinforcement Learning and Query OptimizationInternational Journal on Semantic Web & Information Systems10.4018/IJSWIS.202110010617:4(99-121)Online publication date: 1-Oct-2021
  • (2021)Tailoring data source distributions for fairness-aware data integrationProceedings of the VLDB Endowment10.14778/3476249.347629914:11(2519-2532)Online publication date: 27-Oct-2021
  • (2021)A third-party replication service for dynamic hidden databasesService Oriented Computing and Applications10.1007/s11761-020-00313-x15:4(323-338)Online publication date: 1-Dec-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 5, Issue 11
July 2012
608 pages

Publisher

VLDB Endowment

Publication History

Published: 01 July 2012
Published in PVLDB Volume 5, Issue 11

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)8
  • Downloads (Last 6 weeks)2
Reflects downloads up to 18 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2021)Ranked Deep Web Page Detection Using Reinforcement Learning and Query OptimizationInternational Journal on Semantic Web & Information Systems10.4018/IJSWIS.202110010617:4(99-121)Online publication date: 1-Oct-2021
  • (2021)Tailoring data source distributions for fairness-aware data integrationProceedings of the VLDB Endowment10.14778/3476249.347629914:11(2519-2532)Online publication date: 27-Oct-2021
  • (2021)A third-party replication service for dynamic hidden databasesService Oriented Computing and Applications10.1007/s11761-020-00313-x15:4(323-338)Online publication date: 1-Dec-2021
  • (2019)Social Security and Privacy for Social IoT Polymorphic Value SetSecurity and Communication Networks10.1155/2019/54983752019Online publication date: 28-Aug-2019
  • (2019)CRUXProceedings of the 28th ACM International Conference on Information and Knowledge Management10.1145/3357384.3357976(841-850)Online publication date: 3-Nov-2019
  • (2019)Hierarchical multi-armed bandits for discovering hidden populationsProceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining10.1145/3341161.3342880(145-153)Online publication date: 27-Aug-2019
  • (2019)Progressive Deep Web Crawling Through Keyword Queries For Data EnrichmentProceedings of the 2019 International Conference on Management of Data10.1145/3299869.3319899(229-246)Online publication date: 25-Jun-2019
  • (2019)Deep Web crawlingWorld Wide Web10.1007/s11280-018-0602-122:4(1577-1610)Online publication date: 1-Jul-2019
  • (2018)Efficient sampling methods for characterizing POIs on maps based on road networksFrontiers of Computer Science: Selected Publications from Chinese Universities10.1007/s11704-016-6146-612:3(582-592)Online publication date: 1-Jun-2018
  • (2017)Result Merging for Structured Queries on the Deep Web with Active Relevance Weight EstimationInformation Systems10.1016/j.is.2016.06.00564:C(93-103)Online publication date: 1-Mar-2017
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media