Stratified sampling for data mining on the deep web

Tantan Liu¹,
Fan Wang¹ &
Gagan Agrawal¹

389 Accesses
10 Citations
Explore all metrics

Abstract

In recent years, the deep web has become extremely popular. Like any other data source, data mining on the deep web can produce important insights or summaries of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly, and therefore, data mining must be performed by sampling the datasets. The samples, in turn, can only be obtained by querying deep web databases with specific inputs. In this paper, we target two related data mining problems, association mining and differential rulemining. These are proposed to extract high-level summaries of the differences in data provided by different deep web data sources in the same domain. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which recursively processes the query space of a deep web data source, and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experimental results show that our algorithms effectively and consistently reduce sampling costs, compared with a stratified sampling method that only considers estimation error. In addition, compared with simple random sampling, our algorithm has higher sampling accuracy and lower sampling costs.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Multiple-Phase Stratification-Based Hierarchical Clustering Over a Deep Web Data Source

Deep Learning Techniques in Big Data Analytics

Deep learning applications and challenges in big data analytics

Article Open access 24 February 2015

References

Bergman M K. White paper: the deep web: surfacing hidden value. Journal of Electronic Publishing, 2001, 7(1)
Braga D, Ceri S, Daniel F, Martinenghi D. Optimization of multidomain queries on the web. Proceedings of the VLDB Endowment, 2008, 1(1): 562–673
Google Scholar
Cali A, Martinenghi D. Querying data under access limitations. In: Proceedings of the 24th IEEE International Conference on Data Engineering. 2008, 50–59
Toivonen H. Sampling large databases for association rules. In: Proceedings of the 22nd International Conference on Very Large Data Bases. 1996, 134–145
Parthasarathy S. Efficient progressive sampling for association rules. In: Proceedings of the 2002 IEEE International Conference on Data Mining. 2002, 354–361
Chen B, Haas P, Scheuermann P. A new two-phase sampling based algorithm for discovering association rules. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002, 462–468
Boley M, Grosskreutz H. A randomized approach for approximating the number of frequent sets. In: Proceedings of the 8th IEEE International Conference on Data Mining. 2008, 43–52
Campagna A, Pagh R. Finding associations and computing similarity via biased pair sampling. Knowledge and Information Systems (in Press)
Bar-Yossef Z, Gurevich M. Mining search engine query logs via suggestion sampling. Proceedings of the VLDB Endowment, 2008, 1(1): 54–65
Google Scholar
Dasgupta A, Das G, Mannila H. A random walk approach to sampling hidden databases. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. 2007, 629–640
Dasgupta A, Zhang N, Das G. Leveraging count information in sampling hidden databases. In: Proceedings of the 25th IEEE International Conference on Data Engineering. 2009, 329–340
Liu B. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications). New York: Springer-Verlag, 2006
Google Scholar
Agrawal R, Srikant R. Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases. 1994, 487–499
Zaki M J. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 2000, 12(3): 372–390
Article MathSciNet Google Scholar
Salam A, Khayal M S H. Mining top-k frequent patterns without minimum support threshold. Knowledge and Information Systems, 2012, 30(1): 57–86
Article Google Scholar
Gaya M C, Giráldez J I. Merging local patterns using an evolutionary approach. Knowledge and Information Systems, 2011, 29(1): 1–24
Article Google Scholar
Wang F, Agrawal G, Jin R, Piontkivska H. Snpminer: a domainspecific deep web mining tool. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering. 2007, 192–199
Cochran W. Sampling Techniques. 3rd ed. New York: Wiley and Sons, 1977
MATH Google Scholar
Press W H, Farrar G R. Recursive stratified sampling for multidimensional Monte Carlo integration. Computers in Physics, 1990, 4(2): 190–195
Google Scholar
Caflisch R E. Monte Carlo and quasi-Monte Carlo methods. Acta Numerica, 1998, 7: 1–49
Article MathSciNet Google Scholar
Bay S D, Pazzani M J. Detecting group differences: mining contrast sets. Data Mining and Knowledge Discovery, 2001, 5(3): 213–246
Article MATH Google Scholar
Loekito E, Bailey J. Mining influential attributes that capture class and group contrast behaviour. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management. 2008, 971–980
Brin S, Motwani R, Silverstein C. Beyond market baskets: generalizing association rules to correlations. In: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data. 1997, 265–276
Rubin D B. Matching to remove bias in observational studies. Biometrics, 1973, 29(1): 159–183
Article Google Scholar
Anderson D W, Kish L, Cornell R G. On stratification, grouping and matching. Scandinavian Journal of Statistics, 1980, 7(2): 61–66
MathSciNet MATH Google Scholar
Sprinthall R C. Basic Statistical Analysis. 7th ed. Boston: Allyn and Bacon, 2003
Google Scholar
Pasquier N, Bastide Y, Taouil R, Lakhal L. Discovering frequent closed itemsets for association rules. In: Proceedings of the 7th International Conference on Database Theory. 1999, 398–416
Gunopulos D, Mannila H, Khardon R, Toivonen H. Data mining, hypergraph transversals, and machine learning (extended abstract). In: Proceedings of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. 1997, 209–216
Bayardo R J. Efficiently mining long patterns from databases. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. 1998, 85–93
Foreman E K. Survey Sampling Principles. New York: Marcel Dekker publishers, 1991
MATH Google Scholar
Dasgupta A, Jin X, Jewell B, Zhang N, Das G. Unbiased estimation of size and other aggregates over hidden web databases. In: Proceedings of the 2010 International Conference on Management of Data. 2010, 855–866
Jin R, Glimcher L, Jermaine C, Agrawal G. New sampling-based estimators for OLAP queries. In: Proceedings of the 22nd International Conference on Data Engineering. 2006
Afrati F N, Lekeas P V, Li C. Adaptive-sampling algorithms for answering aggregation queries on web sites. Data and Knowledge Engineering, 2008, 64(2): 462–490
Article Google Scholar
Joshi S, Jermaine C. Robust stratified sampling plans for low selectivity queries. In: Proceedings of the 24th International Conference on Data Engineering. 2008, 199–208
Wu M, Jermaine C. Guessing the extreme values in a data set: a Bayesian method and its applications. VLDB Journal, 2009, 18(2): 571–597
Article Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210, USA
Tantan Liu, Fan Wang & Gagan Agrawal

Authors

Tantan Liu
View author publications
You can also search for this author in PubMed Google Scholar
Fan Wang
View author publications
You can also search for this author in PubMed Google Scholar
Gagan Agrawal
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Tantan Liu.

Additional information

Tantan Liu received her BEng in Electronic and Information Engineering from University of Technology of China in 2004, and her MSc in Signal and Information Processing from Chinese Academy of Sciences in 2007. Currently, she is a PhD candidate in the department of Computer Science and Engineering of Ohio State University, under the supervision of Prof. Gagan Agrawal. Her research interests are data mining over the deep web and statistical learning.

Fan Wang received his BEng in Computer Science from Beijing University of Technology in 2001, his MSc in 2009 and his PhD in 2010 both in Computer Science from Ohio State University. Recently, he became a software development engineer in Microsoft Corp. Dr. Wang’s current research interests are real time web indexing, web search, and data mining.

Gagan Agrawal is a professor of computer science at Ohio State University. He received his bachelor’s degree from IIT Kanpur, and his MSc and PhD degrees from the University of Maryland. His research interests include high performance and data-intensive computing and data mining.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Liu, T., Wang, F. & Agrawal, G. Stratified sampling for data mining on the deep web. Front. Comput. Sci. 6, 179–196 (2012). https://doi.org/10.1007/s11704-012-2859-3

Download citation

Received: 30 May 2011
Accepted: 18 October 2011
Published: 31 March 2012
Issue Date: April 2012
DOI: https://doi.org/10.1007/s11704-012-2859-3

Stratified sampling for data mining on the deep web

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

A Multiple-Phase Stratification-Based Hierarchical Clustering Over a Deep Web Data Source

Deep Learning Techniques in Big Data Analytics

Deep learning applications and challenges in big data analytics

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Stratified sampling for data mining on the deep web

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

A Multiple-Phase Stratification-Based Hierarchical Clustering Over a Deep Web Data Source

Deep Learning Techniques in Big Data Analytics

Deep learning applications and challenges in big data analytics

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation