DOI: 10.1145/3475716.3484187
research-article

Web Application Testing: Using Tree Kernels to Detect Near-duplicate States in Automated Model Inference

Published: 11 October 2021

Abstract

Background: In the context of End-to-End testing of web applications, automated exploration techniques (a.k.a. crawling) are widely used to infer state-based models of the site under test. These models, in which states represent features of the web application and transitions represent reachability relationships, can be used for several model-based testing tasks, such as test case generation. However, current exploration techniques often produce models containing many near-duplicate states, i.e., states representing slightly different pages that are in fact instances of the same feature. This has a negative impact on the subsequent model-based testing tasks, adversely affecting, for example, the size, running time, and achieved coverage of generated test suites.

Aims: As a web page can be naturally represented by its tree-structured DOM, we propose a novel near-duplicate detection technique, based on Tree Kernel (TK) functions, to improve the model inference of web applications. TKs are a class of functions that compute the similarity between tree-structured objects; they have been extensively investigated and successfully applied in the Natural Language Processing domain.

Method: To evaluate the capability of the proposed approach in detecting near-duplicate web pages, we conducted preliminary classification experiments on a freely available, massive dataset of about 100k manually annotated web page pairs. We compared the classification performance of the proposed approach with that of other state-of-the-art near-duplicate detection techniques.

Results: Preliminary results show that our approach performs better than state-of-the-art techniques in the near-duplicate detection classification task.

Conclusions: These promising results show that TKs can be applied to near-duplicate detection in the context of web application model inference, and motivate further research in this direction to assess the impact of the technique on the quality of the inferred models and on the subsequent application of model-based testing techniques.
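
To make the idea of a tree kernel on DOM trees concrete, the following is a minimal Python sketch. It is not the authors' implementation: the abstract does not say which TK functions were used, so the sketch assumes a simple subtree kernel, one basic member of the tree kernel family, which counts the pairs of nodes (one from each tree) that root identical complete subtrees. Only tag names and child order are compared, and the names `subtree_kernel`, `normalized_similarity`, `page_a`, and `page_b` are hypothetical, chosen for illustration.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def subtree_signatures(root):
    """Count a canonical signature for every complete subtree.

    Assumption: only tag names and child order matter; text and
    attributes are ignored.
    """
    counts = Counter()

    def visit(node):
        # Signature of a node = its tag followed by its children's signatures.
        sig = "(" + node.tag + "".join(visit(child) for child in node) + ")"
        counts[sig] += 1
        return sig

    visit(root)
    return counts


def subtree_kernel(a, b):
    """Number of node pairs (one per tree) rooting identical complete subtrees."""
    counts_a, counts_b = subtree_signatures(a), subtree_signatures(b)
    return sum(n * counts_b[sig] for sig, n in counts_a.items())


def normalized_similarity(a, b):
    """Kernel value scaled to [0, 1]; 1.0 means structurally identical trees."""
    return subtree_kernel(a, b) / math.sqrt(subtree_kernel(a, a) * subtree_kernel(b, b))


# Two toy pages showing the same "item list" feature with different item counts.
# Well-formed markup is assumed so that ElementTree can parse it.
page_a = ET.fromstring("<html><body><ul><li/><li/></ul></body></html>")
page_b = ET.fromstring("<html><body><ul><li/><li/><li/></ul></body></html>")

print(normalized_similarity(page_a, page_b))
```

A score like this could be compared against a threshold, or fed to a classifier as a feature, to decide whether two crawled states are near-duplicates; either choice is an assumption here, since the abstract only states that classification experiments were run on annotated page pairs.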



Published In

ESEM '21: Proceedings of the 15th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)
October 2021
368 pages
ISBN:9781450386654
DOI:10.1145/3475716

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Model inference
  2. Model-based testing
  3. Near-duplicate detection
  4. Reverse engineering
  5. Tree kernels
  6. Web Application Testing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ESEM '21

Acceptance Rates

ESEM '21 Paper Acceptance Rate 24 of 124 submissions, 19%;
Overall Acceptance Rate 130 of 594 submissions, 22%
