research-article

DeepLSH: Deep Locality-Sensitive Hash Learning for Fast and Efficient Near-Duplicate Crash Report Detection

Authors:

Anes Bendimerad,

Romain Mathonat,

Mehdi KaytoueAuthors Info & Claims

ICSE '24: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering

Article No.: 198, Pages 1 - 12

https://doi.org/10.1145/3597503.3639146

Published: 12 April 2024 Publication History

Abstract

Automatic crash bucketing is a crucial phase in the software development process for efficiently triaging bug reports. It generally consists in grouping similar reports through clustering techniques. However, with real-time streaming bug collection, systems are needed to quickly answer the question: What are the most similar bugs to a new one?, that is, efficiently find near-duplicates. It is thus natural to consider nearest neighbors search to tackle this problem and especially the well-known locality-sensitive hashing (LSH) to deal with large datasets due to its sublinear performance and theoretical guarantees on the similarity search accuracy. Surprisingly, LSH has not been considered in the crash bucketing literature. It is indeed not trivial to derive hash functions that satisfy the so-called locality-sensitive property for the most advanced crash bucketing metrics. Consequently, we study in this paper how to leverage LSH for this task. To be able to consider the most relevant metrics used in the literature, we introduce DeepLSH, a Siamese DNN architecture with an original loss function, that perfectly approximates the locality-sensitivity property even for Jaccard and Cosine metrics for which exact LSH solutions exist. We support this claim with a series of experiments on an original dataset, which we make available.

References

[1]

Kevin Bartz, Jack W. Stokes, John C. Platt, Ryan Kivett, David Grant, Silviu Calinoiu, and Gretchen Loihle. 2008. Finding Similar Failures Using Callstack Similarity. In SysML.

[2]

Jon Louis Bentley. 1975. Multidimensional Binary Search Trees Used for Associative Searching. Commun. ACM 18, 9 (1975), 509--517.

Digital Library

[3]

Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1999. When Is "Nearest Neighbor" Meaningful?. In Database Theory - ICDT '99, 7th International Conference, 1999, Proceedings. 217--235.

[4]

Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. 1997. Syntactic Clustering of the Web. Comput. Networks 29, 8--13 (1997), 1157--1166.

[5]

Mark Brodie, Sheng Ma, Guy M. Lohman, Laurent Mignet, Natwar Modani, Mark Wilding, Jon Champlin, and Peter Sohn. 2005. Quickly Finding Known Software Problems via Automated Symptom Matching. In Second International Conference on Autonomic Computing (ICAC 2005). IEEE Computer Society, 101--110.

Digital Library

[6]

Moses Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings on 34th Annual ACM Symposium on Theory of Computing. 380--388.

Digital Library

[7]

Nick Craswell. 2009. Mean Reciprocal Rank. Springer US, 1703--1703.

[8]

Yingnong Dang, Rongxin Wu, Hongyu Zhang, Dongmei Zhang, and Peter Nobel. 2012. ReBucket: A method for clustering duplicate crash reports based on call stack similarity. In 34th International Conference on Software Engineering, ICSE. 1084--1093.

[9]

Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th ACM Symposium on Computational Geometry. 253--262.

[10]

Tejinder Dhaliwal, Foutse Khomh, and Ying Zou. 2011. Classifying field crash reports for fixing bugs: A case study of Mozilla Firefox. In IEEE 27th International Conference on Software Maintenance, ICSM. 333--342.

Digital Library

[11]

Thanh-Toan Do, Anh-Dzung Doan, and Ngai-Man Cheung. 2016. Learning to Hash with Binary Deep Neural Network. In Computer Vision - ECCV 2016 - 14th European Conference. 219--234.

[12]

Kirk Glerum, Kinshuman Kinshumann, Steve Greenberg, Gabriel Aul, Vince R. Orgovan, Greg Nichols, David Grant, Gretchen Loihle, and Galen C. Hunt. 2009. Debugging in the (very) large: ten years of implementation and experience. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles, SOSP. 103--116.

[13]

Kun He, Fatih Çakir, Sarah Adel Bargal, and Stan Sclaroff. 2018. Hashing as Tie-Aware Learning to Rank. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR. 4023--4032.

[14]

Weiming Hu, Yabo Fan, Junliang Xing, Liang Sun, Zhaoquan Cai, and Stephen J. Maybank. 2018. Deep Constrained Siamese Hash Coding Network and Load-Balanced Locality-Sensitive Hashing for Near Duplicate Image Detection. IEEE Trans. Image Process. 27, 9 (2018), 4452--4464.

Digital Library

[15]

Gao Huang, Chuan Guo, Matt J. Kusner, Yu Sun, Fei Sha, and Kilian Q. Weinberger. 2016. Supervised Word Mover's Distance. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems. 4862--4870.

[16]

Piotr Indyk and Rajeev Motwani. 1998. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing. 604--613.

Digital Library

[17]

Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, and Stephane Glondu. 2007. Deckard: Scalable and accurate tree-based detection of code clones. In 29th International Conference on Software Engineering (ICSE'07). IEEE, 96--105.

Digital Library

[18]

Norio Katayama and Shin'ichi Satoh. 1997. The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. In SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data. 369--380.

Digital Library

[19]

Maurice G Kendall. 1938. A new measure of rank correlation. Biometrika 30, 1/2 (1938), 81--93.

[20]

Sunghun Kim, Thomas Zimmermann, and Nachiappan Nagappan. 2011. Crash graphs: An aggregated view of multiple crashes to improve crash triage. In Proceedings of the 2011 IEEE/IFIP International Conference on Dependable Systems and Networks, DSN. 486--493.

Digital Library

[21]

Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan. 2015. Simultaneous feature learning and hash coding with deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. 3270--3278.

[22]

Johannes Lerch and Mira Mezini. 2013. Finding Duplicates of Your Yet Unwritten Bug Report. In 17th European Conference on Software Maintenance and Reengineering, CSMR. 69--78.

[23]

Qi Li, Zhenan Sun, Ran He, and Tieniu Tan. 2017. Deep Supervised Discrete Hashing. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems. 2482--2491.

[24]

Kevin Lin, Huei-Fang Yang, Jen-Hao Hsiao, and Chu-Song Chen. 2015. Deep learning of binary hash codes for fast image retrieval. In 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2015, Boston, MA, USA, June 7--12, 2015. IEEE Computer Society, 27--35.

[25]

Venice Erin Liong, Jiwen Lu, Gang Wang, Pierre Moulin, and Jie Zhou. 2015. Deep hashing for compact binary codes learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. 2475--2483.

[26]

Haomiao Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen. 2016. Deep Supervised Hashing for Fast Image Retrieval. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR. 2064--2072.

[27]

Lucene. [n. d.]. Lucene Apache. https://lucene.apache.org/

[28]

Xiao Luo, Chong Chen, Huasong Zhong, Hao Zhang, Minghua Deng, Jianqiang Huang, and Xiansheng Hua. 2020. A Survey on Deep Hashing Methods. CoRR abs/2003.03369 (2020).

[29]

Natwar Modani, Rajeev Gupta, Guy M. Lohman, Tanveer Fathima Syeda-Mahmood, and Laurent Mignet. 2007. Automatically Identifying Known Software Problems. In Proceedings of the 23rd International Conference on Data Engineering Workshops. IEEE Computer Society, 433--441.

Digital Library

[30]

Akira Moroo, Akiko Aizawa, and Takayuki Hamamoto. 2017. Reranking-based Crash Report Deduplication. In The 29th International Conference on Software Engineering and Knowledge Engineering, Xudong He (Ed.). KSI Research Inc. and Knowledge Systems Institute Graduate School, 507--510.

[31]

Mozilla. 2012. Mozilla Crash Reporter. https://crash-stats.mozilla.org/

[32]

Saul B Needleman and Christian D Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology 48, 3 (1970), 443--453.

[33]

Korosh Koochekian Sabor, Abdelwahab Hamou-Lhadj, and Alf Larsson. 2017. DURFEX: A Feature Extraction Technique for Efficient Detection of Duplicate Bug Reports. In 2017 IEEE International Conference on Software Quality, Reliability and Security, QRS 2017. 240--250.

[34]

Caitlin Sadowski and Greg Levin. 2007. Simhash: Hash-based similarity detection.

[35]

Mark Sanderson. 2010. Test Collection Based Evaluation of Information Retrieval Systems. Found. Trends Inf. Retr. 4, 4 (2010), 247--375.

[36]

Adrian Schröter, Nicolas Bettenburg, and Rahul Premraj. 2010. Do stack traces help developers fix bugs?. In Proceedings of the 7th International Working Conference on Mining Software Repositories, MSR 2010 (Co-located with ICSE). 118--121.

[37]

Fran Silavong, Sean Moran, Antonios Georgiadis, Rohan Saphal, and Robert Otter. 2022. Senatus: a fast and accurate code-to-code recommendation engine. In Proceedings of the 19th International Conference on Mining Software Repositories. 511--523.

Digital Library

[38]

Roman Vasiliev, Dmitrij V. Koznov, George A. Chernishev, Aleksandr Khvorov, Dmitry V. Luciv, and Nikita Povarov. 2020. TraceSim: a method for calculating stack trace similarity. In Proceedings of the 4th ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation, FSE 2020. 25--30.

Digital Library

[39]

Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. 2014. Hashing for Similarity Search: A Survey. CoRR abs/1408.2927 (2014).

[40]

Xiaofang Wang, Yi Shi, and Kris M. Kitani. 2016. Deep Supervised Hashing with Triplet Labels. In Computer Vision - ACCV 2016 - 13th Asian Conference on Computer Vision. 70--84.

[41]

Ziming Zhang, Yuting Chen, and Venkatesh Saligrama. 2016. Efficient Training of Very Deep Neural Networks for Supervised Hashing. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27--30, 2016. IEEE Computer Society, 1487--1495.

Cited By

Xu YLi YXie QZhang XLiu CBach TKim SKang S(2024)CrashChecker: A Fusion Method for Clustering Duplicate Crash Failures in SAP HANA Delivery2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops (ISSREW)10.1109/ISSREW63542.2024.00044(37-42)Online publication date: 28-Oct-2024
https://doi.org/10.1109/ISSREW63542.2024.00044

Index Terms

DeepLSH: Deep Locality-Sensitive Hash Learning for Fast and Efficient Near-Duplicate Crash Report Detection

Recommendations

Locality-Sensitive Hashing for Chi2 Distance

In the past 10 years, new powerful algorithms based on efficient data structures have been proposed to solve the problem of Nearest Neighbors search (or Approximate Nearest Neighbors search). If the Euclidean Locality Sensitive Hashing algorithm, which ...
Query-aware locality-sensitive hashing scheme for $$l_p$$lp norm

The problem of c-Approximate Nearest Neighbor (c-ANN) search in high-dimensional space is fundamentally important in many applications, such as image database and data mining. Locality-Sensitive Hashing (LSH) and its variants are the well-known indexing ...
On locality-sensitive indexing in generic metric spaces
SISAP '10: Proceedings of the Third International Conference on SImilarity Search and APplications

The concept of Locality-sensitive Hashing (LSH) has been successfully used for searching in high-dimensional data and a number of locality-preserving hash functions have been introduced. In order to extend the applicability of the LSH approach to a ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

ICSE '24: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering

May 2024

2942 pages

ISBN:9798400702174

DOI:10.1145/3597503

Co-chairs:
Ana Paiva,
Rui Abreu,
Program Co-chairs:
Abhik Roychoudhury,
Margaret Storey

Copyright © 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

Faculty of Engineering of University of Porto

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 April 2024

Check for updates

Author Tags

Qualifiers

Research-article

Conference

ICSE '24

Sponsor:

SIGSOFT

ICSE '24: IEEE/ACM 46th International Conference on Software Engineering

April 14 - 20, 2024

Lisbon, Portugal

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Upcoming Conference

ICSE 2025

2025 IEEE/ACM 46th International Conference on Software Engineering

April 26 - May 3, 2025

Ottawa , ON , Canada

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
139
Total Downloads

Downloads (Last 12 months)139
Downloads (Last 6 weeks)37

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Xu YLi YXie QZhang XLiu CBach TKim SKang S(2024)CrashChecker: A Fusion Method for Clustering Duplicate Crash Failures in SAP HANA Delivery2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops (ISSREW)10.1109/ISSREW63542.2024.00044(37-42)Online publication date: 28-Oct-2024
https://doi.org/10.1109/ISSREW63542.2024.00044

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten