Semantic Metadata-enhanced Deep Learning Techniques for Duplicate Bug Report Detection

Sandunika, D.M.N.; Herath, G.A.C.A.

ComURS2026 Computing Undergraduate Research Symposium : Abstracts

dc.contributor.author	Sandunika, D.M.N.
dc.contributor.author	Herath, G.A.C.A.
dc.date.accessioned	2026-06-04T08:59:28Z
dc.date.available	2026-06-04T08:59:28Z
dc.date.issued	2026-01-28
dc.identifier.isbn	978-624-5727-44-5
dc.identifier.uri	http://repo.lib.sab.ac.lk:8080/xmlui/handle/susl/5328
dc.description.abstract	Duplicate bug reports significantly reduce software development efficiency, with 12-25% of bug reports being duplicates in large projects. Manual identification is both time-consuming and error-prone. This study examines whether semantic metadata that encodes content-level similarities, combined with simplified deep learning architectures, can be more effective than relying on complex models. We aim to examine existing practices and identify their weaknesses, to investigate and demonstrate the usefulness of semantic content-based metadata for robust identification, and to design and test various deep learning models to determine which methods achieve the highest performance. Existing studies reveal that machine learning methods outperform traditional information-retrieval techniques by a significant margin, while deep learning approaches achieve higher accuracy but often suffer from limited feature diversity. We collected 535,477 quality pairs and split them 70%-10%-20% for training, validation, and testing. Our DistilBERT-based feature engineering yielded a mean correlation of 0.1888, surpassing traditional metadata. We compared six architectures, namely LSTM, CNN, LSTM+Metadata, Hybrid+Attention, Hybrid+Attention with Metadata, and the proposed LSTM+CNN+Metadata, evaluated using accuracy, precision, recall, F1-score, and AUC-ROC. The proposed architecture achieved a 92.53% F1-score, outperforming the more complex attention-based models. LSTM outperformed CNN, which suggests that contextual processing is better suited to this task. These results indicate that feature quality is more critical to model performance than model complexity, and they present a cost-effective, accurate, and scalable method for automated duplicate bug detection in real-world applications.	en_US
dc.language.iso	en	en_US
dc.publisher	Faculty of Computing. Sabaragamuwa University of Sri Lanka.	en_US
dc.subject	Deep Learning	en_US
dc.subject	Duplicate Bug Detection	en_US
dc.subject	LSTM Networks	en_US
dc.subject	Semantic Metadata	en_US
dc.subject	Software Maintenance	en_US
dc.title	Semantic Metadata-enhanced Deep Learning Techniques for Duplicate Bug Report Detection	en_US
dc.type	Article	en_US