Semantic Metadata-enhanced Deep Learning Techniques for Duplicate Bug Report Detection

Sandunika, D.M.N.; Herath, G.A.C.A.



URI: http://repo.lib.sab.ac.lk:8080/xmlui/handle/susl/5328

Date: 2026-01-28

Abstract:

Duplicate bug reports significantly reduce software development efficiency, with 12-25% of bug reports in large projects being duplicates. Manual identification is both time-consuming and error-prone. This study examines whether semantic metadata that encodes content-level similarities, combined with simplified deep learning architectures, can be more effective than relying on complex models. We aim to review existing practices and identify their weaknesses, to investigate and demonstrate the usefulness of semantic content-based metadata for robust duplicate identification, and to design and test various deep learning models in order to determine which methods achieve the best performance. Existing studies reveal that machine learning methods outperform traditional information-retrieval techniques by a significant margin, while deep learning approaches achieve higher accuracy but often suffer from limited feature diversity. We collected 535,477 quality pairs, split 70%-10%-20% for training, validation, and testing. Our feature engineering built on DistilBERT yielded a mean correlation of 0.1888, surpassing traditional metadata. We compared six architectures, namely LSTM, CNN, LSTM+Metadata, Hybrid+Attention, Hybrid+Attention with Metadata, and the proposed LSTM+CNN+Metadata, evaluated using accuracy, precision, recall, F1-score, and AUC-ROC. The proposed architecture achieved an F1-score of 92.53%, outperforming the more complex attention-based models. The LSTM also outperformed the CNN, indicating that contextual (sequential) processing is beneficial. These results suggest that feature quality matters more for model performance than model complexity, and they offer a cost-effective, accurate, and scalable method for automated duplicate bug detection in real-world applications.
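
As a rough illustration of two quantities mentioned in the abstract, the sketch below works out the set sizes implied by a 70%-10%-20% split of 535,477 pairs and shows how precision, recall, and F1-score follow from confusion counts. The function names are hypothetical, the rounding rule (floor for training and validation, remainder to test) is an assumption, and the confusion counts are toy values chosen for illustration, not results from the study.

```python
TOTAL_PAIRS = 535_477  # pair count reported in the abstract


def split_sizes(total, train_pct=70, val_pct=10):
    """Return (train, validation, test) sizes.

    Assumption: training and validation sizes are floored and the
    test set takes whatever remains, so the three sizes sum to total.
    """
    n_train = total * train_pct // 100
    n_val = total * val_pct // 100
    return n_train, n_val, total - n_train - n_val


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    print(split_sizes(TOTAL_PAIRS))  # (374833, 53547, 107097)

    # Toy counts (hypothetical): 90 duplicates found, 10 false alarms, 5 missed.
    p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=5)
    print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

Because F1 is the harmonic mean of precision and recall, a high F1-score requires both few false alarms and few missed duplicates, which is why it is a common headline metric for duplicate detection.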


This item appears in the following Collection(s)

ComURS2026 Computing Undergraduate Research Symposium : Abstracts [54]
"Next-Gen Solutions for a Digitally Connected World"
