Identifying Duplicate GitHub Issues in Open-Source Repositories Using Deep Learning

Dharmadasa, T.K.R.S.; Rupasingha, R.A.H.M.; Kumara, B.T.G.S.

Collection: Abstracts of the ComURS2025 Computing Undergraduate Research Symposium 2025 (Faculty of Computing)

URI: http://repo.lib.sab.ac.lk:8080/xmlui/handle/susl/4960

Date: 2025-02-19

Abstract:

GitHub is a popular platform for maintaining software repositories, where users can report bugs, request features, and ask questions in the form of GitHub issues. Because publicly hosted open-source repositories are largely uncoordinated, there is a high likelihood that duplicate issues are created, which can lead to redundant effort. Once a duplicate issue has been identified manually, the standard practice is to add the corresponding duplicate tag and move the issue to the closed-issue section. To replace this manual detection, the study proposes an automated solution that classifies issues as duplicates or non-duplicates using a combination of pre-engineered features and deep learning algorithms. The proposed methodology extracts over 5000 duplicate and non-duplicate GitHub issues using the GitHub Application Programming Interface (API). Creating the dataset was challenging owing to differences among issue reports. After the dataset is pre-processed by lowercasing, stop-word removal, lemmatization, and tokenization, it is converted into a machine-trainable format. A feature vector is then constructed using feature extraction methods such as Term Frequency-Inverse Document Frequency (TF-IDF), Word2Vec, and SBERT, and semantic similarity metrics such as cosine similarity, along with additional features derived from the data. This feature vector serves as the input to different deep learning models, including Long Short-Term Memory (LSTM), Convolutional Neural Network, Artificial Neural Network, and Recurrent Neural Network. In the evaluation, the LSTM model performed best, achieving 88% classification accuracy along with the highest precision, recall, and F-measure and the lowest error rates among the compared models, showcasing its ability to recognize temporal patterns. With the proposed approach, duplicate GitHub issues can be detected automatically, reducing manual effort and preventing repositories from accumulating similar issues while allowing for comprehensive discussion of existing ones.
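
As a rough illustration of one ingredient named in the abstract, the sketch below computes a TF-IDF cosine similarity between pairs of issue titles using only the Python standard library. It is not the authors' implementation: the stop-word list, the smoothed IDF formula, the sample issue titles, and the function names `preprocess`, `tfidf_vectors`, and `cosine` are all assumptions made for this example, and lemmatization is omitted for brevity. In the study, a score like this would be only one entry in a larger feature vector fed to the deep learning models.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the abstract does not name one).
STOP_WORDS = {"a", "an", "the", "is", "on", "in", "to", "of", "and", "when", "it"}


def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (one common variant, assumed here) so every weight stays positive.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (count / len(doc)) * idf[t] for t, count in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical issue titles, invented for this example.
issues = [
    "App crashes when opening the settings page",
    "Crash on opening settings page in the app",
    "Add dark mode support to the editor",
]

vectors = tfidf_vectors([preprocess(text) for text in issues])
sim_dup = cosine(vectors[0], vectors[1])    # likely duplicate pair
sim_other = cosine(vectors[0], vectors[2])  # unrelated pair
print(f"likely duplicates: {sim_dup:.2f}, unrelated: {sim_other:.2f}")
```

The first two titles share most of their terms and therefore score higher than the unrelated pair. Note that "crashes" and "crash" are still treated as different terms here, which is the kind of mismatch the lemmatization step described in the abstract is meant to reduce.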
