Sabaragamuwa University of Sri Lanka

Identifying Duplicate GitHub Issues in Open-Source Repositories Using Deep Learning


dc.contributor.author Dharmadasa, T.K.R.S.
dc.contributor.author Rupasingha, R.A.H.M.
dc.contributor.author Kumara, B.T.G.S.
dc.date.accessioned 2025-12-12T09:26:03Z
dc.date.available 2025-12-12T09:26:03Z
dc.date.issued 2025-02-19
dc.identifier.citation Abstracts of the ComURS2025 Computing Undergraduate Research Symposium 2025, Faculty of Computing, Sabaragamuwa University of Sri Lanka. en_US
dc.identifier.isbn 978-624-5727-57-5
dc.identifier.uri http://repo.lib.sab.ac.lk:8080/xmlui/handle/susl/4960
dc.description.abstract GitHub is a popular platform for maintaining software repositories, where users can report bugs, feature requests, and questions in the form of GitHub issues. Because publicly hosted open-source repositories are uncoordinated, there is a high likelihood that duplicate issues are created, which can lead to redundant effort. Once a duplicate issue has been identified manually, the standard practice is to add the corresponding duplicate tag and close the issue. To replace this manual detection, the study proposes an automated solution that classifies issues as duplicates or non-duplicates using a combination of pre-engineered features and deep learning algorithms. The proposed methodology extracts over 5,000 duplicate and non-duplicate GitHub issues using the GitHub Application Programming Interface (API). Creating the dataset was challenging because issue reports differ from one another. The dataset is pre-processed by lowercasing, stop-word removal, lemmatization, and tokenization, and is then converted into a machine-trainable format. A feature vector is constructed using feature extraction methods such as Term Frequency-Inverse Document Frequency (TF-IDF), Word2Vec, and SBERT, semantic similarity metrics such as cosine similarity, and additional features derived from the data. This feature vector serves as the input to several deep learning models: Long Short-Term Memory (LSTM), Convolutional Neural Network, Artificial Neural Network, and Recurrent Neural Network. In the evaluation, the LSTM model performed best, achieving 88% classification accuracy along with the highest precision, recall, and F-measure and the lowest error rates among the models compared, which the authors attribute to its ability to recognize temporal patterns. With the proposed approach, duplicate GitHub issues can be detected automatically, reducing manual effort and preventing repositories from accumulating similar issues while allowing comprehensive discussion on existing ones. en_US
dc.language.iso en en_US
dc.publisher Faculty of Computing, Sabaragamuwa University of Sri Lanka en_US
dc.subject Deep learning en_US
dc.subject Duplicate detection en_US
dc.subject GitHub Issues en_US
dc.title Identifying Duplicate GitHub Issues in Open-Source Repositories Using Deep Learning en_US
dc.type Article en_US
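
The abstract lists TF-IDF and cosine similarity among the features computed for a pair of issues. The following is a minimal sketch of that one step, not the authors' implementation. It uses only the Python standard library. The stop-word list, the smoothed IDF formula, the sample issue texts, and all function names are assumptions made for illustration. Lemmatization, Word2Vec, SBERT, and the deep learning classifiers are left out.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the study's list is not given).
STOP_WORDS = {"a", "an", "the", "is", "on", "in", "to", "when", "and", "of", "it"}


def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [preprocess(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            # Relative term frequency times a smoothed IDF (assumed variant).
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical issue titles, invented for this example.
issues = [
    "App crashes when uploading a large file",
    "Crash on uploading large file in the app",
    "Add dark mode to the settings page",
]
vectors = tfidf_vectors(issues)
likely_duplicate_score = cosine_similarity(vectors[0], vectors[1])
unrelated_score = cosine_similarity(vectors[0], vectors[2])
```

In this sketch the first two titles share most of their terms and score well above the unrelated third title, which shares none. Because there is no lemmatization here, "crashes" and "crash" count as different terms. In a pipeline like the one described, a score of this kind would be one entry in the feature vector passed to the classifier rather than a decision on its own.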

