Comparative evaluation of traditional and neural models for multi-class classification of GitHub commit messages

Tharsa, S.; Wijerathine, P.M.A.K.

Digital Library | SUSL Home
→
Research Publications
→
Proceedings
→
Workshops, Seminars, Symposiums ect
→
Faculty of Computing
→
COMPUTING UNDERGRADUATE RESEARCH SYMPOSIUM
→
ComURS2026 Computing Undergraduate Research Symposium : Abstracts
→
View Item

dc.contributor.author	Tharsa, S.
dc.contributor.author	Wijerathine, P.M.A.K.
dc.date.accessioned	2026-06-03T05:42:09Z
dc.date.available	2026-06-03T05:42:09Z
dc.date.issued	2026-01-28
dc.identifier.isbn	978-624-5727-44-5
dc.identifier.uri	http://repo.lib.sab.ac.lk:8080/xmlui/handle/susl/5314
dc.description.abstract	Commit messages are a critical source of information within software repositories, documenting changes made to code entries. While they are important to empirical software engineering and development cycles, these messages are often highly heterogeneous, informal, and brief. This results in a variety of challenges, especially when analyzing large scale repositories with thousands of commits. The lack of standardization makes it difficult to analyze the content of these repositories, leading to a need and opportunity to automate the processing of commit messages. Automated processing would enable the development of release notes, code change impact analyses, and maintenance analytics. This study evaluates the use of several text classification models aimed at analyzing messages within the software development life cycle (SDLC). In particular, these models attempt to classify the commit messages from the repository hosting service GitHub into ten SLDC tasks: build, chore, continuous integration (ci), documentation (docs), feature (feat), fix, performance (perf), refactor, style, and test. After identifying a public repository containing a balanced dataset of 2000 commit messages, several hours of data cleaning and manual labeling were conducted to ensure a dataset of high quality. Each of the ten classes was assigned 200 commit messages after which the dataset was fully categorized. Across the different model architectures, a stratified split was applied to each. These included the Optuna-tuned TF-IDF + LinearSVM, the Bi-LSTM, and the Fine-tuned BERT, and the Hybrid BERT-SVM model. As for the transformer models, I customarily encountered convergence problems with this domain-specific data, and for this reason I applied the additional simplifying techniques of Layer-wise Learning Rate Decay (LLRD) and a weighted loss for BERT as part of the loss function. BERT (or any transformer model) for the task of sequence classification is known to suffer from convergence problems. The findings indicate that, of all the BERT architectures, the Optuna-tuned TF-IDF + LinearSVM secured the maximum validation accuracy of 0.635 and macro F1 of 0.631. The primary challenge was the generalized language of the developers that blurred the semantic distinctions between chore and refactor and confused all of the models. This sets a solid scholarly foundation for subsequent work that aims to incorporate source code differences to address the existing gaps in classification.	en_US
dc.language.iso	en	en_US
dc.publisher	Faculty of Computing. Sabaragamuwa University of Sri Lanka.	en_US
dc.subject	BERT	en_US
dc.subject	Commit Messages	en_US
dc.subject	Software Repositories	en_US
dc.subject	Support Vector Machine	en_US
dc.subject	Text Classification	en_US
dc.title	Comparative evaluation of traditional and neural models for multi-class classification of GitHub commit messages	en_US
dc.type	Article	en_US