Machine learning based detection of software vulnerabilities in C code

Junoj, S.; Wijeratne, P.M.A.K.; Kumara, B.T.G.S.

dc.contributor.author	Junoj, S.
dc.contributor.author	Wijeratne, P.M.A.K.
dc.contributor.author	Kumara, B.T.G.S.
dc.date.accessioned	2025-12-12T10:24:56Z
dc.date.available	2025-12-12T10:24:56Z
dc.date.issued	2025-02-19
dc.identifier.citation	Abstracts of the ComURS2025 Computing Undergraduate Research Symposium 2025, Faculty of Computing, Sabaragamuwa University of Sri Lanka.	en_US
dc.identifier.isbn	978-624-5727-57-5
dc.identifier.uri	http://repo.lib.sab.ac.lk:8080/xmlui/handle/susl/4975
dc.description.abstract	Software security vulnerability detection is an integral part of creating secure and reliable software. The C language is used extensively in system-level and embedded applications for its efficiency and direct control over hardware resources, but it lacks inherent security features, which leaves C programs especially exposed to common classes of vulnerabilities such as buffer overflows and null pointer dereferences. Classic vulnerability detection methods based on manual code review and static analysis tools often fail to discover complicated security bugs. Given these shortcomings, this research presents a machine learning-based approach to automate the detection and classification of vulnerabilities in C code. Datasets were gathered from Kaggle and IEEE DataPort and consist of varied, real-world samples of C code. Feature extraction was performed with the Word2Vec model, which captures semantic and contextual relationships in code better than traditional frequency-based methods. Several machine learning and deep learning models were explored: Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM). A hybrid CNN-LSTM model was also proposed to improve results. The models were developed, trained, validated, and tested using an 80-10-10 split and evaluated on accuracy, precision, recall, and F1-score. The Decision Tree model achieved the highest accuracy in vulnerability detection, 93.46%, while the hybrid CNN-LSTM model performed best in vulnerability classification, with an accuracy of 94.55%. These results indicate that machine learning can significantly enhance software vulnerability detection compared to traditional methods.
The study further discusses how these models can be integrated into real-world software development workflows to improve automated security assessments. Future studies should compare this approach with state-of-the-art vulnerability detection frameworks in order to further tune the machine learning-based security solution.	en_US
dc.language.iso	en	en_US
dc.publisher	Faculty of Computing, Sabaragamuwa University of Sri Lanka	en_US
dc.subject	Convolutional Neural Network	en_US
dc.subject	Vulnerability detection	en_US
dc.subject	vulnerability classification	en_US
dc.subject	Machine learning	en_US
dc.subject	Deep learning	en_US
dc.title	Machine learning based detection of software vulnerabilities in C code	en_US
dc.type	Article	en_US
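
The abstract mentions an 80-10-10 split and evaluation by accuracy, precision, recall, and F1-score. As an illustration only, the following minimal Python sketch shows one way such a split and those metrics could be computed. It is not the authors' code: the function names `split_80_10_10` and `binary_metrics` are hypothetical, the split is assumed to be a shuffled train/validation/test partition, and the metrics assume a binary vulnerable/non-vulnerable label.

```python
import random


def split_80_10_10(samples, seed=0):
    """Shuffle samples and split them 80/10/10 (assumed train/val/test)."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Precision and recall are reported alongside accuracy because vulnerable samples are often a minority class, where accuracy alone can look high even if many vulnerabilities are missed.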