Machine learning based detection of software vulnerabilities in C code

Junoj, S.; Wijeratne, P.M.A.K.; Kumara, B.T.G.S.

URI: http://repo.lib.sab.ac.lk:8080/xmlui/handle/susl/4975

Date: 2025-02-19

Abstract:

Detecting security vulnerabilities is integral to building secure and reliable software. C is used extensively in system-level and embedded applications for its efficiency and direct control over hardware resources, but it lacks built-in safety features, which leaves C programs especially prone to common classes of vulnerabilities such as buffer overflows and null pointer dereferences. Classic detection methods based on manual code review and static analysis tools are often unable to discover complex security bugs. To address these shortcomings, this research presents a machine learning-based approach that automates the detection and classification of vulnerabilities in C code. Datasets consisting of varied, real-world C code samples were gathered from Kaggle and IEEE DataPort. Features were extracted with the Word2Vec model, which captures semantic and contextual relationships in code better than traditional frequency-based methods. Several machine learning and deep learning models were explored: Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM). A hybrid CNN-LSTM model was then proposed to improve results. The models were trained, validated, and tested using an 80-10-10 split and evaluated on accuracy, precision, recall, and F1-score. The Decision Tree model achieved the highest accuracy in vulnerability detection, 93.46%, while the hybrid CNN-LSTM model performed best in vulnerability classification, with an accuracy of 94.55%. These results indicate that machine learning can significantly enhance software vulnerability detection compared to traditional methods. The study further discusses how these models can be integrated into real-world software development workflows to improve automated security assessments. Future studies should compare this approach with state-of-the-art vulnerability detection frameworks in order to further tune the machine learning-based security solution.
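
As an illustration of the evaluation protocol named in the abstract, the following is a minimal sketch of an 80-10-10 split and of the four reported metrics for the binary detection task. It is not the authors' code: the function names `split_80_10_10` and `binary_metrics`, the fixed seed, and the label convention (1 = vulnerable, 0 = not vulnerable) are assumptions made for this example. The abstract does not say how the metrics were averaged for the multi-class classification task, so that case is not shown.

```python
import random


def split_80_10_10(samples, seed=0):
    """Shuffle samples and split them into train/validation/test (80/10/10).

    Hypothetical helper for illustration; the seed is an assumption.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10   # 80% for training
    n_val = len(items) // 10         # 10% for validation
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]   # remainder (about 10%) for testing
    return train, val, test


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    train, val, test = split_80_10_10(range(100))
    print(len(train), len(val), len(test))
    # Toy labels only; these are not results from the study.
    print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Reporting precision and recall alongside accuracy matters when vulnerable and non-vulnerable samples are imbalanced, since accuracy alone can look high for a model that rarely flags the minority class.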
