Multilayer Perceptron-based Source Code Classification

Mohamed, I.; Kumara, B.T.G.S.; Banujan, K.

Faculty of Applied Sciences
Applied Sciences Undergraduate Research Symposium (APSURS) 2022

URI: http://repo.lib.sab.ac.lk:8080/xmlui/handle/susl/3937

Date: 2022-04-06

Abstract:

Implementation is one of the most crucial stages in the software development life cycle, and source code is the most critical component of a software application. Developers either write new source code from scratch or reuse functionality from existing programs, depending on project requirements. In practice, most programmers spend considerable time searching old source files rather than developing new functionality, so an effective and efficient way to search for source code functions is essential. Topic modeling is one way to extract topics from source code. Several topic modeling approaches have been built on statistical modeling techniques, but they have notable limitations: they rely on informal code components such as method names, identifiers, and comments. The syntax of a language is the set of rules that define its structure, and without syntax the semantics of a language are nearly impossible to comprehend. To address these concerns, the authors used a machine-learning algorithm to predict source code functionality names, so that the results depend solely on the syntax, or algorithm, of the source code. This study focuses on three functionalities in Java projects: prime number, selection sort, and Fibonacci number. The data set was acquired from the Git open-source repository, an open-source platform supported by developers worldwide. Four hundred and fifty software projects were analyzed, and 23 variables were considered. Source code components were extracted with the Java parser library, which builds an abstract syntax tree so that source code features can be extracted precisely. An algorithm was then developed to produce count matrices of the source code features. Finally, the data set was fed into an Artificial Neural Network (multilayer perceptron) machine learning model, which yielded 95.4% accuracy, 95.5% precision, 95.4% recall, and a 95.4% F1-score, with a low error rate of 0.033.
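
The count-matrix step described in the abstract can be illustrated with a minimal sketch. It assumes that the abstract syntax tree of each source file has already been walked and reduced to a list of node-type names. The six feature names below are illustrative placeholders only: the abstract does not list the 23 variables used in the study, and the function name count_matrix is hypothetical, not the authors' algorithm.

```python
from collections import Counter

# Hypothetical feature vocabulary (assumption): the study used 23 variables
# that the abstract does not enumerate.
FEATURES = ["ForStmt", "WhileStmt", "IfStmt",
            "MethodCallExpr", "BinaryExpr", "ReturnStmt"]


def count_matrix(node_types_per_file, features=FEATURES):
    """Return one row per source file, one column per feature.

    Each cell holds how many times that node type occurs in the file.
    Node types outside the vocabulary are ignored.
    """
    rows = []
    for node_types in node_types_per_file:
        counts = Counter(node_types)
        rows.append([counts[name] for name in features])
    return rows


# Illustrative input (assumption): node-type names as an AST walk
# might emit them for two small source files.
files = [
    ["ForStmt", "ForStmt", "IfStmt", "BinaryExpr", "BinaryExpr"],
    ["IfStmt", "ReturnStmt", "MethodCallExpr", "MethodCallExpr", "BinaryExpr"],
]
matrix = count_matrix(files)
for row in matrix:
    print(row)
```

Each row of such a matrix, paired with a functionality label, is the kind of fixed-length numeric input a multilayer perceptron classifier can be trained on. Because only syntactic node counts are used, identifiers and comments play no part in the features.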