Sabaragamuwa University of Sri Lanka

Automated log parsing and anomaly detection using BERT and GPT-2: A large language model approach for IT systems

dc.contributor.author Sathyanjana, W.W.N.C.
dc.contributor.author Gunawardhane, H.M.K.T.
dc.contributor.author Kumara, B.T.G.S.
dc.date.accessioned 2026-01-17T08:14:27Z
dc.date.available 2026-01-17T08:14:27Z
dc.date.issued 2025-12-03
dc.identifier.issn 2815-0341
dc.identifier.uri http://repo.lib.sab.ac.lk:8080/xmlui/handle/susl/5187
dc.description.abstract System logs play a critical role in diagnosing, securing, and optimising an IT system’s operations. However, traditional log analysis methods, such as rule-based systems, lack the capability to handle the increasing volume, heterogeneity, and complexity of log data. In this research, the scalability and adaptability problems of log analysis were addressed by proposing an automated framework powered by large language models (LLMs). BERT and GPT-2 were chosen as robust baselines for systematically assessing LLM-based log analysis, despite the existence of more recent LLMs. The objective was to evaluate how well these models can extract structured templates from unstructured logs and detect anomalies with minimal human intervention. This study focused on diverse IT log structures, including system, application, and security logs. The methodology involved the collection of more than 32,000 log messages from 16 public sources, including LogHub, GitHub, and Kaggle. Data preprocessing followed an extended pipeline consisting of timestamp normalisation, noise removal, structural parsing, domain normalisation, tokenisation, and padding. The log entries were encoded into template labels through TF-IDF and embeddings, and both models were fine-tuned using transfer learning. Isolation Forest was chosen for anomaly detection because it handles high-dimensional log data effectively and can identify rare anomalies without the need for labelled samples. Further, it separates unusual patterns in large-scale heterogeneous logs more effectively than clustering or statistical techniques. The comparative analysis reveals that BERT performed exceptionally well (96.61%) on structured logs, while GPT-2 was more resilient (97.44%) on unstructured data and achieved the higher overall accuracy. These findings suggest that these models can automate log analysis, with each excelling in a different situation.
This study demonstrates the practical value of applying LLMs to log parsing and anomaly detection, providing real-time, scalable solutions that require minimal manual intervention. On the theoretical level, it extends transformer models beyond typical NLP domains; on the practical level, it integrates them into enterprise IT monitoring. Future work includes ensemble models, Explainable AI, and integration with real-time streaming. en_US
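The preprocessing, TF-IDF encoding, and Isolation Forest steps summarised in the abstract can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the paper's actual pipeline: the log messages, the timestamp pattern, and the contamination rate are all invented for demonstration, and the paper's BERT/GPT-2 fine-tuning stage is omitted.

```python
# Minimal sketch of the pipeline described in the abstract:
# timestamp normalisation, TF-IDF encoding of log text, and
# unsupervised anomaly detection with Isolation Forest.
# All log messages and parameters here are illustrative.
import re

from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer

raw_logs = [
    "2025-12-03 08:14:27 INFO connection established to db-01",
    "2025-12-03 08:14:29 INFO connection established to db-02",
    "2025-12-03 08:14:31 INFO user login succeeded for alice",
    "2025-12-03 08:14:35 INFO user login succeeded for bob",
    "2025-12-03 08:14:40 ERROR segfault in worker 7: bad opcode",
]

# Timestamp normalisation / noise removal (one simplified step
# of the extended preprocessing pipeline).
ts = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\s+")
logs = [ts.sub("", line) for line in raw_logs]

# Encode the unstructured log text as TF-IDF feature vectors.
X = TfidfVectorizer().fit_transform(logs).toarray()

# Isolation Forest flags rare patterns without labelled samples:
# fit_predict returns -1 for anomalies and 1 for normal entries.
clf = IsolationForest(contamination=0.2, random_state=0)
labels = clf.fit_predict(X)

for message, label in zip(logs, labels):
    print(("ANOMALY  " if label == -1 else "normal   ") + message)
```

The key property motivating Isolation Forest in the abstract is visible here: no labels are supplied, and the detector isolates whichever messages are structurally unlike the rest of the corpus.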
dc.language.iso en en_US
dc.publisher Sabaragamuwa University of Sri Lanka en_US
dc.subject Anomaly detection en_US
dc.subject Large Language Models (LLM) en_US
dc.subject Log analysis en_US
dc.subject Natural Language Processing (NLP) en_US
dc.subject Transformer models en_US
dc.title Automated log parsing and anomaly detection using BERT and GPT-2: A large language model approach for IT systems en_US
dc.type Article en_US

