Abstract:
System logs play a critical role in diagnosing, securing, and optimising an IT system’s operations.
However, traditional log analysis methods, such as rule-based systems, lack the capacity
to handle the increasing volume, heterogeneity and complexity of log data. In this research,
the scalability and adaptability problems of log analysis were addressed by proposing an automated
framework powered by large language models (LLMs). BERT and GPT-2 were chosen as robust baselines
to systematically assess LLM-based log analysis, despite the existence of more recent LLMs.
The objective was to evaluate how well these models can extract structured templates from unstructured
logs and detect anomalies with minimum human intervention. This study focused
on diverse IT log structures, including system, application and security logs. The methodology
involved collecting more than 32,000 log messages from 16 public sources, including
LogHub, GitHub and Kaggle. Data preprocessing included an extended pipeline consisting of
timestamp normalisation, noise removal, structural parsing, domain normalisation, tokenisation
and padding. The log entries were encoded into template labels using TF-IDF and embeddings,
and both models were fine-tuned via transfer learning. Isolation Forest was
chosen for anomaly detection as it is effective at handling high-dimensional log data and can
identify rare anomalies without the need for labelled samples. Further, it separates
unusual patterns in large-scale heterogeneous logs more effectively than clustering or statistical techniques.
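The encoding and detection stages described above can be sketched as follows; this is a minimal illustration assuming scikit-learn, and the sample log messages and contamination rate are hypothetical placeholders rather than the study's actual data or settings.

```python
# Sketch of the TF-IDF + Isolation Forest stage: raw log messages are
# vectorised, then rare patterns are flagged without labelled samples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import IsolationForest

# Illustrative log entries (not from the study's dataset).
logs = [
    "INFO service started on port 8080",
    "INFO service started on port 8081",
    "INFO health check passed for worker 1",
    "INFO health check passed for worker 2",
    "INFO service started on port 8082",
    "INFO health check passed for worker 3",
    "ERROR segmentation fault in worker process",  # rare, anomalous entry
]

# Encode each message as a high-dimensional TF-IDF vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(logs)

# Isolation Forest isolates rare points with few random splits;
# fit_predict returns -1 for anomalies and 1 for normal entries.
detector = IsolationForest(contamination=0.15, random_state=0)
labels = detector.fit_predict(X.toarray())
```

Because Isolation Forest is unsupervised, no labelled anomalies are needed; the `contamination` parameter only sets the expected anomaly fraction used to threshold the isolation scores.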
The comparative analysis reveals that BERT performed exceptionally well on structured logs
(96.61%), while GPT-2 was more resilient on unstructured data (97.44%) and achieved the
higher overall accuracy. These findings suggest that these models can automate
log analysis, with each model excelling in a different situation. This study demonstrates
the practical value of applying LLMs to log parsing and anomaly detection, providing real-time,
scalable solutions that require minimal manual intervention. On the theoretical level, it expands
transformer models beyond typical NLP domains; on the practical level, it integrates them
into enterprise IT monitoring. Future work includes ensemble models, Explainable AI and integration
with real-time streaming.