| dc.description.abstract |
Transformer-based models like BERT have greatly advanced natural language processing by
providing deep contextualized language representations. However, strong token-prediction capability and syntactic competence alone do not guarantee a clear grasp of semantic meaning, particularly for phrasal verbs. These expressions are frequently non-compositional and highly context-dependent, making them difficult to learn for models that rely on surface patterns. BERT can predict the next token in a phrasal verb sequence, but it does not always capture the intended meaning. We developed a dataset that combines established phrasal verb definitions with 11,894 sentences generated by large language models to increase contextual diversity and robustness.
Using the pre-trained bert-base-uncased model as a baseline, we applied QLoRA, a lightweight fine-tuning method that combines 4-bit quantization with low-rank adapters, to enable efficient adaptation on limited hardware.
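As a rough illustration of this adaptation step, the sketch below shows how a QLoRA-style setup (4-bit quantized base model plus trainable LoRA adapters) could be attached to bert-base-uncased for masked-token training on phrasal verb sentences; the data file name, adapter targets, and hyperparameters are illustrative assumptions, not the exact configuration used in this work.

```python
# Illustrative QLoRA-style fine-tuning of bert-base-uncased for masked-token
# prediction on phrasal verb sentences. File names and hyperparameters are
# assumptions for this sketch, not the configuration reported in the thesis.
import torch
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the frozen base model in 4-bit NF4 precision (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForMaskedLM.from_pretrained(model_name, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# Train only small low-rank adapters on the attention projections.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                                         target_modules=["query", "value"]))

# Hypothetical JSON-lines file with one LLM-generated sentence per record.
dataset = load_dataset("json", data_files="phrasal_verb_sentences.jsonl")["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
                      remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-pv-qlora", per_device_train_batch_size=16,
                           num_train_epochs=3, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```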
Model performance was assessed with a range of semantic and lexical similarity metrics. Although baseline performance was only moderate, fine-tuning yielded notable improvements across all metrics, considerably strengthening BERT's ability to capture subtle phrasal verb meanings.
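Because the abstract does not name the individual measures, the sketch below shows two representative placeholders of the kind of semantic and lexical similarity scoring described: cosine similarity between mean-pooled BERT embeddings (semantic) and token-level Jaccard overlap (lexical). The example strings are hypothetical.

```python
# Illustrative semantic and lexical similarity measures for comparing a predicted
# phrasal verb sense against a reference definition. These are placeholder metrics,
# not the exact evaluation suite used in this work.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Mean-pooled BERT embedding of a sentence."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state      # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)          # (1, seq_len, 1)
    return (hidden * mask).sum(1) / mask.sum(1)            # (1, 768)

def semantic_similarity(prediction: str, reference: str) -> float:
    """Cosine similarity between pooled embeddings (semantic criterion)."""
    return F.cosine_similarity(embed(prediction), embed(reference)).item()

def lexical_similarity(prediction: str, reference: str) -> float:
    """Token-level Jaccard overlap (lexical criterion)."""
    p, r = set(prediction.lower().split()), set(reference.lower().split())
    return len(p & r) / len(p | r) if p | r else 0.0

# Hypothetical prediction/reference pair for the phrasal verb "put off".
pred = "to postpone something until a later time"
ref = "put off: delay doing something until later"
print(semantic_similarity(pred, ref), lexical_similarity(pred, ref))
```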
Future work will extend this approach to more diverse datasets, additional multi-word expressions, and multilingual contexts. |
en_US |