Sabaragamuwa University of Sri Lanka

Identify the Characteristics for Categorizing Documents based on Different Writing Styles


dc.contributor.author Karunarathna, K.M.G.S.
dc.contributor.author Rupasingha, R.A.H.M.
dc.contributor.author Kumara, B.T.G.S.
dc.date.accessioned 2024-12-12T06:36:14Z
dc.date.available 2024-12-12T06:36:14Z
dc.date.issued 2023-12-05
dc.identifier.citation 13th Annual Research Session of the Sabaragamuwa University of Sri Lanka en_US
dc.identifier.isbn 978-624-5727-41-4
dc.identifier.uri http://repo.lib.sab.ac.lk:8080/xmlui/handle/susl/4624
dc.description.abstract As technology advances, more people turn to the internet to acquire information. There they can find a wide range of documents, including scholarly papers, academic books, reports, and research articles. In general, however, web documents are not logically organized, which makes retrieving pertinent information from a website difficult and time-consuming. This study was therefore carried out to classify documents as formal or informal in writing style based on their characteristics. Each style has specific linguistic variations that can be used to determine whether a document is formal or informal, so the characteristics of the two styles were identified before the model was built. The study targets 15 characteristics, of which six are considered so far: Colloquialism, Abbreviation, Contraction, Voice, Modal Verbs, and Phrasal Verbs. The experiment used a data set of 5,000 documents consisting of formal news articles, informal letters, and personal blogs. The documents were pre-processed in four steps (tokenization, stop word removal, lowercasing, and lemmatization), and four feature extraction methods were applied: TF-IDF, Word2Vec, Doc2Vec, and GloVe. For the contraction characteristic, an algorithm based on seven rules was created to count the contractions in each document. Modal verbs and phrasal verbs were also counted in every document, and passive voice and active voice were identified separately. Abbreviation and colloquialism were identified by classifying the documents using two target variables. Artificial Neural Network (ANN) and Long Short-Term Memory (LSTM) algorithms were used for the classification process. For the abbreviation characteristic, Doc2Vec showed the highest accuracy with the ANN algorithm; for the colloquialism characteristic, Doc2Vec showed the highest accuracy with the LSTM algorithm. Six characteristics have been completed in this approach, while the remaining nine are still being worked on. Finally, the plan is to combine all of the findings to classify documents as formal or informal. en_US
dc.description.sponsorship ATA INTERNATIONAL LTD and Ceydigital en_US
dc.language.iso en en_US
dc.publisher Sabaragamuwa University of Sri Lanka, Belihuloya. en_US
dc.subject Classification en_US
dc.subject Formal writing style en_US
dc.subject Informal writing style en_US
dc.subject Linguistic variation en_US
dc.subject Machine learning en_US
dc.title Identify the Characteristics for Categorizing Documents based on Different Writing Styles en_US
dc.type Other en_US
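
The counting steps described in the abstract (contractions and modal verbs per document) could be sketched roughly as follows. This is a minimal illustration only: the abstract does not list the seven contraction rules, so the suffix patterns, the modal verb list, and the function names here are assumptions rather than the authors' method.

```python
import re

# Hypothetical suffix patterns for illustration; the abstract does not
# specify the seven rules used in the study.
CONTRACTION_RE = re.compile(
    r"\b[a-z]+(?:n't|'re|'ve|'ll|'d|'m|'s)\b", re.IGNORECASE
)

# Assumed list of core English modal verbs.
MODAL_VERBS = {
    "can", "could", "may", "might", "must",
    "shall", "should", "will", "would",
}


def normalize_apostrophes(text: str) -> str:
    """Map the typographic apostrophe to the ASCII one."""
    return text.replace("\u2019", "'")


def count_contractions(text: str) -> int:
    """Count tokens that end in one of the assumed contraction suffixes.

    Limitation: the 's pattern also matches possessives such as "John's".
    """
    return len(CONTRACTION_RE.findall(normalize_apostrophes(text)))


def count_modal_verbs(text: str) -> int:
    """Count uncontracted modal verbs; "can't" stays one token and is skipped."""
    tokens = re.findall(r"[a-z']+", normalize_apostrophes(text).lower())
    return sum(1 for token in tokens if token in MODAL_VERBS)


if __name__ == "__main__":
    sample = "I can't go, but it's fine. We'll see; you should ask and she might help."
    print(count_contractions(sample), count_modal_verbs(sample))
```

Counts like these would serve as per-document features alongside the embedding-based features named in the abstract; how the study actually combines them is not stated in this record.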


This item appears in the following Collection(s)

  • ARS 2023 [89]
    Abstracts of the 13th Annual Research Session, Sabaragamuwa University of Sri Lanka
