Identify the Characteristics for Categorizing Documents based on Different Writing Styles

Karunarathna, K.M.G.S.; Rupasingha, R.A.H.M.; Kumara, B.T.G.S.

URI: http://repo.lib.sab.ac.lk:8080/xmlui/handle/susl/4624

Date: 2023-12-05

Abstract:

As technology advances, more people turn to the internet to acquire information. The internet offers a wide range of documents, including scholarly papers, academic books, reports, and research articles. In general, however, web documents are not logically organized, which makes retrieving pertinent information from a website difficult and time-consuming. This study was therefore carried out to classify documents by formal and informal writing style based on their characteristics. Each style has specific linguistic variations that can be used to determine whether a document is formal or informal. Before building the model, we identified the characteristics of the informal and formal styles. The study focuses on 15 characteristics, of which six are currently considered: colloquialism, abbreviation, contraction, voice, modal verbs, and phrasal verbs. The experiment used 5,000 data samples consisting of formal news articles, informal letters, and personal blogs. The data were pre-processed in four steps (tokenization, stop word removal, lowercasing, and lemmatization), and four feature extraction methods were applied: TF-IDF, Word2Vec, Doc2Vec, and GloVe. For contractions, an algorithm based on seven rules was created to count the contractions in each document. Modal verbs and phrasal verbs were also counted in every document, and passive voice and active voice were identified separately. Abbreviation and colloquialism were identified by classifying the documents using two target variables. Artificial Neural Network (ANN) and Long Short-Term Memory (LSTM) algorithms were used for classification. For the abbreviation characteristic, Doc2Vec showed the highest accuracy with the ANN algorithm; for the colloquialism characteristic, Doc2Vec showed the highest accuracy with the LSTM algorithm. Six characteristics have been completed so far, while the remaining nine are in progress. Finally, we plan to combine all of the findings to classify documents as formal or informal.
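
To make the counting of style markers concrete, the following is a minimal Python sketch of how contractions and modal verbs might be counted per document. It is an illustration only, not the authors' method: the abstract does not list the seven contraction rules, so the suffix pattern in `CONTRACTION_RE`, the `MODAL_VERBS` list, and the function name `style_features` are all assumptions made for this example.

```python
import re

# Assumed suffix pattern for contractions; the abstract's seven rules are not
# specified. Note that 's also matches possessives such as "John's".
CONTRACTION_RE = re.compile(r"\b\w+(?:n't|'re|'ve|'ll|'d|'m|'s)\b", re.IGNORECASE)

# Simple word tokenizer that keeps an internal apostrophe ("can't" is one token).
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

# Assumed list of core English modal verbs.
MODAL_VERBS = {"can", "could", "may", "might", "must",
               "shall", "should", "will", "would"}


def style_features(text: str) -> dict:
    """Count tokens, contractions, and modal verbs in one document."""
    # Lowercase and map the curly apostrophe to a straight one before matching.
    normalized = text.replace("\u2019", "'").lower()
    tokens = TOKEN_RE.findall(normalized)
    contractions = CONTRACTION_RE.findall(normalized)
    modal_count = sum(1 for token in tokens if token in MODAL_VERBS)
    return {
        "tokens": len(tokens),
        "contractions": len(contractions),
        "modal_verbs": modal_count,
    }


informal = "I can't go, but you should've called. We might leave if it's late."
formal = "The committee will review the proposal and may request revisions."
print(style_features(informal))
print(style_features(formal))
```

In this sketch a contracted modal such as "should've" is counted as a contraction but not as a modal verb, because only whole tokens are compared against the list; a real pipeline would have to decide how to treat such cases.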

Files in this item

Name: Abstact Book_ARS ...

Size: 249.2Kb

Format: PDF

Description: ARS-2023-61

This item appears in the following Collection(s)

ARS 2023 [89]
Abstracts of the 13th Annual Research Session, Sabaragamuwa University of Sri Lanka
