Identify the Characteristics for Categorizing Documents based on Different Writing Styles

Karunarathna, K.M.G.S.; Rupasingha, R.A.H.M.; Kumara, B.T.G.S.

dc.contributor.author	Karunarathna, K.M.G.S.
dc.contributor.author	Rupasingha, R.A.H.M.
dc.contributor.author	Kumara, B.T.G.S.
dc.date.accessioned	2024-12-12T06:36:14Z
dc.date.available	2024-12-12T06:36:14Z
dc.date.issued	2023-12-05
dc.identifier.citation	13th Annual Research Session of the Sabaragamuwa University of Sri Lanka	en_US
dc.identifier.isbn	978-624-5727-41-4
dc.identifier.uri	http://repo.lib.sab.ac.lk:8080/xmlui/handle/susl/4624
dc.description.abstract	As technology advances, more people turn to the internet to acquire information. The internet offers a wide range of documents, including scholarly papers, academic books, reports, and research articles. However, web documents are generally not logically organized, which makes retrieving pertinent information from a website difficult and time-consuming. This study therefore classifies documents by formal and informal writing style on the basis of their characteristics. Each style has specific linguistic variations that can be used to determine whether a document is formal or informal. Before building the model, we identified the characteristics of the formal and informal styles. The study focuses on 15 characteristics, of which six are considered so far: Colloquialism, Abbreviation, Contraction, Voice, Modal Verbs, and Phrasal Verbs. The experiment used 5000 data items consisting of formal news articles, informal letters, and personal blogs. These were pre-processed in four steps (tokenization, stop word removal, lowercasing, and lemmatization), and four feature extraction methods were applied: TF-IDF, Word2Vec, Doc2Vec, and GloVe. For contractions, an algorithm based on seven rules was created to count the contractions in each document. Modal verbs and phrasal verbs were also counted in every document, and passive and active voice were identified separately. Abbreviation and colloquialism were identified by classifying the documents using two target variables. Artificial Neural Network (ANN) and Long Short-Term Memory (LSTM) algorithms were used for classification. For the abbreviation characteristic, Doc2Vec showed the highest accuracy with the ANN algorithm; for the colloquialism characteristic, Doc2Vec showed the highest accuracy with the LSTM algorithm.
So far, six characteristics have been completed, while the remaining nine are in progress. Finally, we plan to combine all of the findings to classify documents as formal or informal.	en_US
dc.description.sponsorship	ATA INTERNATIONAL LTD and Ceydigital	en_US
dc.language.iso	en	en_US
dc.publisher	Sabaragamuwa University of Sri Lanka, Belihuloya.	en_US
dc.subject	Classification	en_US
dc.subject	Formal writing style	en_US
dc.subject	Informal writing style	en_US
dc.subject	Linguistic variation	en_US
dc.subject	Machine learning	en_US
dc.title	Identify the Characteristics for Categorizing Documents based on Different Writing Styles	en_US
dc.type	Other	en_US
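
The abstract above describes counting contractions and modal verbs in each document as style features. The following is a minimal Python sketch of that kind of counting, not the authors' implementation. The abstract does not list the seven contraction rules, so the suffix patterns below are illustrative assumptions, as are the modal verb list and the name count_style_features.

```python
import re

# Assumed list of core English modal verbs; the study's own list is not given.
MODAL_VERBS = {"can", "could", "may", "might", "must",
               "shall", "should", "will", "would"}

# Illustrative contraction suffixes. These are NOT the seven rules from the
# study, which the abstract does not specify. Note that 's also matches
# possessives (e.g. "John's"), so this pattern over-counts in that case.
CONTRACTION_RE = re.compile(r"\b[a-z]+(?:n't|'re|'ve|'ll|'d|'m|'s)\b",
                            re.IGNORECASE)

# A word, optionally followed by an apostrophe and more letters.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?", re.IGNORECASE)


def count_style_features(text):
    """Return simple per-document counts usable as style features."""
    # Treat typographic apostrophes like plain ones.
    normalized = text.replace("\u2019", "'")
    tokens = [t.lower() for t in TOKEN_RE.findall(normalized)]
    contractions = CONTRACTION_RE.findall(normalized)
    # Only uncontracted modals are counted; "can't" is counted as a
    # contraction, not as the modal "can".
    modals = [t for t in tokens if t in MODAL_VERBS]
    return {
        "tokens": len(tokens),
        "contractions": len(contractions),
        "modal_verbs": len(modals),
    }


if __name__ == "__main__":
    sample = "I can't go today, but we should meet soon. It's fine, they will wait."
    print(count_style_features(sample))
```

Counts like these could be normalized by document length before being combined with other features, although the abstract does not say whether the study does so.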