Abstract:
Document layout analysis is the process of identifying and segmenting the elements of a document. Accurate digitization and information extraction depend on effective analysis of complex layouts, particularly in documents that mix diverse elements such as titles, images, paragraphs, tables, and mathematical expressions. Layout analysis alone, however, is insufficient for document digitization: style analysis plays a critical role in preserving structural and typographical integrity, which is essential for accurate text recognition in Sinhala script. This research proposes an enhanced U-Net architecture for the semantic segmentation of Sinhala document layouts and font styles. A dataset of 600 manually annotated Sinhala document images with 27 labels, including Title Level 1, Title Level 2, Title Level 3, Paragraph, Table, Image, Text Bold, and Text Italic, was used. The initial convolutional layers of the U-Net encoder were integrated with a vision transformer block: the input image was divided into patches, which were flattened and processed by the vision transformer block with adaptive positional embedding. These modifications enabled the model to capture long-range dependencies and global context in the input images, potentially improving feature extraction. On the test dataset, the model achieved an accuracy, precision, recall, and F1-score of 79.27%, 71.12%, 69.85%, and 70.48%, respectively. Compared to conventional U-Net models, this approach demonstrated superior segmentation accuracy, particularly on complex document structures. Furthermore, the approach improves optical character recognition (OCR) performance through element-wise integration of OCR technologies, improving text extraction accuracy.
Finally, this study contributes to Sinhala document digitization by providing a comprehensive framework for layout and style analysis, enhancing OCR performance, and offering adaptability for multilingual document processing in real-world applications such as automated archiving and digital library systems.
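
The patch division and flattening step described above can be illustrated with a minimal sketch. It assumes non-overlapping square patches and a single-channel image stored as a list of rows. The abstract does not state the patch size, whether the patches overlap, or how the adaptive positional embedding is computed, so the function name `split_into_patches`, the `patch_size` parameter, and the toy 4x4 input are hypothetical, and the embedding step is omitted.

```python
def split_into_patches(image, patch_size):
    """Split a 2-D grid (list of rows) into non-overlapping square patches.

    Each patch is flattened to a 1-D list in row-major order, which is the
    form a transformer block would typically take as its input sequence.
    Assumption: non-overlapping patches; the paper may use a different scheme.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size window into a single list.
            flat = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(flat)
    return patches


# Toy 4x4 "image" whose pixel values are 0..15 (illustrative only).
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(toy_image, patch_size=2)
```

With this toy input the function returns four flattened patches of four values each. In a full model, each flattened patch would then be linearly projected and combined with a positional embedding before entering the transformer block.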