Lowe, Michael A.

Relationships
Member of: Graduate College
Person Preferred Name
Lowe, Michael A.
Model
Digital Document
Publisher
Florida Atlantic University
Description
With the recent large-scale adoption of Large Language Models in multidisciplinary research and commercial applications, the need for large amounts of labeled data to evaluate potential use cases in applied intelligence has become more pressing than ever. Adapting to most domain-specific fields requires a substantial shift, involving extremely large amounts of heterogeneous data, before training has a meaningful impact on the pre-trained weights of most large language models. We explore extending the capabilities of a state-of-the-art unsupervised pre-training method: the Transformer-based Sequential Denoising Auto-Encoder (TSDAE). In this study we show various opportunities for using OCR2Seq, a multi-modal generative augmentation strategy, to further enhance and measure the quality of the noise samples used when TSDAE serves as a pre-training task. This study is a first-of-its-kind work that converts both generalized and sparse domains of relational data into multi-modal sources. Our primary objective is to measure the quality of this augmentation relative to the current implementation of the sentence-transformers library. Further work will examine its effects on ranking, language understanding, and corrective quality.
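As a rough illustration of the pre-training setup the abstract describes, the sketch below uses the sentence-transformers library's DenoisingAutoEncoderDataset and DenoisingAutoEncoderLoss for TSDAE training. The ocr2seq_noise function and the sample sentences are hypothetical stand-ins: OCR2Seq is not part of the library, and the abstract does not specify how its noise is injected.

```python
# Minimal TSDAE pre-training sketch with the sentence-transformers library.
# `ocr2seq_noise` is a hypothetical placeholder for the OCR2Seq augmentation;
# here it falls back to the library's default token-deletion noise.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

def ocr2seq_noise(sentence: str) -> str:
    # Placeholder: a real implementation would inject OCR-style noise here.
    return datasets.DenoisingAutoEncoderDataset.delete(sentence)

# Unlabeled corpus (stand-in sentences).
train_sentences = [
    "Invoice 4821 was issued to the Jacksonville office.",
    "The shipment left the warehouse on March 3rd.",
]

# Encoder with CLS pooling, as recommended in the TSDAE paper.
word_embedding_model = models.Transformer("bert-base-uncased")
pooling = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling])

# Each item pairs a noised sentence (input) with the original (reconstruction target).
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences, noise_fn=ocr2seq_noise)
train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)

# Denoising auto-encoder loss with tied encoder/decoder weights.
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path="bert-base-uncased", tie_encoder_decoder=True
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    weight_decay=0,
    show_progress_bar=True,
)
```

Swapping noise_fn is the natural extension point for comparing augmentation strategies, since the rest of the TSDAE objective is unchanged.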