OCR2SEQ: A NOVEL MULTI-MODAL DATA AUGMENTATION PIPELINE FOR WEAK SUPERVISION

Publisher
Florida Atlantic University
Date Issued
2023
EDTF Date Created
2023
Description
With the recent large-scale adoption of Large Language Models in multidisciplinary research and commercial applications, the need for large amounts of labeled data has become more crucial than ever for evaluating potential use cases in applied intelligence. Most domain-specific fields require a substantial shift, involving extremely large amounts of heterogeneous data, to have a meaningful impact on the pre-computed weights of most large language models. We explore extending the capabilities of a state-of-the-art unsupervised pre-training method, the Transformer-based Sequential Denoising Auto-Encoder (TSDAE). In this study we show various opportunities for using OCR2Seq, a multi-modal generative augmentation strategy, to further enhance and measure the quality of the noise samples used when TSDAE serves as a pre-training task. This study is a first-of-its-kind work that leverages the conversion of both generalized and sparse domains of relational data into multi-modal sources. Our primary objective is to measure the quality of augmentation in relation to the current implementation of the sentence transformers library. Further work includes the effect on ranking, language understanding, and corrective quality.
Note

Includes bibliography.

Language
Type
Extent
63 p.
Identifier
FA00014367
Rights

Copyright © is held by the author with permission granted to Florida Atlantic University to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.

Additional Information
Thesis (MS)--Florida Atlantic University, 2023.
FAU Electronic Theses and Dissertations Collection
Extension


FAU

Person Preferred Name

Lowe, Michael A.

author

Graduate College
Physical Description

application/pdf
63 p.
Place

Boca Raton, Fla.