Natural language processing (Computer science)

OCR2SEQ: A NOVEL MULTI-MODAL DATA AUGMENTATION PIPELINE FOR WEAK SUPERVISION

Model

Digital Document

Publisher

Florida Atlantic University

Description

With the recent large-scale adoption of Large Language Models in multidisciplinary research and commercial settings, the need for large amounts of labeled data has become more crucial than ever for evaluating potential use cases in applied intelligence. Most domain-specific fields require a substantial shift that involves extremely large amounts of heterogeneous data to have a meaningful impact on the pre-computed weights of most large language models. We explore extending the capabilities of a state-of-the-art unsupervised pre-training method, the Transformer-based Sequential Denoising Auto-Encoder (TSDAE). In this study we show various opportunities for using OCR2Seq, a multi-modal generative augmentation strategy, to further enhance and measure the quality of the noise samples used when TSDAE serves as a pretraining task. This study is a first-of-its-kind work that converts both generalized and sparse domains of relational data into multi-modal sources. Our primary objective is to measure the quality of augmentation relative to the current implementation in the Sentence Transformers library. Further work includes the effect on ranking, language understanding, and corrective quality.
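For readers unfamiliar with the denoising setup mentioned above, the following is a minimal sketch of the kind of token-deletion noise a denoising auto-encoder is trained to undo. It is not the OCR2Seq pipeline and not the library's own code: the function names `delete_tokens` and `make_pairs`, the whitespace tokenization, and the 0.6 deletion ratio are all assumptions chosen for illustration.

```python
import random


def delete_tokens(sentence, del_ratio=0.6, rng=None):
    """Return a noisy copy of `sentence` with about `del_ratio` of its tokens removed.

    Hypothetical helper: whitespace tokenization and the 0.6 default are
    illustrative assumptions, not values taken from the thesis.
    """
    if rng is None:
        rng = random.Random()
    tokens = sentence.split()
    if not tokens:
        return sentence
    # Keep each token independently with probability (1 - del_ratio).
    kept = [tok for tok in tokens if rng.random() >= del_ratio]
    # Never return an empty string for a non-empty input: keep one random token.
    if not kept:
        kept = [rng.choice(tokens)]
    return " ".join(kept)


def make_pairs(sentences, del_ratio=0.6, seed=0):
    """Build (noisy, clean) pairs; a denoising model learns to map noisy -> clean."""
    rng = random.Random(seed)
    return [(delete_tokens(s, del_ratio, rng), s) for s in sentences]


if __name__ == "__main__":
    for noisy, clean in make_pairs(["the quick brown fox jumps over the lazy dog"]):
        print(repr(noisy), "->", repr(clean))
```

An augmentation strategy of the kind the abstract describes would replace `delete_tokens` with a different source of corrupted text, while the (noisy, clean) pairing stays the same.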

Member of

FAU Theses and Dissertations

A COMPARATIVE STUDY OF STRUCTURED VERSUS UNSTRUCTURED TEXT DATA

Model

Digital Document

Cardenas, Erika

Khoshgoftaar, Taghi M.

Publisher

Florida Atlantic University

Description

In today’s world, data is generated at an unprecedented rate, and a significant portion of it is unstructured text data. The recent advancements in Natural Language Processing have enabled computers to understand and interpret human language. Data mining techniques were once unable to use text data due to the high dimensionality of text processing models. This limitation was overcome with the ability to represent data as text. This thesis aims to compare the predictive performance of structured versus unstructured text data in two different applications. The first application is in the field of real estate. We compare the performance of tabular real-estate data and unstructured text descriptions of homes to predict the house price. The second application is in translating Electronic Health Records (EHR) tabular data to text data for survival classification of COVID-19 patients. Lastly, we present a range of strategies and perspectives for future research.
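As a rough illustration of what translating tabular data to text can look like, the sketch below serializes one record into a sentence-like string that a text model could consume. The template, the function name `row_to_text`, and the column names are hypothetical and are not taken from the thesis.

```python
def row_to_text(row, missing="unknown"):
    """Serialize one tabular record (a dict of column -> value) as plain text.

    Hypothetical template: "<column> is <value>" clauses joined by semicolons.
    """
    parts = []
    for column, value in row.items():
        label = column.replace("_", " ")
        # Represent empty cells explicitly instead of dropping the column.
        if value is None or value == "":
            value = missing
        parts.append(f"{label} is {value}")
    return "; ".join(parts) + "."


if __name__ == "__main__":
    # Illustrative real-estate record; the column names are assumptions.
    listing = {"bedrooms": 3, "square_feet": 1450, "city": "Boca Raton", "pool": None}
    print(row_to_text(listing))
    # bedrooms is 3; square feet is 1450; city is Boca Raton; pool is unknown.
```

The same function applies unchanged to other tabular sources, such as a patient record, because it only depends on column names and cell values.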

Member of

FAU Theses and Dissertations

Context-based Image Concept Detection and Annotation

Model

Digital Document

Zolghadr, Esfandiar

Furht, Borko

Publisher

Florida Atlantic University

Description

Scene understanding attempts to produce a textual description of visible and
latent concepts in an image to describe the real meaning of the scene. Concepts are
either objects, events or relations depicted in an image. To recognize concepts, the
decisions of an object detection algorithm must be further enhanced from visual
similarity to semantical compatibility. Semantically relevant concepts convey the
most consistent meaning of the scene.
Object detectors analyze visual properties (e.g., pixel intensities, texture, color
gradient) of sub-regions of an image to identify objects. The initially assigned
object names must be further examined to ensure they are compatible with each
other and the scene. By enforcing inter-object dependencies (e.g., co-occurrence,
spatial and semantical priors) and object to scene constraints as background
information, a concept classifier predicts the most semantically consistent set of
names for discovered objects. The additional background information that describes
concepts is called context.
In this dissertation, a framework for building context-based concept detection is
presented that uses a combination of multiple contextual relationships to refine the
result of underlying feature-based object detectors to produce the most semantically compatible concepts.
In addition to lacking the ability to capture semantical dependencies, object
detectors suffer from the high dimensionality of the feature space, which degrades their performance.
Variances in the image (i.e., quality, pose, articulation, illumination, and occlusion)
can also result in low-quality visual features that impact the accuracy of detected
concepts.
The object detectors used to build the context-based framework experiments in this
study are based on state-of-the-art generative and discriminative graphical
models. The relationships between model variables can be easily described using
graphical models, and the dependencies can be precisely characterized using these
representations. The generative context-based implementations are extensions of
Latent Dirichlet Allocation, a leading topic modeling approach that is very
effective at reducing the dimensionality of the data. The discriminative
context-based approach extends Conditional Random Fields, which allow efficient and
precise construction of the model by specifying and including only cases that are
related and influence it.
The dataset used for training and evaluation is MIT SUN397. The results of the
experiments show an overall 15% increase in annotation accuracy and a 31%
improvement in the semantical saliency of the annotated concepts.
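The idea of refining detector output with co-occurrence context can be sketched in a few lines. This toy example is not the dissertation's LDA or CRF models: it simply scores every joint labeling by the detectors' log-scores plus pairwise log co-occurrence priors and keeps the best one. The function name `best_labeling`, the `weight` and `floor` parameters, and all numbers are assumptions for illustration, and detector scores are assumed to be strictly positive.

```python
import itertools
import math


def best_labeling(candidates, cooccurrence, weight=1.0, floor=1e-3):
    """Pick the jointly most consistent label for each detected region.

    candidates:   one dict per region, mapping label -> detector score in (0, 1].
    cooccurrence: dict mapping frozenset({label_a, label_b}) -> prior in (0, 1].
    weight:       how strongly context counts relative to the detector scores.
    floor:        prior used for label pairs missing from `cooccurrence`.
    """
    best, best_score = None, -math.inf
    # Brute force over all joint labelings; fine for a handful of regions.
    for labels in itertools.product(*(c.keys() for c in candidates)):
        score = sum(math.log(c[label]) for c, label in zip(candidates, labels))
        for a, b in itertools.combinations(labels, 2):
            prior = cooccurrence.get(frozenset((a, b)), floor)
            score += weight * math.log(prior)
        if score > best_score:
            best, best_score = labels, score
    return best


if __name__ == "__main__":
    # Region 1 looks slightly more like a car; region 2 is most likely water.
    candidates = [{"car": 0.55, "boat": 0.45}, {"water": 0.7, "road": 0.3}]
    cooccurrence = {
        frozenset(("boat", "water")): 0.9,
        frozenset(("car", "road")): 0.9,
        frozenset(("car", "water")): 0.05,
        frozenset(("boat", "road")): 0.05,
    }
    print(best_labeling(candidates, cooccurrence, weight=0.0))  # ('car', 'water')
    print(best_labeling(candidates, cooccurrence))              # ('boat', 'water')
```

With context switched off (`weight=0.0`) each region keeps its visually best label; with context on, the unlikely pairing of a car with water is replaced by the semantically compatible pair.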

Member of

FAU Theses and Dissertations

An evaluation of machine learning algorithms for tweet sentiment analysis

Model

Digital Document

Prusa, Joseph D.

Khoshgoftaar, Taghi M.

Publisher

Florida Atlantic University

Description

Sentiment analysis of tweets is an application of mining Twitter, and is growing
in popularity as a means of determining public opinion. Machine learning algorithms
are used to perform sentiment analysis; however, data quality issues such as high dimensionality, class imbalance or noise may negatively impact classifier performance.
Machine learning techniques exist for targeting these problems, but have not been
applied to this domain, or have not been studied in detail. In this thesis we discuss
research that has been conducted on tweet sentiment classification, its accompanying
data concerns, and methods of addressing these concerns. We test the impact
of feature selection, data sampling and ensemble techniques in an effort to improve
classifier performance. We also evaluate the combination of feature selection and
ensemble techniques and examine the effects of high dimensionality when combining
multiple types of features. Additionally, we provide strategies and insights for
potential avenues of future work.
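Of the techniques named above, data sampling is the simplest to show in isolation. The sketch below performs random undersampling, one common way to address class imbalance; it is an illustrative assumption rather than the specific sampling method evaluated in the thesis, and the function name `random_undersample` and the example tweets are hypothetical. It expects a non-empty dataset.

```python
import random
from collections import defaultdict


def random_undersample(examples, labels, seed=0):
    """Balance classes by randomly discarding examples from the larger classes.

    Returns a shuffled list of (example, label) pairs in which every class
    has as many examples as the smallest class in the input.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example, label in zip(examples, labels):
        by_class[label].append(example)
    # Every class is cut down to the size of the rarest class.
    target = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        for example in rng.sample(items, target):
            balanced.append((example, label))
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    tweets = ["great game", "love it", "so good", "best day", "awful service"]
    sentiment = ["pos", "pos", "pos", "pos", "neg"]
    for tweet, label in random_undersample(tweets, sentiment):
        print(label, tweet)
```

Undersampling trades data volume for balance, so it is usually compared against alternatives such as oversampling or ensembles trained on several balanced subsets.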

Member of

FAU Theses and Dissertations
