Khoshgoftaar, Taghi M.

Relationships
Member of: Thesis advisor
Person Preferred Name
Khoshgoftaar, Taghi M.
Model
Digital Document
Publisher
Florida Atlantic University
Description
One of the defining characteristics of the modern Internet is its massive connectedness,
with information and human connection simply a few clicks away. Social
media and online retailers have revolutionized how we communicate and purchase
goods or services. User generated content on the web, through social media, plays
a large role in modern society; Twitter has been in the forefront of political discourse,
with politicians choosing it as their platform for disseminating information,
while websites like Amazon and Yelp allow users to share their opinions on products
via online reviews. The information available through these platforms can provide
insight into a host of relevant topics through the process of machine learning. Specifically, this process involves text mining for sentiment analysis, which is an application
domain of machine learning involving the extraction of emotion from text.
Unfortunately, there are still those with malicious intent, and as the ways we communicate and conduct business change, so do their malicious practices. Social bots and fake reviews plague the web, providing incorrect information and swaying the opinion of unaware readers. Detecting these false users or posts by reading the text is difficult, if not impossible, for humans. Fortunately, text mining provides us with methods for the detection of harmful user-generated
content.
This dissertation expands the current research in sentiment analysis, fake online
review detection and election prediction. We examine cross-domain sentiment
analysis using tweets and reviews. Novel techniques combining ensemble and feature
selection methods are proposed for the domain of online spam review detection. We
investigate the ability of the Twitter platform to predict the United States 2016 presidential election. In addition, we determine how social bots influence this prediction.
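As a deliberately simplistic illustration of extracting emotion from text, the sketch below scores text against a small polarity word list. The lexicon and example phrase are invented, and the dissertation's approach learns such signals with machine learning rather than a fixed list.

```python
# Minimal lexicon-based sentiment scorer (illustrative only; the research
# described above uses learned models, not a fixed word list).

POSITIVE = {"great", "love", "excellent", "good", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "poor"}

def sentiment(text):
    """Return 'positive', 'negative', or 'neutral' from polarity word counts."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

label = sentiment("I love this product, it is excellent")
```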
Model
Digital Document
Publisher
Florida Atlantic University
Description
Deep Learning is an increasingly important subdomain of artificial intelligence.
Deep Learning architectures, artificial neural networks characterized by both
a large breadth of neurons and a large depth of layers, benefit from training on Big
Data. The size and complexity of the model combined with the size of the training
data makes the training procedure very computationally and temporally expensive.
Accelerating the training procedure of Deep Learning using cluster computers faces
many challenges ranging from distributed optimizers to the large communication overhead
specific to a system with off-the-shelf networking components. In this thesis, we
present a novel synchronous data parallel distributed Deep Learning implementation
on HPCC Systems, a cluster computer system. We discuss research that has been
conducted on the distribution and parallelization of Deep Learning, as well as the
concerns relating to cluster environments. Additionally, we provide case studies that
evaluate and validate our implementation.
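The synchronous data-parallel scheme can be sketched in plain Python: each worker computes a gradient on its own data shard, the gradients are averaged (the all-reduce step), and every worker applies the identical update. The toy linear model, learning rate, and shard layout below are invented for illustration; this is not the HPCC Systems implementation.

```python
# Sketch of synchronous data-parallel training: per step, every worker computes
# a local gradient on its shard, the gradients are averaged (all-reduce), and
# all workers apply the same update, keeping their models identical.

def local_gradient(w, shard):
    """Mean-squared-error gradient for the 1-D model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def train_sync(shards, steps=200, lr=0.001):
    w = 0.0
    for _ in range(steps):
        grads = [local_gradient(w, s) for s in shards]  # parallel on a real cluster
        avg = sum(grads) / len(grads)                   # synchronous all-reduce
        w -= lr * avg                                   # identical update everywhere
    return w

# Data drawn from y = 3x, striped across 4 simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 21)]
shards = [data[i::4] for i in range(4)]
w = train_sync(shards)
```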
Model
Digital Document
Publisher
Florida Atlantic University
Description
Digital videos and images are effective media for capturing spatial and temporal
information in the real world. The rapid growth of digital videos has motivated
research aimed at developing effective algorithms, with the objective of obtaining useful
information for a variety of application areas, such as security, commerce, medicine,
geography, etc. This dissertation presents innovative and practical techniques, based on
statistics and machine learning, that address some key research problems in video and
image analysis, including video stabilization, object classification, image segmentation,
and video indexing.
A novel unsupervised multi-scale color image segmentation algorithm is proposed.
The basic idea is to apply mean shift clustering to obtain an over-segmentation, and
then merge regions at multiple scales to minimize the minimum description length (MDL) criterion. The performance
on the Berkeley segmentation benchmark compares favorably with some existing approaches.
This algorithm can also operate on one-dimensional feature vectors representing
each frame in ocean survey videos, which results in a novel framework for building
a hierarchical video index. The advantage is to give the user the flexibility of browsing
the videos at arbitrary levels of detail, making it more efficient to locate interesting
information in a long video via the hierarchical index. Also, an empirical study on classification of ships in surveillance
videos is presented. A comparative performance study on three classification algorithms is
conducted. Based on this study, an effective feature extraction and classification algorithm
for classifying ships in coastline surveillance videos is proposed. Finally, an empirical
study on video stabilization is presented, which includes a comparative performance study
on four motion estimation methods and three motion correction methods. Based on this
study, an effective real-time video stabilization algorithm for coastline surveillance is
proposed, which involves a novel approach to reduce error accumulation.
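As a minimal illustration of the mean-shift step behind the over-segmentation stage, the sketch below runs mean shift on hypothetical 1-D data: each point repeatedly moves to the mean of its neighbors within a bandwidth, drifting toward a local density mode. The dissertation's algorithm operates on color images and adds the multi-scale MDL-based merging on top.

```python
# Toy 1-D mean shift (illustrative sketch only; the dissertation applies mean
# shift to color images and then merges regions under an MDL criterion).

def mean_shift_1d(points, bandwidth, iters=50):
    """Repeatedly move each point to the mean of the points within `bandwidth`."""
    modes = list(points)
    for _ in range(iters):
        new_modes = []
        for m in modes:
            window = [q for q in points if abs(q - m) <= bandwidth]
            new_modes.append(sum(window) / len(window))  # window includes m itself
        modes = new_modes
    return modes

# Two well-separated clusters; each point drifts to its cluster's density mode.
pts = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7]
modes = mean_shift_1d(pts, bandwidth=2.0)
```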
Model
Digital Document
Publisher
Florida Atlantic University
Description
In this dissertation we address two significant issues of concern. These are software
quality modeling and data quality assessment. Software quality can be measured by software
reliability. Reliability is often measured in terms of the time between system failures. A
failure is caused by a fault which is a defect in the executable software product. The time
between system failures depends both on the presence and the usage pattern of the software.
Finding faulty components in the development cycle of a software system can lead to a
more reliable final system and will reduce development and maintenance costs. The issue of
software quality is investigated by proposing a new approach, rule-based classification model
(RBCM) that uses rough set theory to generate decision rules to predict software quality.
The new model minimizes over-fitting by balancing the Type I and Type II misclassification
error rates. We also propose a model selection technique for rule-based models called
rule-based model selection (RBMS). The proposed rule-based model selection technique utilizes
the complete and partial matching rule sets of candidate RBCMs to determine the model
with the least amount of over-fitting. In the experiments that were performed, the RBCMs
were effective at identifying faulty software modules, and the RBMS technique was able to
identify RBCMs that minimized over-fitting. Good data quality is a critical component for building effective software quality models.
We address the significance of the quality of data on the classification performance of learners
by conducting a comprehensive comparative study. Several trends were observed in the
experiments. Class and attribute noise had the greatest impact on the performance of learners
when they occurred simultaneously in the data. Class noise had a significant impact on the
performance of learners, while attribute noise had no impact when it occurred in less than
40% of the most significant independent attributes. Random Forest (RF100), a group of 100
decision trees, was the most accurate and robust learner in all the experiments with noisy
data.
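The class-noise experiments corrupt data by flipping a fraction of the class labels. A sketch of that injection step, with invented binary labels and noise level:

```python
import random

# Sketch of class-noise injection: flip the labels of a chosen fraction of
# instances. The binary labels and 10% noise rate are illustrative.

def inject_class_noise(labels, noise_rate, seed=0):
    """Return a copy of binary `labels` with `noise_rate` of them flipped."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = int(round(noise_rate * len(labels)))
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy

labels = [0] * 80 + [1] * 20
noisy = inject_class_noise(labels, noise_rate=0.10)
changed = sum(a != b for a, b in zip(labels, noisy))
```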
Model
Digital Document
Publisher
Florida Atlantic University
Description
Recently, most of the research pertaining to Service-Oriented Architecture (SOA) has
been based on web services and how secure they are in terms of efficiency and
effectiveness. This requires validation, verification, and evaluation of web services.
Verification and validation should be collaborative when web services from different
vendors are integrated together to carry out a coherent task. For this purpose, novel
model checking technologies have been devised and applied to web services. "Model
Checking" is a promising technique for verification and validation of software
systems. WS-BPEL (Business Process Execution Language for Web Services) is an
emerging standard language to describe web service composition behavior. The
advanced features of BPEL such as concurrency and hierarchy make it challenging to
verify BPEL models. Given these factors, this thesis surveys several important model checking technologies (tools) and compares them on their
"functional" and "non-functional" properties. The comparison is based on three case
studies (first being the small case, second medium and the third one a large case)
where we construct synthetic web service compositions for each case (as there are not
many publicly available compositions [1]). The first case study is the "Enhanced Loan
Approval Process" and is considered a small case. The second is the "Enhanced Purchase
Order Process" which is of medium size and the third, and largest is based on a
scientific workflow pattern, called the "Service Oriented Architecture Implementing
BOINC Workflow" based on the BOINC (Berkeley Open Infrastructure for Network
Computing) architecture.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Skewed or imbalanced data presents a significant problem for many standard learners which focus on optimizing the overall classification accuracy. When the class distribution is skewed, priority is given to classifying examples from the majority class, at the expense of the often more important minority class. The random forest (RF) classification algorithm, which is a relatively new learner with appealing theoretical properties, has received almost no attention in the context of skewed datasets. This work presents a comprehensive suite of experimentation evaluating the effectiveness of random forests for learning from imbalanced data. Reasonable parameter settings (for the Weka implementation) for ensemble size and number of random features selected are determined through experimentation on 10 datasets. Further, the application of seven different data sampling techniques that are common methods for handling imbalanced data, in conjunction with RF, is also assessed. Finally, RF is benchmarked against 10 other commonly-used machine learning algorithms, and is shown to provide very strong performance. A total of 35 imbalanced datasets are used, and over one million classifiers are constructed in this work.
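Of the sampling techniques referred to above, random undersampling is among the simplest: majority-class examples are discarded until a target class ratio is reached. A sketch under invented labels and ratio:

```python
import random

# Sketch of random undersampling, a common data sampling technique for
# imbalanced data: discard majority-class examples until a target ratio is met.
# The labels and 1:1 target ratio are illustrative.

def undersample(data, minority_label, ratio=1.0, seed=0):
    """Keep all minority examples plus ratio * |minority| majority examples."""
    rng = random.Random(seed)
    minority = [d for d in data if d[1] == minority_label]
    majority = [d for d in data if d[1] != minority_label]
    keep = rng.sample(majority, min(len(majority), int(ratio * len(minority))))
    return minority + keep

# A 95:5 imbalance reduced to a 1:1 class distribution.
data = [(i, 0) for i in range(95)] + [(i, 1) for i in range(5)]
balanced = undersample(data, minority_label=1, ratio=1.0)
```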
Model
Digital Document
Publisher
Florida Atlantic University
Description
Ordinal classification refers to an important category of real world problems,
in which the attributes of the instances to be classified and the classes are
linearly ordered. Many applications of machine learning frequently involve
situations exhibiting an order among the different categories represented by
the class attribute. In ordinal classification the class value is converted into a
numeric quantity and regression algorithms are applied to the transformed
data. The data is later translated back into a discrete class value in a post-processing
step. This thesis is devoted to an empirical study of ordinal and
non-ordinal classification algorithms for intrusion detection in WLANs. We
used ordinal classification in conjunction with nine classifiers for the
experiments in this thesis. All classifiers are part of the WEKA machine-learning
workbench. The results indicate that most of the classifiers give
similar or better results with ordinal classification compared to non-ordinal
classification.
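The ordinal scheme described above (encode the ordered classes as numbers, fit a regression model, then translate predictions back to discrete classes in a post-processing step) can be sketched as follows. The tiny 1-D least-squares regressor and three-level classes are stand-ins for the WEKA learners used in the thesis.

```python
# Sketch of ordinal classification via regression: encode ordered classes as
# numbers, fit a regressor, then round predictions back to a class.

CLASSES = ["low", "medium", "high"]            # linearly ordered class values
TO_NUM = {c: i for i, c in enumerate(CLASSES)}

def fit_linear(train):
    """Ordinary least squares for one feature: numeric class ~ a * x + b."""
    xs = [x for x, _ in train]
    ys = [TO_NUM[c] for _, c in train]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def predict(model, x):
    a, b = model
    y = a * x + b                                  # regression step
    y = min(len(CLASSES) - 1, max(0, round(y)))    # post-processing: back to a class
    return CLASSES[y]

train = [(1.0, "low"), (2.0, "low"), (5.0, "medium"), (6.0, "medium"),
         (9.0, "high"), (10.0, "high")]
model = fit_linear(train)
pred = predict(model, 5.4)
```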
Model
Digital Document
Publisher
Florida Atlantic University
Description
Class imbalance tends to cause inferior performance in data mining learners,
particularly with regard to predicting the minority class, which generally imposes
a higher misclassification cost. This work explores the benefits of using genetic
algorithms (GA) to develop classification models which are better able to deal with
the problems encountered when mining datasets which suffer from class imbalance.
Using GA we evolve configuration parameters suited for skewed datasets for three
different learners: artificial neural networks, C4.5 decision trees, and RIPPER. We
also propose a novel technique called evolutionary sampling which works to remove
noisy and unnecessary duplicate instances so that the sampled training data will
produce a superior classifier for the imbalanced dataset. Our GA fitness function
uses metrics appropriate for dealing with class imbalance, in particular the area
under the ROC curve. We perform extensive empirical testing on these techniques
and compare the results with seven existing sampling methods.
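The fitness function above scores candidates by the area under the ROC curve. AUC can be computed directly from predicted scores using the rank (Mann-Whitney) formulation, sketched here with made-up scores and labels:

```python
# Sketch of an AUC fitness of the kind used to guide a GA: the probability that
# a randomly chosen positive instance is scored above a randomly chosen
# negative one (Mann-Whitney formulation; ties count half). Data is made up.

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
perfect = auc([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], labels)   # positives ranked first
random_ = auc([0.5, 0.5, 0.5, 0.5, 0.5, 0.5], labels)   # no discrimination
```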
Model
Digital Document
Publisher
Florida Atlantic University
Description
A variety of classifiers for solving classification problems is available from
the domain of machine learning. Commonly used classifiers include support vector
machines, decision trees and neural networks. These classifiers can be configured
by modifying internal parameters. The large number of available classifiers and
the different configuration possibilities result in a large number of combinations of
classifier and configuration settings, leaving the practitioner with the problem of
evaluating the performance of different classifiers. This problem can be solved by
using performance metrics. However, the large number of available metrics causes
difficulty in deciding which metrics to use and when comparing classifiers on the
basis of multiple metrics. This paper uses the statistical method of factor analysis
in order to investigate the relationships between several performance metrics and
introduces the concept of relative performance, which has the potential to ease the
process of comparing several classifiers. The relative performance metric is also
used to evaluate different support vector machine classifiers and to determine if the
default settings in the Weka data mining tool are reasonable.
Model
Digital Document
Publisher
Florida Atlantic University
Description
A traditional machine learning environment is characterized by the training
and testing data being drawn from the same domain, therefore, having similar distribution
characteristics. In contrast, a transfer learning environment is characterized
by the training data having different distribution characteristics from the testing
data. Previous research on transfer learning has focused on the development and
evaluation of transfer learning algorithms using real-world datasets. Testing with
real-world datasets exposes an algorithm to a limited number of data distribution
differences and does not exercise an algorithm's full capability and boundary limitations.
In this research, we define, implement, and deploy a transfer learning test
framework to test machine learning algorithms. The transfer learning test framework
is designed to create a wide range of distribution differences that are typically encountered
in a transfer learning environment. By testing with many different distribution
differences, an algorithm's strong and weak points can be discovered and evaluated
against other algorithms.
This research additionally performs case studies that use the transfer learning
test framework. The first case study focuses on measuring the impact of exposing algorithms to the Domain Class Imbalance distortion profile. The next case study
uses the entire transfer learning test framework to evaluate both transfer learning
and traditional machine learning algorithms. The final case study uses the transfer
learning test framework in conjunction with real-world datasets to measure the impact
of the base traditional learner on the performance of transfer learning algorithms.
Two additional experiments are performed that focus on unique real-world
datasets. The first experiment uses transfer learning techniques to predict
fraudulent Medicare claims. The second experiment uses a heterogeneous transfer
learning method to predict phishing webpages. These case studies will be of interest to
researchers who develop and improve transfer learning algorithms. This research will
also be of benefit to machine learning practitioners in the selection of high-performing
transfer learning algorithms.
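As an illustration of the Domain Class Imbalance distortion mentioned in the first case study, the sketch below resamples a labeled pool so that the source (training) and target (testing) domains receive different class proportions; the pool, sizes, and positive rates are invented.

```python
import random

# Sketch of a "Domain Class Imbalance" distortion: resample one labeled pool so
# the source (training) and target (testing) domains get different class
# proportions. Pool contents, sizes, and rates are invented for illustration.

def sample_with_ratio(pool, n, positive_rate, seed=0):
    """Draw n examples with the requested fraction of positive (label 1) ones."""
    rng = random.Random(seed)
    pos = [d for d in pool if d[1] == 1]
    neg = [d for d in pool if d[1] == 0]
    n_pos = int(round(n * positive_rate))
    return rng.sample(pos, n_pos) + rng.sample(neg, n - n_pos)

pool = [(i, 1) for i in range(500)] + [(i, 0) for i in range(500)]
source = sample_with_ratio(pool, 100, positive_rate=0.50)  # balanced training domain
target = sample_with_ratio(pool, 100, positive_rate=0.05)  # skewed testing domain
```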