Khoshgoftaar, Taghi M.

Relationships
Member of: Thesis advisor
Person Preferred Name
Khoshgoftaar, Taghi M.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Class imbalance is a frequent problem found in bioinformatics datasets. Unfortunately,
the minority class is usually also the class of interest. One of the methods to improve this
situation is data sampling. There are a number of different data sampling methods, each with
its own strengths and weaknesses, which makes choosing one a difficult prospect. In our work
we compare three data sampling techniques (Random Undersampling, Random Oversampling,
and SMOTE) on six bioinformatics datasets with varying levels of class imbalance. Additionally,
we apply two different classifiers to the problem, 5-NN and SVM, and use feature selection to
reduce our datasets to 25 features prior to applying sampling. Our results show that there is very
little difference between the data sampling techniques, although Random Undersampling is the
most frequent top-performing data sampling technique for both of our classifiers. We also
performed statistical analysis, which confirms that there is no statistically significant difference between the
techniques. Therefore, our recommendation is to use Random Undersampling when choosing a
data sampling technique, because it is less computationally expensive to implement than
SMOTE and it also reduces the size of the dataset, which will improve subsequent computational
costs without sacrificing classification performance.
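The two random sampling techniques compared in the abstract above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the fixed seed, and the choice to balance every class to the same size are assumptions made here, since the abstract does not state the target class ratio. SMOTE is omitted because it synthesizes new minority examples by interpolation rather than resampling existing ones.

```python
import random


def _group_by_class(rows, labels):
    """Collect the rows belonging to each class label."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_undersample(rows, labels, seed=0):
    """Randomly discard rows until every class matches the smallest class.

    Balancing to equal class sizes is an assumption of this sketch.
    """
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, target):  # sample without replacement
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate rows until every class matches the largest class.

    Balancing to equal class sizes is an assumption of this sketch.
    """
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing rows with replacement from the existing members.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels
```

Undersampling returns fewer rows than it receives, while oversampling returns more; this difference in dataset size is what drives the difference in the cost of training the subsequent classifier.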
Model
Digital Document
Publisher
Florida Atlantic University
Description
With advances in data storage and data transmission technologies, and given
the increasing use of computers by both individuals and corporations, organizations
are accumulating an ever-increasing amount of information in data warehouses and
databases. The huge surge in data, however, has made the process of extracting useful,
actionable, and interesting knowledge from the data extremely difficult. In response
to the challenges posed by operating in a data-intensive environment, the fields of data
mining and machine learning (DM/ML) have successfully provided solutions to help
uncover knowledge buried within data.
DM/ML techniques use automated (or semi-automated) procedures to process
vast quantities of data in search of interesting patterns. DM/ML techniques do not
create knowledge; instead, the implicit assumption is that knowledge is present within
the data, and these procedures are needed to uncover interesting, important, and previously
unknown relationships. Therefore, the quality of the data is absolutely critical
in ensuring successful analysis. Having high quality data, i.e., data which is (relatively)
free from errors and suitable for use in data mining tasks, is a necessary precondition
for extracting useful knowledge.
In response to the important role played by data quality, this dissertation investigates
data quality and its impact on DM/ML. First, we propose several innovative
procedures for coping with low quality data. Another aspect of data quality, the occurrence
of missing values, is also explored. Finally, a detailed experimental evaluation
on learning from noisy and imbalanced datasets is provided, supplying valuable insight
into how class noise in skewed datasets affects learning algorithms.
Model
Digital Document
Publisher
Florida Atlantic University
Description
The Internet and computer networks have become an important part of our
organizations and everyday life. With the increase in our dependence on computers
and communication networks, malicious activities have become increasingly prevalent.
Network attacks are an important problem in today’s communication environments.
The network traffic must be monitored and analyzed to detect malicious activities
and attacks to ensure reliable functionality of the networks and security of users’
information. Recently, machine learning techniques have been applied toward the
detection of network attacks. Machine learning models are able to extract similarities
and patterns in the network traffic. Unlike signature based methods, there is no need
for manual analyses to extract attack patterns. Applying machine learning algorithms
can automatically build predictive models for the detection of network attacks.
This dissertation reports an empirical analysis of the usage of machine learning
methods for the detection of network attacks. For this purpose, we study the detection
of three common attacks in computer networks: SSH brute force, Man In The Middle
(MITM) and application layer Distributed Denial of Service (DDoS) attacks. Using
outdated and non-representative benchmark data, such as the DARPA dataset, in the intrusion detection domain has caused a practical gap between building detection
models and their actual deployment in a real computer network. To alleviate this
limitation, we collect representative network data from a real production network for
each attack type. Our analysis of each attack includes a detailed study of the usage
of machine learning methods for its detection. This includes the motivation behind
the proposed machine learning based detection approach, the data collection process,
feature engineering, building predictive models and evaluating their performance.
We also investigate the application of feature selection in building detection models
for network attacks. Overall, this dissertation presents a thorough analysis of how
machine learning techniques can be used to detect network attacks. We not only study
a broad range of network attacks, but also study the application of different machine
learning methods including classification, anomaly detection and feature selection for
their detection at the host level and the network level.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Developments in advanced technologies, such as DNA microarrays, have generated
tremendous amounts of data available to researchers in the field of bioinformatics.
These state-of-the-art technologies present not only unprecedented opportunities to
study biological phenomena of interest, but also significant challenges in terms of processing
the data. Furthermore, these datasets inherently exhibit a number of challenging
characteristics, such as class imbalance, high dimensionality, small dataset size, noisy
data, and complexity of data in terms of hard-to-distinguish decision boundaries
between classes within the data.
In recognition of the aforementioned challenges, this dissertation utilizes a variety
of machine-learning and data-mining techniques, such as ensemble classification
algorithms in conjunction with data sampling and feature selection techniques to alleviate
these problems, while improving the classification results of models built on
these datasets. However, in building classification models researchers and practitioners
encounter the challenge that there is not a single classifier that performs relatively
well in all cases. Thus, numerous classification approaches, such as ensemble learning
methods, have been developed to address this problem successfully in a majority of circumstances. Ensemble learning is a promising technique that generates multiple
classification models and then combines their decisions into a single final result.
Ensemble learning often performs better than single-base classifiers in performing
classification tasks.
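The idea of combining several models' decisions into a single final result can be illustrated with unweighted majority voting. This is only a sketch of one possible combination rule, chosen here for simplicity; the function name and input layout are hypothetical, and the abstract does not say which rule the dissertation uses. Boosting algorithms, for example, typically weight each model's vote instead.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model predictions into one label per instance.

    `predictions` is a list with one entry per model; each entry is a list
    of predicted labels, one per instance, all of the same length.
    Ties go to the label that appears first among the models' votes.
    """
    combined = []
    for votes in zip(*predictions):  # votes cast for a single instance
        winner, _count = Counter(votes).most_common(1)[0]
        combined.append(winner)
    return combined
```

With three models that each misclassify a different instance, the vote can still recover the correct label for every instance, which is the intuition behind an ensemble outperforming its individual members.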
This dissertation conducts thorough empirical research by implementing a series
of case studies to evaluate how ensemble learning techniques can be utilized to
enhance overall classification performance, as well as improve the generalization ability
of ensemble models. This dissertation investigates ensemble learning techniques
of the boosting, bagging, and random forest algorithms, and proposes a number of
modifications to the existing ensemble techniques in order to further improve the
classification results. This dissertation examines the effectiveness of ensemble learning
techniques in accounting for the challenging characteristics of class imbalance and
difficult-to-learn class decision boundaries. Next, it looks into ensemble methods
that are relatively tolerant to class noise and that not only account for the problem
of class noise but also improve classification performance. This dissertation also examines
the joint effects of data sampling and ensemble techniques to determine whether
sampling techniques can further improve the classification performance of the resulting ensemble
models.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Sentiment analysis of tweets is an application of mining Twitter, and is growing
in popularity as a means of determining public opinion. Machine learning algorithms
are used to perform sentiment analysis; however, data quality issues such as high dimensionality, class imbalance or noise may negatively impact classifier performance.
Machine learning techniques exist for targeting these problems, but have not been
applied to this domain, or have not been studied in detail. In this thesis we discuss
research that has been conducted on tweet sentiment classification, its accompanying
data concerns, and methods of addressing these concerns. We test the impact
of feature selection, data sampling and ensemble techniques in an effort to improve
classifier performance. We also evaluate the combination of feature selection and
ensemble techniques and examine the effects of high dimensionality when combining
multiple types of features. Additionally, we provide strategies and insights for
potential avenues of future work.
Model
Digital Document
Publisher
Florida Atlantic University
Description
In response to the massive amounts of data that make up a large number of bioinformatics datasets, it has become increasingly necessary for researchers to use computers to aid them in their endeavors. With difficulties such as high dimensionality, class imbalance, noisy data, and difficult-to-learn class boundaries present within the data, bioinformatics datasets are a challenge to work with. One potential source of assistance is the domain of data mining and machine learning, a field which focuses on working with these large amounts of data and develops techniques to discover new trends and patterns that are hidden within the data and to increase the capability of researchers and practitioners to work with this data. Within this domain there are techniques designed to eliminate irrelevant or redundant features, balance the membership of the classes, handle errors found in the data, and build predictive models for future data.
Model
Digital Document
Publisher
Florida Atlantic University
Description
One of the main applications of machine learning in bioinformatics is the construction of classification models which can accurately classify new instances using information gained from previous instances. With the help of machine learning algorithms (such as supervised classification and gene selection) new meaningful knowledge can be extracted from bioinformatics datasets that can help in disease diagnosis and prognosis as well as in prescribing the right treatment for a disease. One particular challenge encountered when analyzing bioinformatics datasets is data noise, which refers to incorrect or missing values in datasets. Noise can be introduced as a result of experimental errors (e.g. faulty microarray chips, insufficient resolution, image corruption, and incorrect laboratory procedures), as well as other errors (errors
during data processing, transfer, and/or mining). A special type of data noise
is class noise, which occurs when an instance/example is mislabeled. Previous
research showed that class noise has a detrimental impact on machine learning algorithms (e.g. worsened classification performance and unstable feature selection). In
addition to data noise, gene expression datasets can suffer from the problems of high
dimensionality (a very large feature space) and class imbalance (unequal distribution
of instances between classes). As a result of these inherent problems, constructing accurate classification models becomes more challenging.
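One common way to study class noise as defined above, that is, mislabeled instances, is to corrupt the labels of a clean dataset on purpose and observe how a learner degrades. The sketch below is a hypothetical illustration of that idea; the function name, the `noise_rate` parameter, and the uniform random choice of which labels to flip are assumptions, not the procedure used in the thesis.

```python
import random


def inject_class_noise(labels, noise_rate, seed=0):
    """Return a copy of `labels` with a fraction of entries mislabeled.

    `noise_rate` is the fraction of instances whose label is changed to a
    different class. At least two distinct classes are required whenever
    any label has to be flipped.
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    n_flip = round(noise_rate * len(labels))
    flip_indices = rng.sample(range(len(labels)), n_flip)
    noisy = list(labels)
    for i in flip_indices:
        # Pick any class other than the true one for this instance.
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy
```

Because the instances to corrupt are drawn uniformly, a minority class in an imbalanced dataset loses a larger share of its already few correct labels for each instance flipped, which is one reason noise and imbalance are often examined together.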
Model
Digital Document
Publisher
Florida Atlantic University
Description
In recent years more and more researchers have begun to use data mining and
machine learning tools to analyze gene microarray data. In this thesis we have collected a
selection of datasets revolving around prediction of patient response in the specific area
of breast cancer treatment. The datasets collected in this paper are all obtained from gene
chips, which have become the industry standard in measurement of gene expression. In
this thesis we will discuss the methods and procedures used in the studies to analyze the
datasets and their effects on treatment prediction with a particular interest in the selection
of genes for predicting patient response. We will also analyze the datasets on our own in
a uniform manner to determine the validity of these datasets in terms of learning potential
and provide strategies for future work which explore how to best identify gene signatures.
Model
Digital Document
Publisher
Florida Atlantic University
Description
In the field of machine prognostics, vibration analysis is a proven method for
detecting and diagnosing bearing faults in rotating machines. One popular method
for interpreting vibration signals is envelope demodulation, which allows a technician
to clearly identify an impulsive fault source and its severity. However, incipient faults (faults in early stages) are masked by in-band noise, which can make the associated impulses difficult to detect and interpret. In this thesis, Wavelet De-Noising (WDN) is implemented after envelope demodulation to improve the accuracy of bearing fault diagnostics. This contrasts with the typical approach of de-noising as a preprocessing step.
When manually measuring time-domain impulse amplitudes, the algorithm
shows varying improvements in Signal-to-Noise Ratio (SNR) relative to background
vibrational noise. A frequency-domain measure of SNR agrees with this result.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Uninterruptible Power Supply (UPS) systems have become essential to modern
industries that require continuous power supply to manage critical operations. Since a
failure of a single battery will affect the entire backup system, UPS systems providers
must replace any battery before it runs dead. In this regard, automated monitoring tools
are required to determine when a battery needs replacement. Nowadays, a primitive
method for monitoring the battery backup system is being used for this task. This thesis
presents a classification model that uses data mining cleansing and processing techniques
to remove useless information from the data obtained from the sensors installed in the
batteries in order to improve the quality of the data and determine at a given moment in
time if a battery should be replaced or not. This prediction model will help UPS systems
providers increase the efficiency of battery monitoring procedures.