Machine learning.

Model
Digital Document
Publisher
Florida Atlantic University
Description
One of the defining characteristics of the modern Internet is its massive connectedness,
with information and human connection simply a few clicks away. Social
media and online retailers have revolutionized how we communicate and purchase
goods or services. User generated content on the web, through social media, plays
a large role in modern society; Twitter has been in the forefront of political discourse,
with politicians choosing it as their platform for disseminating information,
while websites like Amazon and Yelp allow users to share their opinions on products
via online reviews. The information available through these platforms can provide
insight into a host of relevant topics through the process of machine learning. Specifically,
this process involves text mining for sentiment analysis, which is an application
domain of machine learning involving the extraction of emotion from text.
Unfortunately, there are still those with malicious intent, and with the changes
to how we communicate and conduct business come changes to their malicious practices.
Social bots and fake reviews plague the web, providing incorrect information
and swaying the opinion of unaware readers. The detection of these false users or
posts from reading the text is difficult, if not impossible, for humans. Fortunately,
text mining provides us with methods for the detection of harmful user generated
content.
This dissertation expands the current research in sentiment analysis, fake online
review detection and election prediction. We examine cross-domain sentiment
analysis using tweets and reviews. Novel techniques combining ensemble and feature
selection methods are proposed for the domain of online spam review detection. We
investigate the ability of the Twitter platform to predict the 2016 United States presidential
election. In addition, we determine how social bots influence this prediction.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Deep Learning is an increasingly important subdomain of artificial intelligence.
Deep Learning architectures, artificial neural networks characterized by having both
a large breadth of neurons and a large depth of layers, benefit from training on Big
Data. The size and complexity of the model combined with the size of the training
data make the training procedure very computationally and temporally expensive.
Accelerating the training procedure of Deep Learning using cluster computers faces
many challenges ranging from distributed optimizers to the large communication overhead
specific to a system with off-the-shelf networking components. In this thesis, we
present a novel synchronous data parallel distributed Deep Learning implementation
on HPCC Systems, a cluster computer system. We discuss research that has been
conducted on the distribution and parallelization of Deep Learning, as well as the
concerns relating to cluster environments. Additionally, we provide case studies that
evaluate and validate our implementation.
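
The synchronous data-parallel scheme named in this abstract can be illustrated with a minimal sketch. It is not the HPCC Systems implementation: the linear model, the function names, the learning rate, and the in-process loop standing in for cluster workers are all assumptions chosen to keep the example self-contained. The point it shows is the synchronization step, where every worker's gradient is averaged before a single shared update is applied.

```python
import random


def gradient(w, b, shard):
    """Mean-squared-error gradient of y ~ w*x + b over one worker's shard."""
    n = len(shard)
    gw = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    gb = sum(2 * (w * x + b - y) for x, y in shard) / n
    return gw, gb


def sync_data_parallel_sgd(data, num_workers=4, lr=0.1, steps=500):
    """Each step: every worker computes a gradient on its own shard, the
    gradients are averaged (the synchronization point), and one shared
    update is applied to the single copy of the parameters."""
    shards = [data[i::num_workers] for i in range(num_workers)]
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Sequential stand-in for workers that would run in parallel.
        grads = [gradient(w, b, shard) for shard in shards]
        gw = sum(g[0] for g in grads) / num_workers
        gb = sum(g[1] for g in grads) / num_workers
        w -= lr * gw
        b -= lr * gb
    return w, b


# Hypothetical noise-free training data drawn from y = 3x + 1.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(200)]
data = [(x, 3.0 * x + 1.0) for x in xs]
w, b = sync_data_parallel_sgd(data)
```

Because every worker waits for the averaged gradient, the result matches single-machine gradient descent on the full batch when the shards are equally sized; the cost on a real cluster is the communication needed for that averaging at every step.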
Model
Digital Document
Publisher
Florida Atlantic University
Description
A traditional machine learning environment is characterized by the training
and testing data being drawn from the same domain and therefore having similar distribution
characteristics. In contrast, a transfer learning environment is characterized
by the training data having different distribution characteristics from the testing
data. Previous research on transfer learning has focused on the development and
evaluation of transfer learning algorithms using real-world datasets. Testing with
real-world datasets exposes an algorithm to a limited number of data distribution
differences and does not exercise an algorithm's full capability and boundary limitations.
In this research, we define, implement, and deploy a transfer learning test
framework to test machine learning algorithms. The transfer learning test framework
is designed to create a wide range of distribution differences that are typically encountered
in a transfer learning environment. By testing with many different distribution
differences, an algorithm's strong and weak points can be discovered and evaluated
against other algorithms.
This research additionally performs case studies that use the transfer learning
test framework. The first case study focuses on measuring the impact of exposing
algorithms to the Domain Class Imbalance distortion profile. The next case study
uses the entire transfer learning test framework to evaluate both transfer learning
and traditional machine learning algorithms. The final case study uses the transfer
learning test framework in conjunction with real-world datasets to measure the impact
of the base traditional learner on the performance of transfer learning algorithms.
Two additional experiments are performed that are focused on using unique real-world
datasets. The first experiment uses transfer learning techniques to predict
fraudulent Medicare claims. The second experiment uses a heterogeneous transfer
learning method to predict phishing webpages. These case studies will be of interest to
researchers who develop and improve transfer learning algorithms. This research will
also be of benefit to machine learning practitioners in the selection of high-performing
transfer learning algorithms.
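
One way to picture a distribution difference such as domain class imbalance is to resample a labeled pool so that the training (source) and testing (target) sets have different class ratios. The sketch below is only an illustration under that assumption; the function name `resample_to_class_ratio`, the synthetic pool, and the chosen ratios are hypothetical and are not taken from the test framework described above.

```python
import random


def resample_to_class_ratio(samples, positive_fraction, size, seed=0):
    """Draw `size` labeled samples (with replacement) so that the share of
    label-1 samples is `positive_fraction`, up to rounding."""
    rng = random.Random(seed)
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    n_pos = round(size * positive_fraction)
    drawn = rng.choices(positives, k=n_pos) + rng.choices(negatives, k=size - n_pos)
    rng.shuffle(drawn)
    return drawn


# Hypothetical pool of (feature, label) pairs with balanced classes.
pool = [(i, i % 2) for i in range(1000)]

# Source domain keeps the classes balanced; target domain is 10% positive.
source = resample_to_class_ratio(pool, positive_fraction=0.5, size=400, seed=1)
target = resample_to_class_ratio(pool, positive_fraction=0.1, size=400, seed=2)
```

Training on `source` and evaluating on `target` then exposes a learner to a controlled class-ratio mismatch, and sweeping `positive_fraction` gives a range of mismatches rather than the single one a real-world dataset would provide.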
Model
Digital Document
Publisher
Florida Atlantic University
Description
Identifying and tracking individuals affected by the Ebola virus in densely
populated areas is a unique and urgent challenge in the public health sector.
Currently, mapping the spread of the Ebola virus is done manually; however, with
the help of social contact networks we can model dynamic graphs and predictive
diffusion models of the Ebola virus based on the impact on either a specific person or
a specific community.
With the help of this model, we can make more precise forward
predictions of disease propagation and identify possibly infected
individuals, which will help perform trace-back analysis to locate the possible
source of infection for a social group. This model will visualize and identify the
families and tightly connected social groups who have had contact with an Ebola
patient and is a proactive approach to reducing the risk of exposure to Ebola
within a community or geographic location.
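
Identifying the people who have had direct or indirect contact with a patient can be sketched as a breadth-first search over a social contact network. The example below is a minimal illustration, not the model described above: the graph, the names in it, and the `max_hops` cutoff are all hypothetical.

```python
from collections import deque


def contacts_within(graph, patient, max_hops):
    """Breadth-first search returning {person: hop distance} for everyone
    reachable from `patient` in at most `max_hops` contact links."""
    distance = {patient: 0}
    queue = deque([patient])
    while queue:
        person = queue.popleft()
        if distance[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(person, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[person] + 1
                queue.append(neighbor)
    del distance[patient]  # report contacts only, not the patient
    return distance


# Hypothetical undirected contact network stored as adjacency lists.
contact_graph = {
    "patient": ["spouse", "coworker"],
    "spouse": ["patient", "child"],
    "coworker": ["patient", "friend"],
    "child": ["spouse"],
    "friend": ["coworker", "neighbor"],
    "neighbor": ["friend"],
}
at_risk = contacts_within(contact_graph, "patient", max_hops=2)
```

Here `at_risk` contains the spouse and coworker at one hop and the child and friend at two hops, while the neighbor, three hops away, is excluded. Running the same search from a newly diagnosed person toward earlier cases is one simple form of trace-back.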
Model
Digital Document
Publisher
Florida Atlantic University
Description
The Internet and computer networks have become an important part of our
organizations and everyday life. With the increase in our dependence on computers
and communication networks, malicious activities have become increasingly prevalent.
Network attacks are an important problem in today’s communication environments.
The network traffic must be monitored and analyzed to detect malicious activities
and attacks to ensure reliable functionality of the networks and security of users’
information. Recently, machine learning techniques have been applied toward the
detection of network attacks. Machine learning models are able to extract similarities
and patterns in the network traffic. Unlike signature-based methods, there is no need
for manual analyses to extract attack patterns. Applying machine learning algorithms
can automatically build predictive models for the detection of network attacks.
This dissertation reports an empirical analysis of the usage of machine learning
methods for the detection of network attacks. For this purpose, we study the detection
of three common attacks in computer networks: SSH brute force, Man In The Middle
(MITM) and application layer Distributed Denial of Service (DDoS) attacks. Using
outdated and non-representative benchmark data, such as the DARPA dataset, in the
intrusion detection domain has caused a practical gap between building detection
models and their actual deployment in a real computer network. To alleviate this
limitation, we collect representative network data from a real production network for
each attack type. Our analysis of each attack includes a detailed study of the usage
of machine learning methods for its detection. This includes the motivation behind
the proposed machine learning based detection approach, the data collection process,
feature engineering, building predictive models and evaluating their performance.
We also investigate the application of feature selection in building detection models
for network attacks. Overall, this dissertation presents a thorough analysis of how
machine learning techniques can be used to detect network attacks. We not only study
a broad range of network attacks, but also study the application of different machine
learning methods including classification, anomaly detection and feature selection for
their detection at the host level and the network level.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Effective decision support plays a vital role in people's daily lives, as well as for
professional practitioners such as health care providers. Without correct information
and timely derived knowledge, a decision is often suboptimal and may result in significant
financial loss or compromised performance. In this dissertation, we
study text mining and topic modeling and propose to use text mining methods, in
combination with topic models, to discover knowledge from texts popularly available
from a wide variety of sources, such as research publications, news, medical diagnosis
notes, and further employ discovered knowledge to assist social and medical decision
support. Examples of such decisions include hospital patient readmission prediction,
which is a national initiative for health care cost reduction, academic research topics
discovery and trend modeling, and social preference modeling for friend recommendation
in social networks.
To carry out text mining, our research, in Chapter 3, first focuses on single-document
analysis to investigate textual stylometric features for user profiling and
recognition. Our research confirms that by using properly designed features, it is
possible to identify the author of an article, using a number of sample articles
written by the author as the training data. This study serves as the base to
assert that text mining is a powerful tool for capturing knowledge in texts for better
decision making.
In Chapter 4, we advance our research from single documents to documents
with interdependency relationships, and propose to model and predict citation
relationships between documents. Given a collection of documents with known linkage
relationships, our research will discover effective features to train prediction models,
and predict the likelihood of two documents being linked by a citation relationship. This
study will help accurately model social network linkage relationships, and can be used
to assist effective decision making for friend recommendation in social networking and
reference recommendation in scientific writing.
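
As a rough illustration of what features for a candidate document pair might look like, the sketch below computes two standard text-similarity measures, Jaccard overlap and cosine similarity of term counts. These particular features, the whitespace tokenizer, and the sample texts are assumptions made for the example; the abstract does not say which features the dissertation actually uses.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer that keeps alphabetic tokens only."""
    return [t for t in text.lower().split() if t.isalpha()]


def pair_features(doc_a, doc_b):
    """Two illustrative similarity features for a candidate document pair."""
    a, b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    shared = set(a) & set(b)
    union = set(a) | set(b)
    jaccard = len(shared) / len(union) if union else 0.0
    dot = sum(a[t] * b[t] for t in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    cosine = dot / norm if norm else 0.0
    return {"jaccard": jaccard, "cosine": cosine}


# Hypothetical document texts.
related = pair_features("topic models for text mining",
                        "text mining with topic models")
unrelated = pair_features("topic models for text mining",
                          "protein folding simulation")
```

Feature dictionaries like these, computed for pairs with known linkage, could serve as training rows for any binary classifier that outputs a link likelihood.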
In Chapter 5, we advance a topic discovery and trend prediction principle
to discover meaningful topics from a data collection, and further model the
evolution trend of each topic. By proposing techniques to discover topics from text,
and using temporal correlation between trends for prediction, our techniques can be
used to summarize a large collection of documents as meaningful topics, and further
forecast the popularity of a topic in the near future. This study can help design
systems to discover popular topics in social media, and further assist resource planning
and scheduling based on the discovered topics and their evolution trends.
In Chapter 6, we apply both text mining and topic modeling to the
medical domain for effective decision making. The goal is to discover knowledge from
medical notes to predict the risk of a patient being re-admitted in the near future.
Our research focuses on the challenge that re-admitted patients are only a small
portion of the patient population, although they bring significant financial loss. As
a result, the datasets are highly imbalanced, which often results in poor accuracy for
decision making. Our research proposes to use latent topic modeling to carry out
localized sampling, and to combine models trained from multiple copies of sampled data
for accurate prediction. This study can be directly used to assist hospital re-admission
assessment for early warning and decision support.
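
The idea of combining models trained on multiple sampled copies of imbalanced data can be sketched as follows. Note the substitution: this example uses plain random undersampling of the majority class, not the topic-model-guided localized sampling proposed in the dissertation, and the one-feature nearest-centroid model and the synthetic risk scores are hypothetical simplifications.

```python
import random


def train_centroid(samples):
    """Nearest-centroid model on one numeric feature: returns class means."""
    means = {}
    for label in (0, 1):
        values = [x for x, y in samples if y == label]
        means[label] = sum(values) / len(values)
    return means


def predict_centroid(means, x):
    """Predict the label whose class mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))


def undersampled_ensemble(samples, n_models=5, seed=0):
    """Train one model per balanced copy: all minority samples plus an
    equally sized random draw from the majority class."""
    rng = random.Random(seed)
    minority = [s for s in samples if s[1] == 1]
    majority = [s for s in samples if s[1] == 0]
    return [train_centroid(minority + rng.sample(majority, len(minority)))
            for _ in range(n_models)]


def vote(models, x):
    """Majority vote over the ensemble's predictions."""
    votes = sum(predict_centroid(m, x) for m in models)
    return 1 if 2 * votes > len(models) else 0


# Hypothetical data: 95 non-readmitted patients (label 0) with low scores,
# 5 readmitted patients (label 1) with high scores.
patients = [(i / 100 * 0.5, 0) for i in range(95)]
patients += [(0.8 + j * 0.02, 1) for j in range(5)]
models = undersampled_ensemble(patients)
```

Each balanced copy sees every minority example but only a slice of the majority class, so voting across copies recovers information that any single undersampled copy discards.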
The text mining and topic modeling techniques investigated in the dissertation
can be applied to many other domains, involving texts and social relationships,
towards pattern and knowledge based effective decision making.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Developments in advanced technologies, such as DNA microarrays, have generated
tremendous amounts of data available to researchers in the field of bioinformatics.
These state-of-the-art technologies present not only unprecedented opportunities to
study biological phenomena of interest, but significant challenges in terms of processing
the data. Furthermore, these datasets inherently exhibit a number of challenging
characteristics, such as class imbalance, high dimensionality, small dataset size, noisy
data, and complexity of data in terms of hard to distinguish decision boundaries
between classes within the data.
In recognition of the aforementioned challenges, this dissertation utilizes a variety
of machine-learning and data-mining techniques, such as ensemble classification
algorithms in conjunction with data sampling and feature selection techniques to alleviate
these problems, while improving the classification results of models built on
these datasets. However, in building classification models researchers and practitioners
encounter the challenge that there is not a single classifier that performs relatively
well in all cases. Thus, numerous classification approaches, such as ensemble learning
methods, have been developed to address this problem successfully in a majority of
circumstances. Ensemble learning is a promising technique that generates multiple
classification models and then combines their decisions into a single final result.
Ensemble learning often performs better than single-base classifiers in performing
classification tasks.
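
Generating multiple models and combining their decisions can be sketched with bagging, one of the ensemble families this dissertation investigates. The example is a minimal illustration only: the single-feature threshold classifier (a decision stump), the toy dataset, and the ensemble size are assumptions, and none of the dissertation's proposed modifications are reproduced.

```python
import random
from collections import Counter


def train_stump(samples):
    """Pick the threshold on a single numeric feature that minimizes
    training errors for the rule `predict 1 if x > threshold else 0`."""
    best_threshold, best_errors = None, None
    for threshold, _ in samples:
        errors = sum((x > threshold) != bool(y) for x, y in samples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagging(samples, n_models=11, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump(rng.choices(samples, k=len(samples)))
            for _ in range(n_models)]


def predict(thresholds, x):
    """Combine the stumps' decisions into one result by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical dataset: feature values 0.0 to 0.9, label 1 from 0.5 upward.
samples = [(x / 10, int(x >= 5)) for x in range(10)]
stumps = bagging(samples)
```

Using an odd number of models avoids ties in a two-class vote. Each bootstrap sample may place its threshold slightly differently, and the vote smooths over those differences.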
This dissertation conducts thorough empirical research by implementing a series
of case studies to evaluate how ensemble learning techniques can be utilized to
enhance overall classification performance, as well as improve the generalization ability
of ensemble models. This dissertation investigates ensemble learning techniques
of the boosting, bagging, and random forest algorithms, and proposes a number of
modifications to the existing ensemble techniques in order to improve further the
classification results. This dissertation examines the effectiveness of ensemble learning
techniques on accounting for challenging characteristics of class imbalance and
difficult-to-learn class decision boundaries. Next, it looks into ensemble methods
that are relatively tolerant to class noise, and that not only can account for the problem
of class noise, but also improve classification performance. This dissertation also examines
the joint effects of data sampling along with ensemble techniques on whether
sampling techniques can further improve classification performance of built ensemble
models.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Sentiment analysis of tweets is an application of mining Twitter, and is growing
in popularity as a means of determining public opinion. Machine learning algorithms
are used to perform sentiment analysis; however, data quality issues such as high
dimensionality, class imbalance or noise may negatively impact classifier performance.
Machine learning techniques exist for targeting these problems, but have not been
applied to this domain, or have not been studied in detail. In this thesis we discuss
research that has been conducted on tweet sentiment classification, its accompanying
data concerns, and methods of addressing these concerns. We test the impact
of feature selection, data sampling and ensemble techniques in an effort to improve
classifier performance. We also evaluate the combination of feature selection and
ensemble techniques and examine the effects of high dimensionality when combining
multiple types of features. Additionally, we provide strategies and insights for
potential avenues of future work.