Khoshgoftaar, Taghi M.

Relationships
Member of: Thesis advisor
Person Preferred Name
Khoshgoftaar, Taghi M.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Machine learning techniques such as deep neural networks have become an indispensable tool for a wide range of applications such as image classification, speech recognition, and sentiment analysis in text. An activation function is a mathematical equation that determines the output of each neuron in the neural network. In deep learning architectures the choice of activation functions is very important to the network’s performance. Activation functions determine the output of the model, its computational efficiency, and its ability to train and converge after multiple iterations of training epochs. The selection of an activation function is critical to building and training an effective and efficient neural network. In real-world applications of deep neural networks, the activation function is a hyperparameter. We have observed a lack of consensus on how to select a good activation function for a deep neural network, and that a specific function may not be suitable for all domain-specific applications.
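As an illustration of the definition above, the following minimal sketch shows how an activation function maps a neuron's weighted input to its output. The three functions (ReLU, sigmoid, tanh) are common textbook choices, not functions named in this abstract, and `neuron_output` is a hypothetical helper introduced only for this example.

```python
import math


def relu(x: float) -> float:
    """Rectified linear unit: passes positive inputs, zeroes out the rest."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Logistic sigmoid: squashes any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x: float) -> float:
    """Hyperbolic tangent: squashes any real input into the interval (-1, 1)."""
    return math.tanh(x)


def neuron_output(inputs, weights, bias, activation):
    """Hypothetical helper: weighted sum of the inputs plus a bias,
    passed through the chosen activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)
```

Swapping the `activation` argument changes the neuron's output for the same inputs and weights, which is why the abstract treats the choice as a hyperparameter.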
Model
Digital Document
Publisher
Florida Atlantic University
Description
The integrity of network communications is constantly being challenged by more sophisticated intrusion techniques. Attackers are shifting to stealthier and more complex forms of attacks in an attempt to bypass known mitigation strategies. Also, many detection methods for popular network attacks have been developed using outdated or non-representative attack data. To effectively develop modern detection methodologies, there exists a need to acquire data that can fully encompass the behaviors of persistent and emerging threats. When collecting modern day network traffic for intrusion detection, substantial amounts of traffic can be collected, much of which consists of relatively few attack instances as compared to normal traffic. This skewed distribution between normal and attack data can lead to high levels of class imbalance. Machine learning techniques can be used to aid in attack detection, but large levels of imbalance between normal (majority) and attack (minority) instances can lead to inaccurate detection results.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Recent technological developments have engendered an expeditious production of big data and also enabled machine learning algorithms to produce high-performance models from such data. Nonetheless, class imbalance (in binary classification) between the majority and minority classes in big data can skew the predictive performance of classification algorithms toward the majority (negative) class, whereas the minority (positive) class usually holds greater value for decision makers. Such bias may lead to adverse consequences, some of them even life-threatening, when false negatives are generally costlier than false positives. The size of the minority class can vary from fair to extraordinarily small, which can lead to different performance scores for machine learning algorithms. Class imbalance is a well-studied area for traditional data, i.e., not big data. However, there is limited research focusing on both rarity and severe class imbalance in big data.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Melanoma is one of the fastest growing cancers in the world, and can affect patients earlier in life than most other cancers. Therefore, it is imperative to be able to identify patients at high risk for melanoma and enroll them in screening programs to detect the cancer early. Electronic health records collect an enormous amount of data about real-world patient encounters, treatments, and outcomes. This data can be mined to increase our understanding of melanoma as well as build personalized models to predict risk of developing the cancer. Cancer risk models built from structured clinical data are limited in current research, with most studies involving just a few variables from institutional databases or registries. This dissertation presents data processing and machine learning approaches to build melanoma risk models from a large database of de-identified electronic health records. The database contains consistently captured structured data, enabling the extraction of hundreds of thousands of data points each from millions of patient records. Several experiments are performed to build effective models, particularly to predict sentinel lymph node metastasis in known melanoma patients and to predict individual risk of developing melanoma. Data for these models suffer from high dimensionality and class imbalance. Thus, classifiers such as logistic regression, support vector machines, random forest, and XGBoost are combined with advanced modeling techniques such as feature selection and data sampling. Risk factors are evaluated using regression model weights and decision trees, while personalized predictions are provided through random forest decomposition and Shapley additive explanations. Random undersampling on the melanoma risk dataset shows that many majority samples can be removed without a decrease in model performance. 
To determine how much data is truly needed, we explore learning curve approximation methods on the melanoma data and three publicly-available large-scale biomedical datasets. We apply an inverse power law model as well as introduce a novel semi-supervised curve creation method that utilizes a small amount of labeled data.
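The inverse power law mentioned above can be sketched as follows. This is a simplified illustration, not the dissertation's method: it assumes the form error ≈ a · n^(−b) with no additive offset term, so that the fit reduces to ordinary least squares in log-log space. The function names and the sample numbers in the usage note are hypothetical.

```python
import math


def fit_inverse_power_law(sizes, errors):
    """Fit error ~= a * n**(-b) by least squares on log(error) vs. log(n).

    Assumption: no additive offset term, and all sizes and errors are positive.
    Returns the pair (a, b).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_error(a, b, n):
    """Extrapolate the fitted curve to a training set of size n."""
    return a * n ** (-b)
```

Given error measurements at a few small training set sizes, `fit_inverse_power_law` estimates the curve and `predict_error` extrapolates it to larger sizes, which is one way to estimate how much additional data would be worthwhile.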
Model
Digital Document
Publisher
Florida Atlantic University
Description
The United States (U.S.) healthcare system produces an enormous volume of data with a vast number of financial transactions generated by physicians administering healthcare services. This makes healthcare fraud difficult to detect, especially when there are considerably fewer fraudulent transactions than non-fraudulent ones. Fraud is an extremely important issue for healthcare, as fraudulent activities within the U.S. healthcare system contribute to significant financial losses. In the U.S., the elderly population continues to rise, increasing the need for programs, such as Medicare, to help with associated medical expenses. Unfortunately, due to healthcare fraud, these programs are being adversely affected, draining resources and reducing the quality and accessibility of necessary healthcare services. In response, advanced data analytics have recently been explored to detect possible fraudulent activities. The Centers for Medicare and Medicaid Services (CMS) released several ‘Big Data’ Medicare claims datasets for different parts of their Medicare program to help facilitate this effort. In this dissertation, we employ three CMS Medicare Big Data datasets to evaluate the fraud detection performance achievable using advanced data analytics techniques, specifically machine learning. We use two distinct approaches, designated as anomaly detection and traditional fraud detection, each of which has its own data processing and feature engineering. Anomaly detection experiments classify by provider specialty, determining whether outlier physicians within the same specialty signal fraudulent behavior. Traditional fraud detection refers to the experiments directly classifying physicians as fraudulent or non-fraudulent, leveraging machine learning algorithms to discriminate between classes.
We present our novel data engineering approaches for both anomaly detection and traditional fraud detection including data processing, fraud mapping, and the creation of a combined dataset consisting of all three Medicare parts. We incorporate the List of Excluded Individuals and Entities database to identify real world fraudulent physicians for model evaluation. Regarding features, the final datasets for anomaly detection contain only claim counts for every procedure a physician submits while traditional fraud detection incorporates aggregated counts and payment information, specialty, and gender. Additionally, we compare cross-validation to the real world application of building a model on a training dataset and evaluating on a separate test dataset for severe class imbalance and rarity.
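The fraud mapping step described above can be pictured as a join between provider records and an exclusion list. The sketch below is only an assumption about the general shape of such a step: the field name `provider_id`, the dictionary layout, and the function `label_providers` are hypothetical and are not taken from the dissertation or from the CMS or LEIE file formats.

```python
def label_providers(provider_rows, excluded_ids):
    """Attach a binary fraud label to each provider record.

    provider_rows: list of dicts, each with a hypothetical "provider_id" key.
    excluded_ids: iterable of identifiers that appear on an exclusion list.
    Returns new dicts; the input rows are left unmodified.
    """
    excluded = set(excluded_ids)
    return [
        {**row, "fraud": int(row["provider_id"] in excluded)}
        for row in provider_rows
    ]
```

Because excluded providers are rare relative to all providers, labels produced this way tend to be severely imbalanced, which is the condition the abstract sets out to study.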
Model
Digital Document
Publisher
Florida Atlantic University
Description
Effective classification with imbalanced data is an important area of research, as high class imbalance is naturally inherent in many real-world applications, e.g. anomaly detection. Modeling such skewed data distributions is often very difficult, and non-standard methods are sometimes required to combat these negative effects. These challenges have been studied thoroughly using traditional machine learning algorithms, but very little empirical work exists in the area of deep learning with class imbalanced big data. Following an in-depth survey of deep learning methods for addressing class imbalance, we evaluate various methods for addressing imbalance on the task of detecting Medicare fraud, a big data problem characterized by extreme class imbalance. Case studies herein demonstrate the impact of class imbalance on neural networks, evaluate the efficacy of data-level and algorithm-level methods, and achieve state-of-the-art results on the given Medicare data set. Results indicate that combining under-sampling and over-sampling maximizes both performance and efficiency.
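To make the idea of combining under-sampling and over-sampling concrete, here is a minimal sketch using plain random sampling. It is an illustration under stated assumptions, not the procedure evaluated in the dissertation: the function name `combine_sampling`, the `target_per_class` parameter, and the choice of purely random selection are all hypothetical.

```python
import random


def combine_sampling(majority, minority, target_per_class, seed=0):
    """Random under-sampling of the majority class combined with random
    over-sampling of the minority class.

    Assumption: both classes are moved toward the same target size.
    A minority class already larger than the target is left unchanged.
    """
    rng = random.Random(seed)
    # Under-sample: draw majority examples without replacement.
    kept_majority = rng.sample(majority, min(target_per_class, len(majority)))
    # Over-sample: keep every minority example, then add random duplicates.
    extra = max(0, target_per_class - len(minority))
    grown_minority = list(minority) + [rng.choice(minority) for _ in range(extra)]
    return kept_majority, grown_minority
```

Under-sampling shrinks the training set, which reduces training cost, while over-sampling keeps every minority example in play; that trade-off is one plausible reading of why the combination is reported to help both performance and efficiency.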
Model
Digital Document
Publisher
Florida Atlantic University
Description
This thesis surveys the landscape of Data Augmentation for image datasets. Completing this survey inspired further study into a method of generative modeling known as Generative Adversarial Networks (GANs). A survey on GANs was conducted to understand recent developments and the problems related to training them. Following this survey, four experiments were proposed to test the application of GANs for data augmentation and to contribute to the quality improvement in GAN-generated data. Experimental results demonstrate the effectiveness of GAN-generated data as a pre-training metric. The other experiments discuss important characteristics of GAN models such as the refining of prior information, transferring generative models from large datasets to small data, and automating the design of Deep Neural Networks within the context of the GAN framework. This thesis provides readers with a complete introduction to Data Augmentation and Generative Adversarial Networks, as well as insights into the future of these techniques.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Healthcare is an integral component in people's lives, especially for the rising elderly population, and must be affordable. The United States Medicare program is vital in serving the needs of the elderly. The growing number of people enrolled in the Medicare program, along with the enormous volume of money involved, increases the appeal for, and risk of, fraudulent activities. For many real-world applications, including Medicare fraud, the interesting observations tend to be less frequent than the normative observations. This difference between the normal observations and those observations of interest can create highly imbalanced datasets. The problem of class imbalance, including the classification of rare cases indicating extreme class imbalance, is an important and well-studied area in machine learning. Research on the effects of class imbalance with big data in the real-world Medicare fraud application domain, however, is limited. In particular, detecting fraud in Medicare claims is critical in lessening the financial and personal impacts of these transgressions. Fortunately, the healthcare domain is one such area where the successful detection of fraud can garner meaningful positive results. The application of machine learning techniques, plus methods to mitigate the adverse effects of class imbalance and rarity, can be used to detect fraud and lessen the impacts for all Medicare beneficiaries. This dissertation presents the application of machine learning approaches to detect Medicare provider claims fraud in the United States. We discuss novel techniques to process three big Medicare datasets and create a new, combined dataset, which includes mapping fraud labels associated with known excluded providers. We investigate the ability of machine learning techniques, unsupervised and supervised, to detect Medicare claims fraud and leverage data sampling methods to lessen the impact of class imbalance and increase fraud detection performance. Additionally, we extend the study of class imbalance to assess the impacts of rare cases in big data for Medicare fraud detection.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Many current application domains of machine learning and artificial intelligence involve knowledge discovery from text, such as sentiment analysis, document ontology, and spam detection. Humans have years of experience and training with language, enabling them to understand complicated, nuanced text passages with relative ease. A text classifier attempts to emulate or replicate this knowledge so that computers can discriminate between concepts encountered in text; however, learning high-level concepts from text, such as those found in many applications of text classification, is a challenging task due to the many challenges associated with text mining and classification. Recently, classifiers trained using artificial neural networks have been shown to be effective for a variety of text mining tasks. Convolutional neural networks have been trained to classify text from character-level input, automatically learning high-level abstract representations and avoiding the need for human-engineered features.
This dissertation proposes two new techniques for character-level learning, log(m) character embedding and convolutional window classification. Log(m) embedding is a new character-vector representation for text data that is more compact and memory efficient than previous embedding vectors. Convolutional window classification is a technique for classifying long documents, i.e. documents with lengths exceeding the input dimension of the neural network. Additionally, we investigate the performance of convolutional neural networks combined with long short-term memory networks, explore how document length impacts classification performance and compare performance of neural networks against non-neural network-based learners in text classification tasks.
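The abstract does not spell out how convolutional window classification works. One plausible reading, sketched below purely as an assumption, is to split a long document into fixed-size character windows that each fit the network's input dimension, score every window, and aggregate the scores. The function names, the `stride` parameter, the averaging step, and the `score_window` callable (a stand-in for a trained network) are all hypothetical.

```python
def split_into_windows(text, window_size, stride):
    """Split text into fixed-size character windows.

    A document no longer than window_size is returned as a single window.
    A final window aligned to the end of the text is added when the stride
    would otherwise leave trailing characters uncovered.
    """
    if len(text) <= window_size:
        return [text]
    last_start = len(text) - window_size
    windows = [text[i:i + window_size] for i in range(0, last_start + 1, stride)]
    if last_start % stride != 0:
        windows.append(text[-window_size:])
    return windows


def classify_document(text, score_window, window_size, stride):
    """Average per-window scores into one document-level score.

    score_window is a stand-in for a trained model that maps one window
    of text to a number; averaging is an assumed aggregation rule.
    """
    windows = split_into_windows(text, window_size, stride)
    scores = [score_window(w) for w in windows]
    return sum(scores) / len(scores)
```

With a stride smaller than the window size the windows overlap, which trades extra computation for a lower chance of cutting a relevant passage in half.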
Model
Digital Document
Publisher
Florida Atlantic University
Description
The population of people ages 65 and older has increased since the 1960s and current estimates indicate it will double by 2060. Medicare is a federal health insurance program for people 65 or older in the United States. Medicare claims fraud and abuse is an ongoing issue that wastes a large amount of money every year, resulting in higher health care costs and taxes for everyone. In this study, an empirical evaluation of several unsupervised machine learning approaches is performed which indicates reasonable fraud detection results. We employ two unsupervised machine learning algorithms, Isolation Forest and Unsupervised Random Forest, which have not been previously used for the detection of fraud and abuse on Medicare data. Additionally, we implement three other machine learning methods previously applied to Medicare data: Local Outlier Factor, Autoencoder, and k-Nearest Neighbor. For our dataset, we combine the 2012 to 2015 Medicare provider utilization and payment data and add fraud labels from the List of Excluded Individuals/Entities (LEIE) database. Results show that Local Outlier Factor is the best model to use for Medicare fraud detection.
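As a small illustration of unsupervised outlier scoring, the sketch below uses a nearest-neighbor distance score: records that lie far from their closest neighbors receive high scores. It is a generic textbook formulation, not the configuration used in this study, and the function name and toy data are hypothetical.

```python
import math


def knn_outlier_scores(points, k):
    """Score each point by the mean Euclidean distance to its k nearest
    neighbors; larger scores suggest more unusual points.

    Assumption: points is a list of equal-length numeric tuples and
    k is at most len(points) - 1.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores
```

No fraud labels are needed to compute the scores; labels such as those drawn from the LEIE database would only be used afterward to evaluate how well high scores line up with known excluded providers.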