Khoshgoftaar, Taghi M.

Relationships
Member of: Thesis advisor
Person Preferred Name
Khoshgoftaar, Taghi M.
Model
Digital Document
Publisher
Florida Atlantic University
Description
In the modern data landscape, vast amounts of unlabeled data are continuously generated, necessitating the development of robust unsupervised techniques for handling such data. This is especially true in fraud detection and healthcare analyses, where data is often significantly imbalanced. This dissertation focuses on novel techniques for handling imbalanced data, with specific emphasis on a novel unsupervised class labeling technique for unlabeled fraud detection datasets and unlabeled cognitive datasets. Traditional supervised machine learning relies on labeled data, which is often expensive and difficult to create, particularly in domains requiring expert input. Additionally, such datasets suffer from challenges associated with class imbalance, where one class has significantly fewer examples than another, complicating model training and significantly reducing performance. The primary objectives of this dissertation include developing a novel unsupervised cleaning method and an innovative unsupervised class labeling method. We validate and evaluate our methods across various datasets, which include two Medicare fraud detection datasets, a credit card fraud detection dataset, and three datasets used for detecting cognitive decline.
Our unique approach involves using an unsupervised autoencoder to learn from dataset features and synthesize labels. Primarily targeting imbalanced datasets, but still effective for balanced datasets, our method calculates an error metric for each instance. This metric is used to distinguish between fraudulent and legitimate cases, allowing us to assign a binary class label. To further improve label generation, we integrate an unsupervised feature selection method that ranks and identifies the most important features without using class labels.
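The idea of turning a per-instance error metric into a binary label can be illustrated with a minimal sketch. It is not the dissertation's method: a PCA projection stands in for the autoencoder as a linear encoder/decoder, and the function name, the `flag_fraction` parameter, and the quantile-based cutoff are all assumptions made for illustration.

```python
import numpy as np

def label_by_reconstruction_error(X, n_components=2, flag_fraction=0.05):
    """Label the instances with the highest reconstruction error as 1, the rest as 0."""
    X = np.asarray(X, dtype=float)
    # Standardize so that no single feature dominates the error.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    Z = (X - X.mean(axis=0)) / std
    # Linear stand-in for an autoencoder: project onto the top principal
    # directions (encode) and map back to feature space (decode).
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    components = vt[:n_components]
    reconstructed = Z @ components.T @ components
    errors = ((Z - reconstructed) ** 2).mean(axis=1)
    # Assumed rule: the top `flag_fraction` of errors form the positive class.
    threshold = np.quantile(errors, 1.0 - flag_fraction)
    labels = (errors > threshold).astype(int)
    return labels, errors

rng = np.random.default_rng(0)
# 500 "legitimate" rows that lie close to a 2-D subspace of a 5-D feature space.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
normal = latent @ mixing + 0.05 * rng.normal(size=(500, 5))
# 10 unusual rows that ignore that structure.
odd = rng.normal(scale=3.0, size=(10, 5))
X = np.vstack([normal, odd])

labels, errors = label_by_reconstruction_error(X, n_components=2, flag_fraction=0.02)
```

A nonlinear autoencoder would replace only the encode/decode step; the thresholding of the error metric would stay the same.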
Model
Digital Document
Publisher
Florida Atlantic University
Description
This dissertation explores one-class classification (OCC) in the context of big data and fraud detection, addressing challenges posed by imbalanced datasets. A detailed survey of OCC-related literature forms a core part of the study, categorizing works into outlier detection, novelty detection, and deep learning applications. This survey reveals a gap in the application of OCC to the inherent problems of big data, such as class rarity and noisy data. Building upon the foundational insights gained from the comprehensive literature review on OCC, the dissertation progresses to a detailed comparative analysis between OCC and binary classification methods. This comparison is pivotal in understanding their respective strengths and limitations across various applications, emphasizing their roles in addressing imbalanced datasets. The research then specifically evaluates binary and OCC using credit card fraud data. This practical application highlights the nuances and effectiveness of these classification methods in real-world scenarios, offering insights into their performance in detecting fraudulent activities. After the evaluation of binary and OCC using credit card fraud data, the dissertation extends this inquiry with a detailed investigation into the effectiveness of both methodologies in fraud detection. This extended analysis involves utilizing not only the Credit Card Fraud Detection Dataset but also the Medicare Part D dataset. The findings show the comparative performance and suitability of these classification methods in practical fraud detection scenarios. Finally, the dissertation examines the impact of training OCC algorithms on majority versus minority classes, using the two previously mentioned datasets in addition to Medicare Part B and Durable Medical Equipment, Prosthetics, Orthotics and Supplies (DMEPOS) datasets. 
This exploration offers critical insights into model training strategies and their implications, suggesting that training on the majority class can often lead to more robust classification results. In summary, this dissertation provides a deep understanding of OCC, effectively bridging theoretical concepts with novel applications in big data and fraud detection. It contributes to the field by offering a comprehensive analysis of OCC methodologies, their practical implications, and their effectiveness in addressing class imbalance in big data.
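Training a one-class model on the majority class alone can be sketched as follows. The abstract does not name its OCC algorithms, so this sketch assumes a simple Mahalanobis-distance detector as a stand-in; the class name, the 0.99 quantile, and the synthetic data are hypothetical.

```python
import numpy as np

class MahalanobisOneClass:
    """One-class detector fit on a single class; distant points are flagged as 1."""

    def fit(self, X, quantile=0.99):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        # A small ridge keeps the covariance matrix invertible.
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        self.inv_cov_ = np.linalg.inv(cov)
        # Assumed rule: the cutoff is a high quantile of the training distances.
        self.threshold_ = np.quantile(self._distance(X), quantile)
        return self

    def _distance(self, X):
        diff = np.asarray(X, dtype=float) - self.mean_
        # Squared Mahalanobis distance of every row.
        return np.einsum("ij,jk,ik->i", diff, self.inv_cov_, diff)

    def predict(self, X):
        return (self._distance(X) > self.threshold_).astype(int)

rng = np.random.default_rng(0)
majority_train = rng.normal(size=(2000, 3))        # e.g. legitimate records
majority_test = rng.normal(size=(500, 3))
minority_test = rng.normal(loc=6.0, size=(20, 3))  # e.g. fraudulent records

# Only majority-class rows are seen during training.
detector = MahalanobisOneClass().fit(majority_train)
false_alarm_rate = detector.predict(majority_test).mean()
detection_rate = detector.predict(minority_test).mean()
```

Fitting on the majority class gives the model thousands of examples from which to estimate what normal looks like, which is one plausible reason such training could be more robust than fitting on a handful of minority instances.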
Model
Digital Document
Publisher
Florida Atlantic University
Description
With the recent large-scale adoption of Large Language Models in multidisciplinary research and the commercial space, the need for large amounts of labeled data has become more crucial than ever to evaluate potential use cases for opportunities in applied intelligence. Most domain-specific fields require a substantial shift that involves extremely large amounts of heterogeneous data to have a meaningful impact on the pre-computed weights of most large language models. We explore extending the capabilities of a state-of-the-art unsupervised pre-training method, the Transformer-based Sequential Denoising Auto-Encoder (TSDAE). In this study we show various opportunities for using OCR2Seq, a multi-modal generative augmentation strategy, to further enhance and measure the quality of the noise samples used when TSDAE serves as a pretraining task. This study is a first-of-its-kind work that leverages converting both generalized and sparse domains of relational data into multi-modal sources. Our primary objective is measuring the quality of augmentation in relation to the current implementation of the sentence transformers library. Further work includes the effect on ranking, language understanding, and corrective quality.
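For context on what a noise sample is in this setting, the sketch below shows token-deletion noise, the kind of corruption a denoising auto-encoder is trained to undo. It assumes whitespace tokenization and a deletion ratio of 0.6; the function name is hypothetical, and this is not OCR2Seq or the sentence transformers implementation.

```python
import random

def delete_tokens(sentence, del_ratio=0.6, rng=None):
    """Return a copy of `sentence` with roughly `del_ratio` of its tokens removed."""
    rng = rng or random.Random()
    tokens = sentence.split()  # assumption: whitespace tokenization
    if not tokens:
        return sentence
    kept = [tok for tok in tokens if rng.random() >= del_ratio]
    if not kept:
        # Always keep at least one token so the encoder input is never empty.
        kept = [rng.choice(tokens)]
    return " ".join(kept)

sentence = "Medicare providers submit claims for procedures and prescribed drugs"
noisy = delete_tokens(sentence, rng=random.Random(0))
# The model would be trained to reconstruct `sentence` from `noisy`.
```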
Model
Digital Document
Publisher
Florida Atlantic University
Description
In today’s world, data is generated at an unprecedented rate, and a significant portion of it is unstructured text data. The recent advancements in Natural Language Processing have enabled computers to understand and interpret human language. Data mining techniques were once unable to use text data due to the high dimensionality of text processing models. This limitation was overcome with the ability to represent data as text. This thesis aims to compare the predictive performance of structured versus unstructured text data in two different applications. The first application is in the field of real estate. We compare the performance of tabular real-estate data and unstructured text descriptions of homes to predict the house price. The second application is in translating Electronic Health Records (EHR) tabular data to text data for survival classification of COVID-19 patients. Lastly, we present a range of strategies and perspectives for future research.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Recent successes of Deep Learning-powered AI are largely due to the trio of algorithms, GPU computing, and big data. Data could take the shape of hospital records, satellite images, or the text in this paragraph. Deep Learning algorithms typically need massive collections of data before they can make reliable predictions. This limitation inspired investigation into a class of techniques referred to as Data Augmentation. Data Augmentation was originally developed as a set of label-preserving transformations used in order to simulate large datasets from small ones. For example, imagine developing a classifier that categorizes images as either a “cat” or a “dog”. After initial collection and labeling, there may only be 500 of these images, which is not enough data to train a Deep Learning model. By transforming these images with Data Augmentations such as rotations and brightness modifications, more labeled images are available for model training and classification! In addition to applications for learning from limited labeled data, Data Augmentation can also be used for generalization testing. For example, we can augment the test set to set the visual style of images to “winter” and see how that impacts the performance of a stop sign detector.
The dissertation begins with an overview of Deep Learning methods such as neural network architectures, gradient descent optimization, and generalization testing. Following an initial description of this technology, the dissertation explains overfitting. Overfitting is the central problem of Deep Learning methods, in which improvements on the training set do not lead to improvements on the testing set. To the rescue are Data Augmentation techniques: the dissertation presents an overview of the augmentations used for both image and text data, as well as the promising potential of generative data augmentation with models such as ChatGPT. The dissertation then describes three major experimental works revolving around CIFAR-10 image classification, language modeling of a novel dataset of Keras information, and patient survival classification from COVID-19 Electronic Health Records. The dissertation concludes with a reflection on the evolution of limitations of Deep Learning and directions for future work.
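The label-preserving rotations and brightness modifications mentioned in this abstract can be sketched in a few lines. The function name, the 90-degree rotation steps, and the 0.7 to 1.3 brightness range are assumptions for illustration, and a random array stands in for a real photograph.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly rotated and brightness-scaled copy of an HxWxC uint8 image."""
    k = int(rng.integers(0, 4))                    # number of 90-degree turns
    rotated = np.rot90(image, k=k, axes=(0, 1))
    factor = rng.uniform(0.7, 1.3)                 # assumed brightness range
    brightened = np.clip(rotated.astype(np.float32) * factor, 0, 255)
    return brightened.astype(np.uint8)

rng = np.random.default_rng(42)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # placeholder image
label = "cat"

# Each augmented copy keeps the original label, so one labeled image becomes several.
augmented = [(augment(image, rng), label) for _ in range(4)]
```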
Model
Digital Document
Publisher
Florida Atlantic University
Description
Access to affordable healthcare is a nationwide concern that impacts most of the United States population. Medicare is a federal government healthcare program that aims to provide affordable health insurance to the elderly population and individuals with select disabilities. Unfortunately, there is a significant amount of fraud, waste, and abuse within the Medicare system that inevitably raises premiums and costs taxpayers billions of dollars each year. Dedicated task forces investigate the most severe fraudulent cases, but with millions of healthcare providers and more than 60 million active Medicare beneficiaries, manual fraud detection efforts are not able to make widespread, meaningful impact. Through the proliferation of electronic health records and continuous breakthroughs in data mining and machine learning, there is a great opportunity to develop and leverage advanced machine learning systems for automating healthcare fraud detection.
This dissertation identifies key challenges associated with predictive modeling for large-scale Medicare fraud detection and presents innovative solutions to address these challenges in order to provide state-of-the-art results on multiple real-world Medicare fraud data sets. Our methodology for curating nine distinct Medicare fraud classification data sets is presented with comprehensive details describing data accumulation, data pre-processing, data aggregation techniques, data enrichment strategies, and improved fraud labeling. Data-level and algorithm-level methods for treating severe class imbalance, including a flexible output thresholding method and a cost-sensitive framework, are evaluated using deep neural network and ensemble learners. Novel encoding techniques and representation learning methods for high-dimensional categorical features are proposed to create expressive representations of provider attributes and billing procedure codes.
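Output thresholding for imbalanced data can be sketched as a search for a decision threshold that balances the two classes. The abstract does not specify how its threshold is chosen, so the criterion here (maximizing the geometric mean of true positive rate and true negative rate) and the function name are assumptions.

```python
import numpy as np

def best_threshold(y_true, scores, candidates=None):
    """Pick the score threshold that maximizes sqrt(TPR * TNR)."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    if candidates is None:
        candidates = np.unique(scores)
    best_t, best_g = 0.5, -1.0
    for t in candidates:
        pred = scores >= t
        tpr = (pred & y_true).sum() / max(y_true.sum(), 1)
        tnr = (~pred & ~y_true).sum() / max((~y_true).sum(), 1)
        g = np.sqrt(tpr * tnr)
        if g > best_g:
            best_t, best_g = float(t), float(g)
    return best_t, best_g

# With rare positives, a classifier's scores tend to sit well below 0.5.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.12, 0.30, 0.20, 0.25]
threshold, g_mean = best_threshold(y_true, scores)
```

In this toy example the default threshold of 0.5 would miss both positive instances, while the selected threshold recovers them at the cost of one false positive. In practice the threshold would be chosen on validation data rather than on the test set.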
Model
Digital Document
Publisher
Florida Atlantic University
Description
The proliferation of Internet of Things (IoT) devices in various networks is being matched by an increase in related cybersecurity risks. To help counter these risks, big datasets such as Bot-IoT were designed to train machine learning algorithms on network-based intrusion detection for IoT devices. From a binary classification perspective, there is a high class imbalance in Bot-IoT between each of the attack categories and the normal category, and also between the combined attack categories and the normal category. Within the scope of predicting botnet attacks in IoT networks, this dissertation demonstrates the usefulness and efficiency of novel machine learning methods, such as an easy-to-classify method and a unique set of ensemble feature selection techniques. The focus of this work is on the full Bot-IoT dataset, as well as each of the four attack categories of Bot-IoT, namely, Denial-of-Service (DoS), Distributed Denial-of-Service (DDoS), Reconnaissance, and Information Theft. Since resources and services become inaccessible during DoS and DDoS attacks, this interruption is costly to an organization in terms of both time and money. Reconnaissance attacks often signify the first stage of a cyberattack, and preventing them from occurring usually means the end of the intended cyberattack. Information Theft attacks not only erode consumer confidence but may also compromise intellectual property and national security. For the DoS experiment, the ensemble feature selection approach led to the best performance, while for the DDoS experiment, the full set of Bot-IoT features resulted in the best performance. Regarding the Reconnaissance experiment, the ensemble feature selection approach produced the best performance. In relation to the Information Theft experiment, the ensemble feature selection techniques did not affect performance, positively or negatively. 
However, the ensemble feature selection approach is recommended for this experiment because feature reduction eases computational burden and may provide clarity through improved data visualization. For the full Bot-IoT big dataset, an explainable machine learning approach was taken using the Decision Tree classifier. An easy-to-learn Decision Tree model for predicting attacks was obtained with only three features, which is a significant result for big data.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Application-layer attacks are becoming an increasingly attractive option for hackers targeting computer networks. From complex rootkits to Denial of Service (DoS) attacks, hackers look to compromise computer networks. Web and application servers can be shut down by various application-layer DoS attacks, which exhaust CPU or memory resources. The HTTP protocol has become a popular target for launching application-layer DoS attacks. These exploits consume less bandwidth than traditional DoS attacks. Furthermore, this type of DoS attack is hard to detect because its network traffic resembles legitimate network requests. Being able to detect these DoS attacks effectively is a critical component of any robust cybersecurity system. Machine learning can help detect DoS attacks by identifying patterns in network traffic. With machine learning methods, predictive models can automatically detect network threats.
This dissertation offers a novel framework for collecting several attack datasets on a live production network, where producing quality representative data is a requirement. Our approach builds datasets from collected Netflow and Full Packet Capture (FPC) data. We evaluate a wide range of machine learning classifiers, which allows us to analyze slow DoS detection models more thoroughly. To identify attacks, we look at each dataset's unique traffic patterns and distinguishing properties. This research evaluates and investigates appropriate feature selection evaluators and search strategies. Features are assessed for their predictive value and degree of redundancy to build a subset of features. Feature subsets with high class correlation but low intercorrelation are favored. Experimental results indicate Netflow and FPC features are discriminating enough to detect DoS attacks accurately. We conduct a comparative examination of performance metrics to determine the capability of several machine learning classifiers. Additionally, we improve upon our performance scores by investigating a variety of feature selection optimization strategies. Overall, this dissertation proposes a novel machine learning approach for detecting slow DoS attacks. Our machine learning results demonstrate that a single subset of features trained on Netflow data can effectively detect slow application-layer DoS attacks.
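The preference for feature subsets with high class correlation but low intercorrelation resembles the merit score used in correlation-based feature selection (CFS). Whether the dissertation uses this exact score is an assumption; the sketch below uses absolute Pearson correlation and a hypothetical `cfs_merit` helper on synthetic data.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS-style merit: k * mean|r_cf| / sqrt(k + k * (k - 1) * mean|r_ff|)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    k = len(subset)
    # Mean absolute feature-class correlation.
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    # Mean absolute feature-feature correlation over all pairs in the subset.
    r_ff = np.mean([
        abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
        for i, a in enumerate(subset)
        for b in subset[i + 1:]
    ])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=1000)
f0 = y + rng.normal(scale=0.5, size=1000)
f1 = f0 + rng.normal(scale=0.01, size=1000)  # near-copy of f0 (redundant)
f2 = y + rng.normal(scale=0.5, size=1000)    # informative, with independent noise
X = np.column_stack([f0, f1, f2])

redundant = cfs_merit(X, y, [0, 1])
complementary = cfs_merit(X, y, [0, 2])
```

The pair of features with independent noise scores higher than the pair of near-duplicates, even though every feature is about equally correlated with the class.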
Model
Digital Document
Publisher
Florida Atlantic University
Description
Machine learning is having an increased impact on the Cyber Security landscape. The ability of predictive models to accurately identify attack patterns in security data is set to overtake more traditional detection methods. Industry demand has led to an uptick in research in the application of machine learning for Cyber Security. To facilitate this research, many datasets have been created and made public. This thesis provides an in-depth analysis of one of the newest datasets, Bot-IoT. The full dataset contains about 73 million instances (big data), 3 dependent features, and 43 independent features. The purpose of this thesis is to provide researchers with a foundational understanding of Bot-IoT, its development, its features, its composition, and its pitfalls. It will also summarize many of the published works that utilize Bot-IoT and will propose new areas of research based on the issues identified in the current research and in the dataset.
Model
Digital Document
Publisher
Florida Atlantic University
Description
The Internet has provided humanity with many great benefits, but it has also introduced new risks and dangers. E-commerce and other web portals have become large industries with big data. Criminals and other bad actors constantly seek to exploit these web properties through web attacks. Being able to properly detect these web attacks is a crucial component in the overall cybersecurity landscape. Machine learning is one tool that can assist in detecting web attacks. However, properly using machine learning to detect web attacks does not come without its challenges. Classification algorithms can have difficulty with severe levels of class imbalance. Class imbalance occurs when one class label disproportionately outnumbers another class label. For example, in cybersecurity, it is common for the negative (normal) label to severely outnumber the positive (attack) label. Another difficulty encountered in machine learning is that models can be complex, making it difficult for even subject matter experts to truly understand a model’s detection process. Moreover, it is important for practitioners to determine which input features to include or exclude in their models for optimal detection performance. This dissertation studies machine learning algorithms in detecting web attacks with big data. Severe class imbalance is a common problem in cybersecurity, and mainstream machine learning research does not sufficiently consider this with web attacks. Our research first investigates the problems associated with severe class imbalance and rarity. Rarity is an extreme form of class imbalance in which the positive class has an extremely low instance count, making it difficult for classifiers to discriminate. In reducing imbalance, we demonstrate that random undersampling can effectively mitigate the class imbalance and rarity problems associated with web attacks. 
Furthermore, our research introduces a novel feature popularity technique which produces easier-to-understand models by including only a smaller set of the most popular features. Feature popularity granted us new insights into the web attack detection process, even though we had already intensely studied it. Even so, we proceed cautiously in selecting the best input features, as we determined that the “most important” Destination Port feature might be contaminated by lopsided traffic distributions.
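Random undersampling, as mentioned in this abstract, keeps every positive (attack) instance and discards a random portion of the negative (normal) instances. The sketch below assumes binary 0/1 labels; the function name and the `ratio` parameter (negatives kept per positive) are hypothetical.

```python
import numpy as np

def random_undersample(X, y, ratio=1.0, seed=0):
    """Keep all positives and a random subset of negatives (about ratio:1)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept_neg = rng.choice(neg, size=n_keep, replace=False)
    # Sort so the retained rows stay in their original order.
    idx = np.sort(np.concatenate([pos, kept_neg]))
    return X[idx], y[idx]

# 990 normal instances and 10 attack instances: a 99:1 imbalance.
y = np.array([0] * 990 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)
X_bal, y_bal = random_undersample(X, y, ratio=1.0)
```

Undersampling should be applied to the training split only, so that evaluation still reflects the original class distribution.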