Big data

Model
Digital Document
Publisher
Florida Atlantic University
Description
In the modern data landscape, vast amounts of unlabeled data are continuously generated, necessitating the development of robust unsupervised techniques for handling such data. This is particularly true in fraud detection and healthcare analytics, where data is often also significantly imbalanced. This dissertation focuses on novel techniques for handling imbalanced data, with specific emphasis on a novel unsupervised class labeling technique for unlabeled fraud detection datasets and unlabeled cognitive datasets. Traditional supervised machine learning relies on labeled data, which is often expensive and difficult to create, particularly in domains requiring expert input. Additionally, such datasets suffer from class imbalance, where one class has significantly fewer examples than another, complicating model training and significantly reducing performance. The primary objectives of this dissertation include developing a novel unsupervised cleaning method and an innovative unsupervised class labeling method. We validate and evaluate our methods across various datasets, which include two Medicare fraud detection datasets, a credit card fraud detection dataset, and three datasets used for detecting cognitive decline.
Our unique approach involves using an unsupervised autoencoder to learn from dataset features and synthesize labels. Primarily targeting imbalanced datasets, but still effective for balanced datasets, our method calculates an error metric for each instance. This metric is used to distinguish between fraudulent and legitimate cases, allowing us to assign a binary class label. To further improve label generation, we integrate an unsupervised feature selection method that ranks and identifies the most important features without using class labels.
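As a rough illustration of the labeling idea described above, the following Python sketch scores each instance by reconstruction error and flags the highest-error instances as the positive class. It is not the dissertation's implementation. Several things are assumptions made for the example: the autoencoder is reduced to a linear model with a single bottleneck unit and tied weights (fitted by power iteration, which reaches the same optimum as the first principal component), the error metric is squared reconstruction error, and the cutoff is a fixed fraction of instances (the hypothetical parameter positive_fraction).

```python
import math


def fit_linear_autoencoder(rows, iters=100):
    """Fit a one-unit, tied-weight linear autoencoder (assumed stand-in model)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    w = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        # Power iteration step: w <- X^T (X w), then renormalise to unit length.
        codes = [sum(c[j] * w[j] for j in range(d)) for c in centered]
        cw = [sum(codes[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(v * v for v in cw))
        w = [v / norm for v in cw]
    return mean, w


def reconstruction_error(row, mean, w):
    """Squared error between a centered row and its encode/decode round trip."""
    c = [x - m for x, m in zip(row, mean)]
    code = sum(ci * wi for ci, wi in zip(c, w))  # encode to one number
    recon = [code * wi for wi in w]              # decode back to feature space
    return sum((ci - ri) ** 2 for ci, ri in zip(c, recon))


def synthesize_labels(rows, positive_fraction):
    """Label the highest-error instances 1 and the rest 0 (assumed cutoff rule)."""
    mean, w = fit_linear_autoencoder(rows)
    errors = [reconstruction_error(r, mean, w) for r in rows]
    n_pos = max(1, round(positive_fraction * len(rows)))
    cutoff = sorted(errors, reverse=True)[n_pos - 1]
    return [1 if e >= cutoff else 0 for e in errors], errors


# Toy data: 20 points on the line y = 2x plus two points far from that line.
rows = [(t, 2 * t) for t in range(20)] + [(5, 30), (15, 0)]
labels, errors = synthesize_labels(rows, positive_fraction=0.1)
print(labels)
```

In this toy run the two off-line points receive much larger errors than the rest, so they are the ones labeled 1. A nonlinear autoencoder and a data-driven cutoff would replace the simplified pieces in practice.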
Model
Digital Document
Publisher
Florida Atlantic University
Description
This dissertation explores one-class classification (OCC) in the context of big data and fraud detection, addressing challenges posed by imbalanced datasets. A detailed survey of OCC-related literature forms a core part of the study, categorizing works into outlier detection, novelty detection, and deep learning applications. This survey reveals a gap in the application of OCC to the inherent problems of big data, such as class rarity and noisy data. Building upon the insights gained from this literature review, the dissertation progresses to a detailed comparative analysis between OCC and binary classification methods. This comparison is pivotal in understanding their respective strengths and limitations across various applications, emphasizing their roles in addressing imbalanced datasets. The research then specifically evaluates binary classification and OCC using credit card fraud data. This practical application highlights the nuances and effectiveness of these classification methods in real-world scenarios, offering insights into their performance in detecting fraudulent activities. The dissertation then extends this inquiry with a detailed investigation into the effectiveness of both methodologies in fraud detection, using not only the Credit Card Fraud Detection Dataset but also the Medicare Part D dataset. The findings show the comparative performance and suitability of these classification methods in practical fraud detection scenarios. Finally, the dissertation examines the impact of training OCC algorithms on majority versus minority classes, using the two previously mentioned datasets in addition to the Medicare Part B and Durable Medical Equipment, Prosthetics, Orthotics and Supplies (DMEPOS) datasets.
This exploration offers critical insights into model training strategies and their implications, suggesting that training on the majority class can often lead to more robust classification results. In summary, this dissertation provides a deep understanding of OCC, effectively bridging theoretical concepts with novel applications in big data and fraud detection. It contributes to the field by offering a comprehensive analysis of OCC methodologies, their practical implications, and their effectiveness in addressing class imbalance in big data.
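To make the idea of training a one-class model on the majority class concrete, here is a minimal Python sketch. The abstract does not name the OCC algorithms used, so the model below is an assumed stand-in: it learns the centroid and per-feature spread of the majority (normal) class only, and flags any instance whose standardized distance from that centroid exceeds a quantile of the training distances. The class name CentroidOneClass and the quantile parameter are hypothetical.

```python
import math


class CentroidOneClass:
    """Distance-to-centroid one-class model (illustrative stand-in only)."""

    def fit(self, rows, quantile=0.95):
        n, d = len(rows), len(rows[0])
        self.mean = [sum(r[j] for r in rows) / n for j in range(d)]
        var = [sum((r[j] - self.mean[j]) ** 2 for r in rows) / n for j in range(d)]
        # Guard against constant features so scaling never divides by zero.
        self.std = [math.sqrt(v) if v > 0 else 1.0 for v in var]
        dists = sorted(self.distance(r) for r in rows)
        # Threshold: the training distance at the requested quantile.
        self.threshold = dists[min(n - 1, int(quantile * n))]
        return self

    def distance(self, row):
        """Standardized Euclidean distance from the training centroid."""
        return math.sqrt(
            sum(((x - m) / s) ** 2 for x, m, s in zip(row, self.mean, self.std))
        )

    def predict(self, row):
        """Return 1 for instances outside the learned region, else 0."""
        return 1 if self.distance(row) > self.threshold else 0


# Train on majority-class instances only; no minority examples are needed.
majority = [(x, y) for x in (-1, 0, 1) for y in (-1, 0, 1)]
model = CentroidOneClass().fit(majority)
print(model.predict((0.5, 0.5)), model.predict((5, 5)))
```

Because the model only ever sees the majority class, the scarcity of minority examples does not affect training, which is one intuition for why majority-class training can be the more robust choice.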
Model
Digital Document
Publisher
Florida Atlantic University
Description
The rapid growth of digital transactions and the increasing sophistication of fraudulent activities have necessitated the development of robust and efficient fraud detection techniques, particularly in the financial and healthcare sectors. This dissertation focuses on the use of novel data reduction techniques for addressing the unique challenges associated with detecting fraud in highly imbalanced Big Data, with a specific emphasis on credit card transactions and Medicare claims. The highly imbalanced nature of these datasets, where fraudulent instances constitute less than one percent of the data, poses significant challenges for traditional machine learning algorithms. The primary objectives include developing efficient data preprocessing and feature selection methods to reduce data dimensionality while preserving the most informative features, investigating various machine learning algorithms for their effectiveness in handling imbalanced data, and evaluating the proposed techniques on real-world credit card and Medicare fraud datasets.
This dissertation covers a comprehensive examination of datasets, learners, experimental methodology, sampling techniques, feature selection techniques, and hybrid techniques. Key contributions include the analysis of performance metrics in the context of newly available Big Medicare Data, experiments using Big Medicare data, application of a novel ensemble supervised feature selection technique, and the combined application of data sampling and feature selection. The research demonstrates that, across both domains, the combined application of random undersampling and ensemble feature selection significantly improves classification performance.
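The combination of random undersampling and ensemble feature selection mentioned above can be sketched in Python as follows. This is an illustration under stated assumptions, not the dissertation's technique: the abstract does not specify the ensemble's member rankers, the aggregation rule, or the order of the two steps. Here undersampling runs first, two simple hypothetical rankers (class-mean gap and absolute Pearson correlation with the label) score each feature, and the features with the best summed rank are kept.

```python
import math
import random


def random_undersample(X, y, ratio=1.0, seed=0):
    """Keep every minority row (label 1) and a random subset of majority rows."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    n_keep = min(len(neg), int(ratio * len(pos)))
    keep = sorted(pos + rng.sample(neg, n_keep))
    return [X[i] for i in keep], [y[i] for i in keep]


def mean_gap_score(col, y):
    """Absolute difference between the class means of one feature."""
    pos = [v for v, label in zip(col, y) if label == 1]
    neg = [v for v, label in zip(col, y) if label == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


def correlation_score(col, y):
    """Absolute Pearson correlation between one feature and the label."""
    mx, my = sum(col) / len(col), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
    vx = sum((a - mx) ** 2 for a in col)
    vy = sum((b - my) ** 2 for b in y)
    return abs(cov) / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def ensemble_select(X, y, k):
    """Return indices of the k features with the best summed rank."""
    d = len(X[0])
    cols = [[row[j] for row in X] for j in range(d)]
    total_rank = [0] * d
    for scorer in (mean_gap_score, correlation_score):
        scores = [scorer(c, y) for c in cols]
        order = sorted(range(d), key=lambda j: -scores[j])
        for rank, j in enumerate(order):
            total_rank[j] += rank
    return sorted(sorted(range(d), key=lambda j: total_rank[j])[:k])


# Toy data: 5 positives, 95 negatives; feature 0 is informative, feature 1 is not.
y = [1] * 5 + [0] * 95
X = [[10 * label, i % 2] for i, label in enumerate(y)]
X_bal, y_bal = random_undersample(X, y, ratio=1.0)
print(len(X_bal), ensemble_select(X_bal, y_bal, k=1))
```

The ratio argument controls how many majority rows are retained per minority row, so the same sketch covers class ratios other than 1:1.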
Model
Digital Document
Publisher
Florida Atlantic University
Description
In recent years, the state of Florida recorded thousands of abnormal traffic flows on highways that were caused by traffic incidents. Highway traffic congestion cost the US economy 101 billion dollars in 2020. Therefore, it is imperative to develop effective real-time traffic flow prediction schemes to mitigate the impact of traffic congestion. In this dissertation, we utilized real-life highway segment-based traffic and incident data obtained from the Florida Department of Transportation (FDOT) for real-time incident prediction.
We used eight years of FDOT real-life traffic and incident data for Florida's I-95 highway to build prediction models for traffic accident severity. Accurate severity prediction is beneficial for responders since it allows the emergency center to dispatch the right number of vehicles without wasting additional resources.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Access to affordable healthcare is a nationwide concern that impacts most of the United States population. Medicare is a federal government healthcare program that aims to provide affordable health insurance to the elderly population and individuals with select disabilities. Unfortunately, there is a significant amount of fraud, waste, and abuse within the Medicare system that inevitably raises premiums and costs taxpayers billions of dollars each year. Dedicated task forces investigate the most severe fraudulent cases, but with millions of healthcare providers and more than 60 million active Medicare beneficiaries, manual fraud detection efforts are not able to make a widespread, meaningful impact. Through the proliferation of electronic health records and continuous breakthroughs in data mining and machine learning, there is a great opportunity to develop and leverage advanced machine learning systems for automating healthcare fraud detection.
This dissertation identifies key challenges associated with predictive modeling for large-scale Medicare fraud detection and presents innovative solutions to address these challenges in order to provide state-of-the-art results on multiple real-world Medicare fraud data sets. Our methodology for curating nine distinct Medicare fraud classification data sets is presented with comprehensive details describing data accumulation, data pre-processing, data aggregation techniques, data enrichment strategies, and improved fraud labeling. Data-level and algorithm-level methods for treating severe class imbalance, including a flexible output thresholding method and a cost-sensitive framework, are evaluated using deep neural network and ensemble learners. Novel encoding techniques and representation learning methods for high-dimensional categorical features are proposed to create expressive representations of provider attributes and billing procedure codes.
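Output thresholding for class imbalance can be illustrated with a short Python sketch. With rare positives, a classifier's scores for the positive class are often far below the default cutoff of 0.5, so the decision threshold is instead chosen from the data. The selection criterion below (maximizing the geometric mean of true positive rate and true negative rate on held-out scores) is an assumption made for the example; the abstract does not state which criterion the dissertation's method uses, and the function name best_threshold is hypothetical.

```python
def best_threshold(scores, labels):
    """Pick the score cutoff that maximises the geometric mean of TPR and TNR."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present to tune a threshold")
    best_t, best_g = None, -1.0
    # Every distinct score is a candidate cutoff; predict positive when score >= t.
    for t in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        tn = sum(1 for s, l in zip(scores, labels) if s < t and l == 0)
        g = ((tp / n_pos) * (tn / n_neg)) ** 0.5
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g


# Toy validation scores: positives score higher than negatives, yet all are below 0.5.
scores = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.20, 0.30]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
threshold, g_mean = best_threshold(scores, labels)
print(threshold, g_mean)
```

In this toy case a cutoff of 0.5 would classify every instance as negative, while the tuned cutoff separates the two classes. The threshold should be chosen on data held out from the final evaluation.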
Model
Digital Document
Publisher
Florida Atlantic University
Description
The proliferation of Internet of Things (IoT) devices in various networks is being matched by an increase in related cybersecurity risks. To help counter these risks, big datasets such as Bot-IoT were designed to train machine learning algorithms on network-based intrusion detection for IoT devices. From a binary classification perspective, there is a high class imbalance in Bot-IoT between each of the attack categories and the normal category, and also between the combined attack categories and the normal category. Within the scope of predicting botnet attacks in IoT networks, this dissertation demonstrates the usefulness and efficiency of novel machine learning methods, such as an easy-to-classify method and a unique set of ensemble feature selection techniques. The focus of this work is on the full Bot-IoT dataset, as well as each of the four attack categories of Bot-IoT, namely, Denial-of-Service (DoS), Distributed Denial-of-Service (DDoS), Reconnaissance, and Information Theft. Since resources and services become inaccessible during DoS and DDoS attacks, this interruption is costly to an organization in terms of both time and money. Reconnaissance attacks often signify the first stage of a cyberattack, and preventing them from occurring usually means the end of the intended cyberattack. Information Theft attacks not only erode consumer confidence but may also compromise intellectual property and national security. For the DoS experiment, the ensemble feature selection approach led to the best performance, while for the DDoS experiment, the full set of Bot-IoT features resulted in the best performance. In the Reconnaissance experiment, the ensemble feature selection approach yielded the best performance. In the Information Theft experiment, the ensemble feature selection techniques did not affect performance, positively or negatively.
However, the ensemble feature selection approach is recommended for this experiment because feature reduction eases computational burden and may provide clarity through improved data visualization. For the full Bot-IoT big dataset, an explainable machine learning approach was taken using the Decision Tree classifier. An easy-to-learn Decision Tree model for predicting attacks was obtained with only three features, which is a significant result for big data.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Machine learning is having an increased impact on the Cyber Security landscape. The ability of predictive models to accurately identify attack patterns in security data is set to overtake more traditional detection methods. Industry demand has led to an uptick in research on the application of machine learning to Cyber Security. To facilitate this research, many datasets have been created and made public. This thesis provides an in-depth analysis of one of the newest datasets, Bot-IoT. The full dataset contains about 73 million instances (big data), 3 dependent features, and 43 independent features. The purpose of this thesis is to provide researchers with a foundational understanding of Bot-IoT, its development, its features, its composition, and its pitfalls. It also summarizes many of the published works that utilize Bot-IoT and proposes new areas of research based on the issues identified in the current research and in the dataset.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Emergency Management Information Systems (EMIS) are defined as a set of tools that aid decision-makers in risk assessment and response for significant multi-hazard threats and disasters. Over the past three decades, EMIS have grown in importance as a major component for understanding, managing, and governing transportation-related systems. To increase resilience against potential threats, the main goal of EMIS is to make timely use of spatial and network datasets about (1) the locations of hazard areas, (2) shelters and resources, and (3) how to respond to emergencies. The main concern about these datasets has always been their very large size, variety, and the update rate required to ensure the timely delivery of useful emergency information and response for disastrous events. Another key issue is that the information should be concise and easy to understand, but at the same time very descriptive and useful in the case of an emergency or disaster. Advancement in EMIS is urgently needed to develop fundamental data processing components for advanced spatial network queries that clearly and succinctly deliver critical information in emergencies. To address these challenges, we investigate Spatial Network Database Systems and study three challenging Transportation Resilience problems: producing large-scale evacuation plans, identifying major traffic patterns during emergency evacuations, and identifying the areas most in need of resources.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Hospital readmission rates are considered to be an important indicator of quality of care because they may be a consequence of actions of commission or omission made during the initial hospitalization of the patient, or a consequence of a poorly managed transition of the patient back into the community. The negative impact on patient quality of life and the huge burden on the healthcare system have made reducing hospital readmissions a central goal of healthcare delivery and payment reform efforts.
In this study, we propose a framework for deploying readmission analysis and other healthcare models in the real world, together with a machine learning based solution that uses patients' discharge summaries as the dataset for training and testing the model. Current systems do not take Big Data into consideration, which is an important aspect of solving the readmission problem. This study therefore also addresses the Big Data aspect of solutions that can be deployed in the field for real-world use. We used the HPCC compute platform, which provides a distributed parallel programming environment for creating, running, and managing applications that involve large amounts of data. We also propose feature engineering and data balancing techniques that were shown to greatly enhance machine learning model performance by reducing the dimensionality of the data and correcting the imbalance in the dataset.
The system presented in this study provides real-world, machine learning based predictive modeling for reducing readmissions, which could be templatized for other diseases.
Model
Digital Document
Publisher
Florida Atlantic University
Description
The integrity of network communications is constantly being challenged by more sophisticated intrusion techniques. Attackers are shifting to stealthier and more complex forms of attacks in an attempt to bypass known mitigation strategies. Also, many detection methods for popular network attacks have been developed using outdated or non-representative attack data. To effectively develop modern detection methodologies, there exists a need to acquire data that can fully encompass the behaviors of persistent and emerging threats. When collecting modern day network traffic for intrusion detection, substantial amounts of traffic can be collected, much of which consists of relatively few attack instances as compared to normal traffic. This skewed distribution between normal and attack data can lead to high levels of class imbalance. Machine learning techniques can be used to aid in attack detection, but large levels of imbalance between normal (majority) and attack (minority) instances can lead to inaccurate detection results.