Data mining

Model
Digital Document
Publisher
Florida Atlantic University
Description
Health data analysis has emerged as a critical domain with immense potential to revolutionize healthcare delivery, disease management, and medical research. However, it is confronted by formidable challenges, including sample bias, data privacy concerns, and the cost and scarcity of labeled data. These challenges collectively impede the development of accurate and robust machine learning models for various healthcare applications, from disease diagnosis to treatment recommendations.
Sample bias and specificity refer to the inherent challenges in working with health datasets that may not be representative of the broader population or may exhibit disparities in their distributions. These biases can significantly impact the generalizability and effectiveness of machine learning models in healthcare, potentially leading to suboptimal outcomes for certain patient groups. Data privacy and locality are paramount concerns in the era of digital health records and wearable devices. The need to protect sensitive patient information while still extracting valuable insights from these data sources poses a delicate balancing act. Moreover, the geographic and jurisdictional differences in data regulations further complicate the use of health data in a global context. Label cost and scarcity pertain to the often labor-intensive and expensive process of obtaining ground-truth labels for supervised learning tasks in healthcare. The limited availability of labeled data can hinder the development and deployment of machine learning models, particularly in specialized medical domains.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Class imbalance tends to cause inferior performance in data mining learners,
particularly with regard to predicting the minority class, which generally imposes
a higher misclassification cost. This work explores the benefits of using genetic
algorithms (GA) to develop classification models which are better able to deal with
the problems encountered when mining datasets which suffer from class imbalance.
Using GA we evolve configuration parameters suited for skewed datasets for three
different learners: artificial neural networks, C4.5 decision trees, and RIPPER. We
also propose a novel technique called evolutionary sampling which works to remove
noisy and unnecessary duplicate instances so that the sampled training data will
produce a superior classifier for the imbalanced dataset. Our GA fitness function
uses metrics appropriate for dealing with class imbalance, in particular the area
under the ROC curve. We perform extensive empirical testing on these techniques
and compare the results with seven existing sampling methods.
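
The abstract names the area under the ROC curve (AUC) as the fitness metric for class-imbalanced data. As a rough illustration only, and not the authors' implementation, the following sketch computes AUC from classifier scores using the pairwise (Mann-Whitney) formulation. The function name `auc` and the convention that label 1 marks the minority class are assumptions made for this example.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney).

    labels: 1 = minority (positive) class, 0 = majority (negative) class.
    This label convention is an assumption made for this sketch.
    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one instance of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    # Fraction of (positive, negative) pairs that are ranked correctly.
    return wins / (len(pos) * len(neg))
```

A GA could use a value like this as the fitness of a candidate configuration. Unlike plain accuracy, it does not reward a model for simply predicting the majority class everywhere.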
Model
Digital Document
Publisher
Florida Atlantic University
Description
Recent episodes of apparently unjustifiable deaths of minorities caused by police and federal law enforcement agencies have been made more visible by social media and television networks. Such events may seem to imply that racial inequality in America is getting worse. However, we do not know whether such indications are factual: whether this is a recent phenomenon, whether racial inequality is escalating relative to earlier decades, or whether it is better in certain regions of the nation than in others. We have built a semantic engine for querying statistics on various metropolitan areas, based on a database of individual deaths. Separately, we have built a database of demographic data on poverty, income, educational attainment, and crime statistics for the top 25 most populous metropolitan areas. These data will ultimately be combined with government data to evaluate this hypothesis and to provide a tool for predictive analytics. In this thesis, we provide preliminary results in that direction. The methodology consisted of multiple steps. We first described our requirements and drew data from numerous datasets containing information on the 23 most populous Metropolitan Statistical Areas in the United States. After all of the required data was obtained, we decomposed the Metropolitan Statistical Area records into domain components and created an ontology/taxonomy in Protege to establish a hierarchy of nouns, with the aim of identifying significant keywords throughout the datasets to use as search queries. Next, we used a Semantic Web implementation, together with the Python programming language and FuXi, to build and instantiate a vocabulary. The ontology was then parsed for the entered search query and returned the corresponding results, providing semantically organized and relevant output in RDF/XML format.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Classification algorithms represent a rich set of tools that train a classification model from a given training set in order to classify previously unseen test instances. Although existing studies have examined classification algorithm performance with respect to feature selection, noise conditions, and sample distributions, they have not addressed an important issue: how classification algorithm performance responds to feature deletion and addition. In this thesis, we carry out a sensitivity study of classification algorithms using feature deletion and addition. Three types of classifiers, (1) weak classifiers, (2) generic and strong classifiers, and (3) ensemble classifiers, are validated on three types of data: (1) feature dimension data, (2) gene expression data, and (3) biomedical document data. In the experiments, we continuously add redundant features to the training and test sets to observe the performance of the classification algorithms, and also continuously remove features to observe the performance of the underlying classifiers. Our studies yield a number of important findings, which will help the data mining and machine learning community understand the genuine performance of common classification algorithms on real-world data.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Molecular dynamics is a computer simulation technique for expressing the
ultimate details of individual particle motions and can be used in many fields, such as
chemical physics, materials science, and the modeling of biomolecules. In this thesis, we
study visualization and pattern mining in molecular dynamics simulation. The molecular
data set has a large number of atoms in each frame and a range of frames. The features of
the data set include atom ID; frame number; x, y, and z position; charge; and
mass. The three main challenges of this thesis are to display a large number of atoms and
a range of frames, to visualize this large data set in three dimensions, and to cluster the
abnormally shifting atoms that move with the same pace and direction in different frames.
Focusing on these three challenges, there are three contributions of this thesis. First, we
design an abnormal pattern mining and visualization framework for molecular dynamics
simulation. The proposed framework can visualize the clusters of abnormally shifting atom
groups in a three-dimensional space, and show their temporal relationships. Second, we
propose a pattern mining method to detect abnormal atom groups which share similar
movement and have large variance compared to the majority of atoms. Third, we propose a
general molecular dynamics simulation tool, which can visualize a large number of atoms,
including their movement and temporal relationships, to help domain experts study
molecular dynamics simulation results. The main functions for this visualization and
pattern mining tool include atom number, cluster visualization, search across different
frames, multiple frame range search, frame range switch, and line demonstration for atom
motions in different frames. Therefore, this visualization and pattern mining tool can be
used in the fields of chemical physics, materials science, and the modeling of
biomolecules to study molecular dynamics simulation outcomes.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Imbalanced class distributions typically cause poor classifier performance on the minority class, which also tends to be the class with the highest cost of misclassification. Data sampling is a common solution to this problem, and numerous sampling techniques have been proposed to address it. Prior research examining the performance of these techniques has been narrow and limited. This work uses thorough empirical experimentation to compare the performance of seven existing data sampling techniques using five different classifiers and four different datasets. The work addresses which sampling techniques produce the best performance in the presence of class imbalance, which classifiers are most robust to the problem, and which sampling techniques perform better or worse with each classifier. Extensive statistical analysis of these results is provided, in addition to an examination of the qualitative effects of the sampling techniques on the types of predictions made by the C4.5 classifier.
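
Data sampling rebalances the training set before a classifier is trained. The abstract does not list the seven techniques it compares, so the sketch below uses random undersampling purely as a familiar illustration of the idea. It is not presented as one of the methods evaluated in the thesis. The function name `random_undersample` and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class is as small as the rarest class.

    Illustrative only: the technique and all names here are assumptions,
    not taken from the thesis.
    """
    rng = random.Random(seed)
    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    # Every class is reduced to the size of the smallest class.
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        kept = rng.sample(members, target)  # sampling without replacement
        out_rows.extend(kept)
        out_labels.extend([label] * target)
    return out_rows, out_labels


rows = list(range(10))
labels = [1, 1] + [0] * 8          # 2 minority rows, 8 majority rows
balanced_rows, balanced_labels = random_undersample(rows, labels)
print(Counter(balanced_labels))    # both classes now have 2 rows
```

Undersampling discards majority-class data, which is one reason comparisons across several sampling techniques and classifiers, as described above, are informative.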
Model
Digital Document
Publisher
Florida Atlantic University
Description
The security of wireless networks has gained considerable importance due to the rapid proliferation of wireless communications. While computer network heuristics and rules are being used to control and monitor the security of Wireless Local Area Networks (WLANs), mining and learning the behaviors of network users can provide a deeper level of security analysis. The objective and contribution of this thesis is threefold: exploring the security vulnerabilities of the IEEE 802.11 standard for wireless networks; extracting features or metrics, from a security point of view, for modeling network traffic in a WLAN; and proposing a data mining-based approach to intrusion detection in WLANs. A clustering- and expert-based approach to intrusion detection in a wireless network is presented in this thesis. The case study data is obtained from a real-world WLAN and contains over one million records. Given the clusters of network traffic records, a distance-based heuristic measure is proposed for labeling clusters as either normal or intrusive. The empirical results demonstrate the promise of the proposed approach, laying the groundwork for a clustering-based framework for intrusion detection in computer networks.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Increasing aggression through cyberterrorism poses a constant threat to information security in our day-to-day life. Implementing effective intrusion detection systems (IDSs) is an essential task due to the great dependence on networked computers for the operational control of various infrastructures. Building effective IDSs, unfortunately, has remained an elusive goal owing to the great technical challenges involved, and applied data mining techniques are increasingly being utilized in attempts to overcome the difficulties. This thesis presents a comparative study of the traditional "direct" approaches and the recently explored "indirect" approaches to classification, which use class binarization and combiner techniques for intrusion detection. We evaluate and compare the performance of IDSs based on various data mining algorithms, in the context of a well-known network intrusion evaluation data set. It is empirically shown that data mining algorithms, when applied using the indirect classification approach, yield better intrusion detection models.
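
Class binarization splits a multi-class problem into several two-class problems, and a combiner merges the binary outputs into one prediction. The sketch below shows one common form, one-vs-rest with an arg-max combiner, under stated assumptions: the toy centroid-based binary learner, the two-feature points, and the class names `normal`, `dos`, and `probe` are all hypothetical and are not taken from the thesis, which does not specify its binarization or combiner schemes here.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length numeric vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def train_binary(X, is_positive):
    """Toy binary learner (an assumption): one centroid per side."""
    pos = centroid([x for x, flag in zip(X, is_positive) if flag])
    neg = centroid([x for x, flag in zip(X, is_positive) if not flag])
    return pos, neg


def binary_score(model, x):
    """Higher when x is closer to the positive centroid than the negative."""
    pos, neg = model
    return math.dist(x, neg) - math.dist(x, pos)


def train_one_vs_rest(X, y):
    """Class binarization: one 'this class vs. all others' model per class."""
    return {c: train_binary(X, [label == c for label in y])
            for c in sorted(set(y))}


def predict(models, x):
    """Combiner: choose the class whose binary model scores x highest."""
    return max(models, key=lambda c: binary_score(models[c], x))


# Hypothetical two-feature traffic records with made-up class names.
X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
y = ["normal", "normal", "dos", "dos", "probe", "probe"]
models = train_one_vs_rest(X, y)
print(predict(models, [0.2, 0.5]))
```

A "direct" approach would instead train a single multi-class model on `y` as given. The indirect approach lets any two-class learner be reused for a multi-class intrusion problem.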
Model
Digital Document
Publisher
Florida Atlantic University
Description
As network-based computer systems play increasingly vital roles in modern society, they have become the targets of criminals. Network security has never been a more important subject than in today's extensively interconnected computer world. Intrusion detection systems (IDSs) have been used together with data mining techniques to detect intrusions. In this thesis, we present a comparative study of intrusion detection using a decision-tree learner (C4.5), two rule-based learners (RIPPER and Ridor), a learner that combines decision trees and rules (PART), and two instance-based learners (IBk and NNge). We investigate and compare the performance of IDSs based on the six techniques, with respect to a case study of the DARPA KDD-1999 network intrusion detection project. The results demonstrate that data mining techniques are very useful in the area of intrusion detection.
Model
Digital Document
Publisher
Florida Atlantic University
Description
This thesis addresses the use of a binary representation of DNA for developing useful algorithms for bioinformatics. Pertinent studies address the use of a binary form of the DNA bases on an information-theoretic basis in order to identify symmetry between DNA and complementary DNA (cDNA). This study also addresses "fuzzy" (codon-noncodon) considerations in delineating codon and noncodon regimes in a DNA sequence. The research further includes a comparative analysis of the test results from these efforts using different statistical metrics, such as the Hamming distance and the Kullback-Leibler measure. The observed details support the symmetry between DNA and cDNA strands. They also demonstrate the capability to identify noncodon regions in DNA even under diffused (overlapped) fuzzy states.
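
To make the metrics named above concrete, here is a minimal sketch of a binary DNA encoding together with the Hamming distance and the Kullback-Leibler measure. The abstract does not give the thesis's actual encoding, so the 2-bit mapping (A=00, C=01, G=10, T=11), the use of base-frequency distributions for the Kullback-Leibler measure, the smoothing constant `eps`, and all function names are assumptions for illustration.

```python
import math
from collections import Counter

# Assumed 2-bit mapping. With it, complementing a base flips both bits.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def to_bits(seq):
    """Encode a DNA string as a string of '0'/'1' characters."""
    return "".join(CODE[base] for base in seq)


def complement(seq):
    """Base-wise complement of a DNA string (no reversal)."""
    return "".join(COMPLEMENT[base] for base in seq)


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def kl_divergence(p_seq, q_seq, eps=1e-9):
    """D(P||Q) in bits between the base-frequency distributions of two
    sequences; eps smoothing avoids division by zero for absent bases."""
    p_counts, q_counts = Counter(p_seq), Counter(q_seq)
    total = 0.0
    for base in "ACGT":
        p = (p_counts[base] + eps) / (len(p_seq) + 4 * eps)
        q = (q_counts[base] + eps) / (len(q_seq) + 4 * eps)
        total += p * math.log2(p / q)
    return total


strand = "ACGT"
print(to_bits(strand))                                        # 00011011
print(hamming(to_bits(strand), to_bits(complement(strand))))  # 8
```

Under this assumed mapping, a strand and its base-wise complement differ in every bit, which is one simple way a binary form can expose the relationship between the two strands.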