Data mining

Model
Digital Document
Publisher
Florida Atlantic University
Description
Network security is an important subject in today's extensively interconnected computer world. Industry, academic institutions, small and large businesses, and even residences are now at great risk from the increasing onslaught of computer attacks. Such malicious efforts cause damage ranging from violations of confidentiality and privacy to actual financial loss when business operations are compromised, or even loss of human life in the case of mission-critical networked computer applications. Intrusion Detection Systems (IDS), aided by data mining modeling efforts, have been used to detect intruders, yet with limited organizational resources it is unreasonable to inspect every network alarm raised by an IDS. The Modified Expected Cost of Misclassification (MECM) is a model selection measure that is both resource-aware and cost-sensitive, and it has proven effective for identifying the best resource-based intrusion detection model.
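As a rough illustration of the idea behind a cost-sensitive, resource-aware selection measure, the sketch below scores candidate detectors by a misclassification cost plus a penalty for raising more alarms than analysts can inspect. The cost weights, the budget penalty, and the toy predictions are assumptions for illustration only; this is not the dissertation's MECM formula.

    # Illustrative sketch of cost-sensitive, resource-aware model selection in the
    # spirit of MECM. Cost weights and the inspection-budget penalty are assumed.
    import numpy as np

    def expected_cost(y_true, y_pred, c_fn=10.0, c_fp=1.0):
        """Per-example expected cost of misclassification."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        fn = np.sum((y_true == 1) & (y_pred == 0))  # missed intrusions
        fp = np.sum((y_true == 0) & (y_pred == 1))  # false alarms
        return (c_fn * fn + c_fp * fp) / len(y_true)

    def resource_aware_cost(y_true, y_pred, budget, c_fn=10.0, c_fp=1.0, penalty=5.0):
        """Add a penalty when more alarms are raised than the inspection budget allows."""
        alarms = int(np.sum(np.asarray(y_pred) == 1))
        over_budget = max(0, alarms - budget)
        return expected_cost(y_true, y_pred, c_fn, c_fp) + penalty * over_budget / len(y_true)

    # Pick the candidate model with the lowest resource-aware cost.
    y_true = [0, 0, 1, 1, 0, 1, 0, 0]
    candidates = {
        "model_a": [0, 1, 1, 0, 0, 1, 0, 1],
        "model_b": [0, 0, 1, 1, 0, 0, 0, 0],
    }
    best = min(candidates, key=lambda m: resource_aware_cost(y_true, candidates[m], budget=2))
    print("selected:", best)
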
Model
Digital Document
Publisher
Florida Atlantic University
Description
This study applies data warehousing and data mining techniques to analyze the business problem of customer churn in the wireless industry. Customer churn driven by new industry regulations has hit the wireless industry hard. The study uses data warehousing and data mining to model the customer database, predict churn rates, and suggest timely recommendations to increase customer retention and thereby increase overall profitability. The Naive Bayes supervised learning algorithm was used for the predictive data modeling. The data set used in the study consists of one hundred thousand real wireless customers. The study uses database tools such as Oracle Database with its data mining option and JDeveloper to implement the models. The data model developed on the calibration data was used to predict churn for the wireless customers, along with the predictive accuracy and probabilities of the results.
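A minimal sketch of Naive Bayes churn prediction is shown below, assuming numeric usage features and using scikit-learn on synthetic data as a stand-in; the study itself used Oracle's data mining tooling on a real customer data set, and the churn rule here exists only to generate demo labels.

    # Minimal Naive Bayes churn sketch on synthetic customer features.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    n = 1000
    X = np.column_stack([
        rng.normal(300, 80, n),   # monthly minutes
        rng.poisson(3, n),        # customer-service calls
        rng.integers(1, 60, n),   # months of tenure
    ])
    # Hypothetical churn rule used only to label the synthetic data.
    y = ((X[:, 1] > 4) | (X[:, 2] < 6)).astype(int)

    X_cal, X_test, y_cal, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = GaussianNB().fit(X_cal, y_cal)        # fit on calibration data
    proba = model.predict_proba(X_test)[:, 1]     # churn probability per customer
    print("accuracy:", round(model.score(X_test, y_test), 3))
    print("first five churn probabilities:", proba[:5].round(3))
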
Model
Digital Document
Publisher
Florida Atlantic University
Description
An unsupervised learning algorithm for application-level intrusion detection, named the Graph Sequence Learning Algorithm (GSLA), is proposed in this dissertation, and experiments demonstrate its effectiveness. As in most intrusion detection algorithms, GSLA must first learn a normal profile. The normal profile is built using a session learning method combined with one-way Analysis of Variance (ANOVA) to determine the value of an anomaly threshold. In the proposed approach, a hash table stores the sparse data matrix collected from a web transition log in triple format rather than as an n-by-n matrix. Furthermore, in GSLA, the sequence learning matrix can be changed dynamically according to the volume of the data set. The approach is therefore more efficient, easier to manipulate, and saves memory space. To validate the effectiveness of the algorithm, extensive simulations were conducted by applying GSLA to the homework submission system of our computer science and engineering department. The performance of GSLA is evaluated and compared with the traditional Markov Model (MM) and K-means algorithms. Specifically, three major experiments were done: (1) a small data set is collected as sample data and applied to the GSLA, MM, and K-means algorithms to illustrate the operation of the proposed algorithm and demonstrate the detection of abnormal behaviors; (2) the Random Walk-Through sampling method is used to generate a larger sample data set, and the resulting anomaly scores are grouped into several clusters to visualize and contrast normal and abnormal behaviors with the K-means and GSLA algorithms; and (3) multiple professors' data sets are collected and used to build normal profiles, and ANOVA is used to test for significant differences among the professors' normal profiles. The GSLA algorithm can be implemented as a module and plugged into an IDS as an anomaly detection component.
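The sketch below illustrates two ideas from the abstract under simplifying assumptions: storing the sparse transition matrix as a hash table of (page, page) -> count triples built from session sequences, and applying one-way ANOVA to groups of session scores. The scoring function and toy sessions are illustrative, not the actual GSLA procedure.

    # Hash-table (triple format) storage of a sparse transition matrix plus a
    # one-way ANOVA comparison of session-score groups; toy data only.
    from collections import defaultdict
    from scipy.stats import f_oneway

    sessions = [["login", "list", "submit"],
                ["login", "list", "view", "submit"],
                ["login", "admin", "grades"]]

    counts = defaultdict(int)            # sparse matrix stored as (row, col) -> count
    for s in sessions:
        for a, b in zip(s, s[1:]):
            counts[(a, b)] += 1

    total = sum(counts.values())
    def session_score(s):
        """Mean transition frequency of a session under the learned profile."""
        pairs = list(zip(s, s[1:]))
        return sum(counts.get(p, 0) for p in pairs) / (total * len(pairs))

    # Compare score groups (e.g., per user or per time window) with one-way ANOVA.
    group_a = [session_score(s) for s in sessions[:2]]
    group_b = [session_score(s) for s in sessions[2:]] + [0.0]  # padded toy group
    print(f_oneway(group_a, group_b))
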
Model
Digital Document
Publisher
Florida Atlantic University
Description
This research addresses information-theoretic (IT) aspects of data-sequence patterns and the development of discriminant algorithms that distinguish features of underlying sequence patterns having characteristic, inherent stochastic attributes. The application potential of such algorithms includes bioinformatic data mining efforts. Consistent with this scope, the research considers specific details of information-theoretics and entropy considerations vis-a-vis sequence patterns (having stochastic attributes) such as the DNA sequences of molecular biology. Applying information-theoretic concepts (essentially in Shannon's sense), the following distinct sets of metrics are developed and applied in the algorithms designed for data-sequence pattern-discrimination applications: (i) divergence or cross-entropy algorithms of the Kullback-Leibler type and of the general Csiszár class; (ii) statistical distance measures; (iii) ratio-metrics; (iv) a Fisher-type linear-discriminant measure; and (v) a complexity metric based on information redundancy. These measures are judiciously adopted in ascertaining codon-noncodon delineations in DNA sequences that consist of crisp and/or fuzzy nucleotide domains across their chains. The Fisher measure is also used in codon-noncodon delineation and in motif detection. The relevant algorithms are used to test DNA sequences of human and some bacterial organisms. The relative efficacy of the metrics and the algorithms is determined and discussed, and their potential to supplement the prevailing methods is indicated. Scope for future studies is identified in terms of persisting open questions.
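As a small worked example of the Kullback-Leibler style divergence named in (i), the sketch below compares nucleotide frequency profiles of two short DNA segments. The sequences and the simple symbol-frequency model are assumptions; the dissertation's metric suite is considerably broader.

    # Kullback-Leibler divergence between nucleotide frequency profiles.
    import math
    from collections import Counter

    def freqs(seq, alphabet="ACGT", eps=1e-9):
        """Symbol frequencies with a small floor to avoid log(0)."""
        c, n = Counter(seq), len(seq)
        return [max(c[a] / n, eps) for a in alphabet]

    def kl_divergence(p, q):
        """D(P || Q) in bits."""
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

    coding    = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"   # illustrative segments
    noncoding = "TTTTTTAAAAAATTTTAAAATATATATATTTTAAAATTT"
    p, q = freqs(coding), freqs(noncoding)
    print("D(coding || noncoding):", round(kl_divergence(p, q), 4))
    print("symmetrized:", round(0.5 * (kl_divergence(p, q) + kl_divergence(q, p)), 4))
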
Model
Digital Document
Publisher
Florida Atlantic University
Description
Machine learning techniques allow useful insight to be distilled from the increasingly massive repositories of data being stored. As these data mining techniques can only learn patterns actually present in the data, it is important that the desired knowledge be faithfully and discernibly contained therein. Two common data quality issues that often affect important real-life classification applications are class noise and class imbalance. Class noise, where dependent attribute values are recorded erroneously, misleads a classifier and reduces predictive performance. Class imbalance occurs when one class represents only a small portion of the examples in a dataset, and, in such cases, classifiers often display poor accuracy on the minority class. The reduction in classification performance becomes even worse when the two issues occur simultaneously. To address the magnified difficulty caused by this interaction, this dissertation performs thorough empirical investigations of several techniques for dealing with class noise and imbalanced data. Comprehensive experiments assess the effects of these techniques on classifier performance, as well as how the level of class imbalance, the level of class noise, and the distribution of class noise among the classes affect the results. An empirical analysis of classifier-based noise detection efficiency appears first. Subsequently, an intelligent data sampling technique based on noise detection is proposed and tested. Several hybrid classifier ensemble techniques for addressing class noise and imbalance are introduced. Finally, a detailed empirical investigation of classification filtering is performed to determine best practices.
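A minimal sketch of one such idea, a classification filter for noise detection followed by removal of flagged majority-class examples, appears below. The learner, the 5-fold filtering scheme, and the synthetic imbalanced data are illustrative choices, not the dissertation's exact procedure.

    # Classification-filter noise detection plus noise-aware undersampling sketch.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_predict

    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

    # Flag examples whose cross-validated prediction disagrees with their label.
    filter_clf = RandomForestClassifier(n_estimators=100, random_state=0)
    pred = cross_val_predict(filter_clf, X, y, cv=5)
    suspect = pred != y

    # Keep all minority examples; drop only suspect majority-class examples.
    keep = ~(suspect & (y == 0))
    X_clean, y_clean = X[keep], y[keep]
    print("removed", int((~keep).sum()), "suspect majority examples")

    final_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_clean, y_clean)
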
Model
Digital Document
Publisher
Florida Atlantic University
Description
An ocean turbine extracts the kinetic energy from ocean currents to generate electricity. Machine Condition Monitoring (MCM) / Prognostic Health Monitoring (PHM) systems allow for self-checking and automated fault detection, and are integral to the construction of a highly reliable ocean turbine. MCM/PHM systems enable real-time health assessment, prognostics, and advisory generation by interpreting data from sensors installed on the machine being monitored. To effectively utilize sensor readings for determining the health of individual components, macro-components, and the overall system, these measurements must somehow be combined or integrated to form a holistic picture. The process used to perform this combination is called data fusion. Data mining and machine learning techniques allow for the analysis of these sensor signals, any maintenance history, and other available information (such as expert knowledge) to automate decision making and other processes within MCM/PHM systems. ... This dissertation proposes an MCM/PHM software architecture employing those techniques which the experiments determined to be ideal for this application. Our work also offers a data fusion framework applicable to ocean machinery MCM/PHM. Finally, it presents a software tool for monitoring ocean turbines and other submerged vessels, implemented according to industry standards.
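As a toy illustration of sensor-level data fusion into a single health indicator, the sketch below combines deviation scores from a few hypothetical turbine sensors with a weighted average. The sensor names, operating bands, and weights are assumptions; the proposed fusion framework is more involved.

    # Weighted-average fusion of per-sensor deviation scores into a health score.
    def fuse_health(readings, baselines, weights):
        """Weighted average of per-sensor scores in [0, 1] (1 = healthy)."""
        scores = []
        for name, value in readings.items():
            lo, hi = baselines[name]  # expected operating band for this sensor
            dev = 0.0 if lo <= value <= hi else min(abs(value - lo), abs(value - hi)) / (hi - lo)
            scores.append(weights[name] * max(0.0, 1.0 - dev))
        return sum(scores) / sum(weights.values())

    readings  = {"vibration_rms": 0.9, "bearing_temp_C": 61.0, "shaft_rpm": 118.0}
    baselines = {"vibration_rms": (0.0, 0.8), "bearing_temp_C": (20.0, 70.0), "shaft_rpm": (100.0, 130.0)}
    weights   = {"vibration_rms": 2.0, "bearing_temp_C": 1.0, "shaft_rpm": 1.0}
    print("health score:", round(fuse_health(readings, baselines, weights), 3))
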
Model
Digital Document
Publisher
Florida Atlantic University
Description
Asset management is a time-consuming and error-prone process. Information Technology (IT) personnel typically perform this task manually, visually inspecting assets to identify those that are misplaced. Automating this process would prove very useful to IT personnel for keeping track of assets in a server rack. A mobile-based solution is proposed to automate this process. The asset management application on the tablet captures images of assets and searches an annotated database to identify each asset. We evaluate the matching performance and speed of asset matching using three different image feature descriptors. Methods to reduce feature extraction and matching complexity were developed, performance and accuracy tradeoffs were studied, domain-specific problems were identified, and optimizations for mobile platforms were made. The results show that the proposed methods reduce the complexity of asset matching by 67% compared to the matching process using unmodified image feature descriptors.
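A minimal sketch of descriptor-based image matching with OpenCV appears below, using ORB as one illustrative descriptor (the thesis compares three) and synthetic images as stand-ins for the tablet photo and the annotated database image; the ratio-test threshold is also an assumption.

    # Descriptor-based matching sketch: ORB features + Hamming-distance matching.
    import cv2
    import numpy as np

    # Synthetic stand-ins; in practice the query would be a tablet photo of an
    # asset and the reference an image from the annotated database.
    rng = np.random.default_rng(0)
    reference = rng.integers(0, 256, (480, 640), dtype=np.uint8)
    query = reference[40:440, 60:560].copy()   # simulated partial view of the same asset

    orb = cv2.ORB_create(nfeatures=500)
    kp_q, des_q = orb.detectAndCompute(query, None)
    kp_r, des_r = orb.detectAndCompute(reference, None)

    # Hamming distance suits ORB's binary descriptors; Lowe's ratio test filters
    # ambiguous correspondences before counting matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    matches = matcher.knnMatch(des_q, des_r, k=2)
    good = []
    for pair in matches:
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])
    print("good matches:", len(good))
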
Model
Digital Document
Publisher
Florida Atlantic University
Description
As computing technology continues to advance, it has become increasingly difficult to find businesses that do not rely, at least in part, upon the collection and analysis of data for the purpose of project management and process improvement. The cost of software tends to increase over time due to its complexity and the cost of employing humans to develop, maintain, and evolve it. To help control the costs, organizations often seek to improve the process by which software systems are developed and evolved. Improvements can be realized by discovering previously unknown or hidden relationships between the artifacts generated as a result of developing a software system. The objective of the work described in this thesis is to provide a visualization tool that helps managers and engineers better plan for future projects by discovering new knowledge gained by synthesizing and visualizing data mined from software repository records from previous projects.
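As a small illustration of mining one kind of relationship from repository records, the sketch below counts files that change together in the same commit, a signal a visualization tool could render as dependency edges; the commit records and file names are hypothetical, and the thesis's tool covers richer artifact relationships.

    # Co-change mining sketch over hypothetical commit records.
    from collections import Counter
    from itertools import combinations

    commits = [  # hypothetical mined records: commit id and files touched
        {"id": "a1", "files": ["parser.c", "parser.h", "tests/parser_test.c"]},
        {"id": "b2", "files": ["parser.c", "lexer.c"]},
        {"id": "c3", "files": ["parser.c", "parser.h"]},
    ]

    co_change = Counter()
    for c in commits:
        for pair in combinations(sorted(c["files"]), 2):
            co_change[pair] += 1

    # The most frequently co-changed pairs are candidates for dependency edges
    # in the visualization.
    for pair, count in co_change.most_common(3):
        print(count, pair)
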
Model
Digital Document
Publisher
Florida Atlantic University
Description
Possibly the largest problem when working in bioinformatics is the large amount of data that must be sifted through to find useful information. This thesis shows that feature selection (a method of removing irrelevant and redundant information from a dataset) is a useful and even necessary technique for these large datasets. It also presents a new method of comparing classes to each other through the use of their features, and provides a thorough analysis of various feature selection techniques and classifiers in different bioinformatics scenarios. Overall, this thesis shows the importance of feature selection in bioinformatics.
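A minimal sketch of filter-based feature selection on high-dimensional data is shown below, assuming scikit-learn; the ANOVA F-score ranker, the number of retained features, and the synthetic data are illustrative choices rather than the thesis's setup.

    # Filter-based feature selection on high-dimensional synthetic data.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import make_pipeline

    # Many features, few informative ones, as in gene expression problems.
    X, y = make_classification(n_samples=200, n_features=2000, n_informative=20, random_state=0)

    pipe = make_pipeline(SelectKBest(f_classif, k=50), GaussianNB())
    print("CV accuracy with 50 selected features:",
          cross_val_score(pipe, X, y, cv=5).mean().round(3))
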
Model
Digital Document
Publisher
Florida Atlantic University
Description
One of the greatest challenges in data mining is erroneous or noisy data. Several studies have noted the weak performance of classification models trained from low-quality data. This dissertation shows that low-quality data can also impact the effectiveness of feature selection, and considers the effect of class noise on various feature ranking techniques. It presents a novel approach to feature ranking based on ensemble learning and assesses these ensemble feature selection techniques in terms of their robustness to class noise. It presents a noise-based stability analysis that measures the degree of agreement between a feature ranking technique's output on a clean dataset and its outputs on the same dataset corrupted with different combinations of noise level and noise distribution. It then considers the classification performance of models built with a subset of the original features obtained by applying feature ranking techniques to noisy data. It proposes focused ensemble feature ranking as a noise-tolerant approach to feature selection and compares focused ensembles with general ensembles in terms of the ability of the selected features to withstand the impact of class noise when used to build classification models. Finally, it explores three approaches for addressing the combined problem of high dimensionality and class imbalance. Collectively, this research shows the importance of considering class noise when performing feature selection.
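The sketch below illustrates, under simplifying assumptions, an ensemble feature ranking built from bootstrap samples and a noise-based stability check that compares rankings obtained from clean and noise-injected labels; the base ranker, the 10% noise level, and the Kendall tau agreement statistic are illustrative stand-ins, not the dissertation's configuration.

    # Ensemble feature ranking plus a clean-vs-noisy rank agreement check.
    import numpy as np
    from scipy.stats import kendalltau
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import f_classif

    def ensemble_ranking(X, y, n_members=10, seed=0):
        """Average the per-feature ranks obtained on bootstrap samples."""
        rng = np.random.default_rng(seed)
        ranks = []
        for _ in range(n_members):
            idx = rng.choice(len(y), size=len(y), replace=True)   # bootstrap sample
            scores, _ = f_classif(X[idx], y[idx])
            ranks.append(np.argsort(np.argsort(-scores)))         # rank 0 = best feature
        return np.mean(ranks, axis=0)

    X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)

    rng = np.random.default_rng(1)
    y_noisy = y.copy()
    flip = rng.random(len(y)) < 0.10          # inject 10% class noise
    y_noisy[flip] = 1 - y_noisy[flip]

    clean_rank = ensemble_ranking(X, y)
    noisy_rank = ensemble_ranking(X, y_noisy)
    print("rank agreement (Kendall tau):", round(kendalltau(clean_rank, noisy_rank)[0], 3))
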