Data structures (Computer science)

Model
Digital Document
Publisher
Florida Atlantic University
Description
Skewed or imbalanced data presents a significant problem for many standard learners, which focus on optimizing overall classification accuracy. When the class distribution is skewed, priority is given to classifying examples from the majority class, at the expense of the often more important minority class. The random forest (RF) classification algorithm, a relatively new learner with appealing theoretical properties, has received almost no attention in the context of skewed datasets. This work presents a comprehensive suite of experiments evaluating the effectiveness of random forests for learning from imbalanced data. Reasonable parameter settings (for the Weka implementation) for ensemble size and number of random features selected are determined through experimentation on 10 datasets. Further, the application of seven different data sampling techniques, which are common methods for handling imbalanced data, in conjunction with RF is also assessed. Finally, RF is benchmarked against 10 other commonly used machine learning algorithms and is shown to provide very strong performance. A total of 35 imbalanced datasets are used, and over one million classifiers are constructed in this work.
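Of the data sampling techniques commonly paired with a learner like RF, the simplest is random undersampling: majority-class examples are discarded until the class distribution is balanced. The sketch below is a minimal stdlib-only illustration of that idea (the function name and 1:1 target ratio are assumptions for the example, not the exact procedure used in this work):

```python
import random

def random_undersample(X, y, majority_label, seed=0):
    """Randomly discard majority-class examples until classes are balanced.

    X: list of feature vectors; y: parallel list of class labels.
    A minimal sketch of random undersampling, one common technique for
    handling class imbalance before training a classifier such as RF.
    """
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == majority_label]
    minority = [i for i, label in enumerate(y) if label != majority_label]
    # Keep only as many majority examples as there are minority examples.
    kept = rng.sample(majority, len(minority))
    idx = sorted(kept + minority)
    return [X[i] for i in idx], [y[i] for i in idx]

# Example: a 9:1 imbalanced dataset reduced to a 1:1 distribution.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10
Xs, ys = random_undersample(X, y, majority_label=0)
assert ys.count(0) == ys.count(1) == 10
```

The sampled training set would then be handed to the classifier in place of the original skewed data.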
Model
Digital Document
Publisher
Florida Atlantic University
Description
Class imbalance tends to cause inferior performance in data mining learners,
particularly with regard to predicting the minority class, which generally imposes
a higher misclassification cost. This work explores the benefits of using genetic
algorithms (GA) to develop classification models which are better able to deal with
the problems encountered when mining datasets which suffer from class imbalance.
Using GA we evolve configuration parameters suited for skewed datasets for three
different learners: artificial neural networks, C4.5 decision trees, and RIPPER. We
also propose a novel technique called evolutionary sampling which works to remove
noisy and unnecessary duplicate instances so that the sampled training data will
produce a superior classifier for the imbalanced dataset. Our GA fitness function
uses metrics appropriate for dealing with class imbalance, in particular the area
under the ROC curve. We perform extensive empirical testing on these techniques
and compare the results with seven existing sampling methods.
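The fitness metric named above, area under the ROC curve, can be computed without plotting the curve at all: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counting one half). The sketch below is a minimal stdlib-only illustration of that rank-based formulation; the function name is an assumption for the example, not code from this work:

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation.

    AUC = P(score of a random positive > score of a random negative),
    with ties contributing 0.5. Suitable as a GA fitness value because,
    unlike accuracy, it is insensitive to the class distribution.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranking of positives above negatives yields AUC = 1.0,
# regardless of how imbalanced the two classes are.
assert auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]) == 1.0
assert auc([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0]) == 0.0
```

A GA fitness function for class imbalance can then simply score each candidate (a parameter configuration or a sampled training set) by the AUC its resulting classifier achieves on validation data.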