Electronic data processing--Quality control

Model
Digital Document
Publisher
Florida Atlantic University
Description
With advances in data storage and transmission technologies, and given
the growing use of computers by both individuals and corporations, organizations
are accumulating ever-larger amounts of information in data warehouses and
databases. The huge surge in data, however, has made the process of extracting useful,
actionable, and interesting knowledge from the data extremely difficult. In response
to the challenges posed by operating in a data-intensive environment, the fields of data
mining and machine learning (DM/ML) have successfully provided solutions to help
uncover knowledge buried within data.
DM/ML techniques use automated (or semi-automated) procedures to process
vast quantities of data in search of interesting patterns. These techniques do not
create knowledge; rather, the implicit assumption is that knowledge is already present within
the data, and such procedures are needed to uncover interesting, important, and previously
unknown relationships. Therefore, the quality of the data is absolutely critical
in ensuring successful analysis. Having high-quality data, i.e., data that is (relatively)
free from errors and suitable for use in data mining tasks, is a necessary precondition
for extracting useful knowledge.
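
One widely used family of approaches to this data-quality problem is noise filtering, in which instances whose labels conflict with a learner's cross-validated predictions are flagged as potentially mislabeled. The following is a minimal, illustrative sketch of that idea in Python with scikit-learn; the synthetic dataset, the random forest filter, and the 5% injected noise rate are assumptions chosen for demonstration, not the specific procedures proposed in the dissertation.

# Minimal sketch of a classification-filter style noise check
# (illustrative only; not the dissertation's specific procedure).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic binary dataset standing in for real warehouse data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Simulate low-quality data by flipping 5% of the class labels.
rng = np.random.default_rng(0)
noisy_idx = rng.choice(len(y), size=int(0.05 * len(y)), replace=False)
y_noisy = y.copy()
y_noisy[noisy_idx] = 1 - y_noisy[noisy_idx]

# Flag instances whose label disagrees with the out-of-fold prediction.
preds = cross_val_predict(RandomForestClassifier(random_state=0), X, y_noisy, cv=5)
flagged = np.where(preds != y_noisy)[0]
print(f"Flagged {len(flagged)} suspect instances; "
      f"{np.isin(flagged, noisy_idx).sum()} of them were truly mislabeled.")
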
Given the important role played by data quality, this dissertation investigates
data quality and its impact on DM/ML. First, we propose several innovative
procedures for coping with low-quality data. Another aspect of data quality, the occurrence
of missing values, is also explored. Finally, a detailed experimental evaluation
on learning from noisy and imbalanced datasets is provided, supplying valuable insight
into how class noise in skewed datasets affects learning algorithms.
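
As an illustration of the kind of experiment summarized above, the sketch below injects increasing levels of class noise into the training portion of a skewed synthetic dataset and tracks how a learner's test AUC degrades. The 5% minority-class proportion, the noise rates, and the random forest learner are assumptions for demonstration, not the dissertation's actual experimental design.

# Illustrative sketch: effect of class noise on learning from an
# imbalanced dataset (assumed parameters, not the original study).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Skewed dataset: roughly 5% positive (minority) class.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rng = np.random.default_rng(0)
for noise_rate in (0.0, 0.1, 0.2, 0.3):
    # Flip training labels independently at the given rate to simulate class noise.
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_noisy)) < noise_rate
    y_noisy[flip] = 1 - y_noisy[flip]

    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_noisy)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"noise rate {noise_rate:.1f} -> test AUC {auc:.3f}")

AUC is used as the evaluation measure here because overall accuracy can look deceptively high on skewed data, where predicting the majority class for every instance is already "accurate."
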