Data mining--Quality control

Model
Digital Document
Publisher
Florida Atlantic University
Description
The software development process is an incremental and iterative activity.
Source code is constantly altered to reflect changing requirements, to respond to
testing results, and to address problem reports. Proper software measurement that
derives meaningful numeric values for some attributes of a software product or
process can help in identifying problem areas and development bottlenecks. Impact
analysis is the evaluation of the risks associated with change requests or problem
reports, including estimates of effects on resources, effort, and schedule. This thesis
presents a methodology called VITA for applying software analysis techniques to
configuration management repository data with the aim of identifying the impact
of change requests and problem reports on file changes. The repository data can be
analyzed and visualized in a semi-automated manner according to user-selectable
criteria. The approach is illustrated with a model problem concerning software
process improvement of an embedded software system in the context of performing
high-quality software maintenance.
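The abstract does not specify VITA's internal analysis, but the core idea of linking change requests and problem reports to affected files can be sketched as a simple co-occurrence count over commit records. The record layout, request IDs, and file names below are hypothetical, illustrative only:

```python
from collections import defaultdict

# Illustrative sketch, not VITA itself: assume each commit record pairs a
# change-request or problem-report ID with the files it touched, and count
# how often each file is changed per request -- a crude proxy for impact.
commits = [
    {"request": "CR-12", "files": ["boot.c", "timer.c"]},
    {"request": "CR-12", "files": ["timer.c"]},
    {"request": "PR-7",  "files": ["uart.c", "timer.c"]},
]

def impact_by_request(commits):
    """Map each change request / problem report to per-file change counts."""
    impact = defaultdict(lambda: defaultdict(int))
    for c in commits:
        for f in c["files"]:
            impact[c["request"]][f] += 1
    return {req: dict(files) for req, files in impact.items()}

print(impact_by_request(commits))
```

Files with high counts across many requests would surface as maintenance hot spots, the kind of problem area the abstract says proper software measurement helps identify.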
Model
Digital Document
Publisher
Florida Atlantic University
Description
With advances in data storage and data transmission technologies, and given
the increasing use of computers by both individuals and corporations, organizations
are accumulating an ever-increasing amount of information in data warehouses and
databases. The huge surge in data, however, has made the process of extracting useful,
actionable, and interesting knowledge from the data extremely difficult. In response
to the challenges posed by operating in a data-intensive environment, the fields of data
mining and machine learning (DM/ML) have successfully provided solutions to help
uncover knowledge buried within data.
DM/ML techniques use automated (or semi-automated) procedures to process
vast quantities of data in search of interesting patterns. DM/ML techniques do not
create knowledge; instead, the implicit assumption is that knowledge is present within
the data, and these procedures are needed to uncover interesting, important, and previously
unknown relationships. Therefore, the quality of the data is absolutely critical
in ensuring successful analysis. Having high-quality data, i.e., data that is (relatively)
free from errors and suitable for use in data mining tasks, is a necessary precondition
for extracting useful knowledge.
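One minimal form of the quality screening this precondition implies is checking how complete each attribute is before mining. The dataset, field names, and threshold below are hypothetical, for illustration only:

```python
# Minimal data-quality screen, assuming a tabular dataset where None marks a
# missing value. The 25% threshold and the fields are arbitrary assumptions.
def missing_rate(rows, field):
    """Fraction of rows in which `field` is missing (None)."""
    missing = sum(1 for r in rows if r.get(field) is None)
    return missing / len(rows)

rows = [
    {"age": 34,   "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29,   "income": None},
    {"age": None, "income": 61000},
]

# Fields exceeding the threshold may be unsuitable for mining as-is.
suspect = [f for f in ("age", "income") if missing_rate(rows, f) > 0.25]
print(suspect)  # "age" is missing in 2 of 4 rows
```

A screen like this only flags one symptom of low quality; errors in recorded values, as opposed to absent ones, require different treatment.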
Given the important role played by data quality, this dissertation investigates
data quality and its impact on DM/ML. First, we propose several innovative
procedures for coping with low-quality data. Another aspect of data quality, the occurrence
of missing values, is also explored. Finally, a detailed experimental evaluation
on learning from noisy and imbalanced datasets is provided, supplying valuable insight
into how class noise in skewed datasets affects learning algorithms.