PREDICTING MELANOMA RISK FROM ELECTRONIC HEALTH RECORDS WITH MACHINE LEARNING TECHNIQUES

Publisher
Florida Atlantic University
Date Issued
2019
EDTF Date Created
2019
Description
Melanoma is one of the fastest-growing cancers in the world, and it can affect patients earlier in life than most other cancers. It is therefore imperative to identify patients at high risk for melanoma and enroll them in screening programs that detect the cancer early. Electronic health records collect an enormous amount of data about real-world patient encounters, treatments, and outcomes. These data can be mined to improve our understanding of melanoma and to build personalized models that predict the risk of developing the cancer. Cancer risk models built from structured clinical data are rare in current research, and most studies involve just a few variables from institutional databases or registries. This dissertation presents data processing and machine learning approaches to build melanoma risk models from a large database of de-identified electronic health records. The database contains consistently captured structured data, enabling the extraction of hundreds of thousands of data points from each of millions of patient records. Several experiments are performed to build effective models, particularly to predict sentinel lymph node metastasis in known melanoma patients and to predict individual risk of developing melanoma. Data for these models suffer from high dimensionality and class imbalance. Thus, classifiers such as logistic regression, support vector machines, random forest, and XGBoost are combined with advanced modeling techniques such as feature selection and data sampling. Risk factors are evaluated using regression model weights and decision trees, while personalized predictions are provided through random forest decomposition and Shapley additive explanations. Random undersampling on the melanoma risk dataset shows that many majority-class samples can be removed without a decrease in model performance.
To determine how much data is truly needed, we explore learning curve approximation methods on the melanoma data and three publicly available large-scale biomedical datasets. We apply an inverse power law model and introduce a novel semi-supervised curve creation method that uses a small amount of labeled data.
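The inverse power law learning curve mentioned in the description can be illustrated with a minimal sketch. It assumes the common form score(n) = a - b * n^(-c), where n is the training set size; the dissertation's exact parameterization and fitting procedure are not given in this record, so the function names, the grid-scan fitting approach, and the sample numbers below are all hypothetical. The sketch does not attempt the semi-supervised curve creation method.

```python
import math

def fit_inverse_power_law(sizes, scores, upper=1.0, step=0.001):
    """Fit score(n) = a - b * n**(-c) by scanning candidate asymptotes a.

    For a fixed a, log(a - score) is linear in log(n), so b and c follow
    from ordinary least squares on the log-transformed points. The
    candidate with the smallest squared error on the original scale wins.
    This is an illustrative fitting strategy, not the author's method.
    """
    log_n = [math.log(n) for n in sizes]
    mean_x = sum(log_n) / len(log_n)
    sxx = sum((x - mean_x) ** 2 for x in log_n)
    best = None
    a = max(scores) + step  # the asymptote must lie above every observed score
    while a <= upper:
        log_gap = [math.log(a - s) for s in scores]
        mean_y = sum(log_gap) / len(log_gap)
        slope = sum((x - mean_x) * (y - mean_y)
                    for x, y in zip(log_n, log_gap)) / sxx
        b = math.exp(mean_y - slope * mean_x)
        c = -slope
        sse = sum((a - b * n ** (-c) - s) ** 2 for n, s in zip(sizes, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
        a += step
    if best is None:
        raise ValueError("no candidate asymptote between max(scores) and upper")
    return best[1], best[2], best[3]

def predict(params, n):
    """Extrapolate the fitted curve to a training set of size n."""
    a, b, c = params
    return a - b * n ** (-c)

# Synthetic (made-up) learning curve points: training size -> validation score.
example_sizes = [100, 200, 400, 800, 1600, 3200, 6400]
example_scores = [0.9 - 0.5 * n ** -0.5 for n in example_sizes]
example_params = fit_inverse_power_law(example_sizes, example_scores)
print("estimated score at n=100000:", round(predict(example_params, 100000), 4))
```

Scanning the asymptote keeps the sketch dependency-free; with SciPy available, a nonlinear least-squares fit of all three parameters at once would be the more usual choice.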
Note

Includes bibliography.

Extent
191 p.
Identifier
FA00013342
Additional Information
Dissertation (Ph.D.)--Florida Atlantic University, 2019.
FAU Electronic Theses and Dissertations Collection
Extension


FAU

IID
FA00013342
Person Preferred Name
Richter, Aaron N. (author)

Graduate College
Physical Description

application/pdf
191 p.
Use and Reproduction
Copyright © is held by the author with permission granted to Florida Atlantic University to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
http://rightsstatements.org/vocab/InC/1.0/
Origin Information
Florida Atlantic University
Boca Raton, Fla.
2019
Physical Location
Florida Atlantic University Libraries
Place

Boca Raton, Fla.
Sub Location
Digital Library
Title
PREDICTING MELANOMA RISK FROM ELECTRONIC HEALTH RECORDS WITH MACHINE LEARNING TECHNIQUES