Data Science | fau.isle.flvc.org

TOPOLOGICAL DATA ANALYSIS FOR DATA SCIENCE: THE DELAUNAY-RIPS COMPLEX, TRIANGULATION STABILITIES, AND PROTEIN STABILITY PREDICTIONS

Model

Digital Document

Publisher

Florida Atlantic University

Description

Topological Data Analysis (TDA) is a relatively new field of research that uses topological notions to extract discriminating features from data. Within TDA, persistent homology (PH) is a robust method for computing multi-dimensional geometric and topological features of a dataset. Because these features are often stable under certain perturbations of the underlying data, are often discriminating, and can be used both for visualizing structure in high-dimensional data and in statistical and machine learning modeling, PH has attracted the interest of researchers across scientific disciplines and in many industry applications. However, computational costs may present challenges to using PH effectively in certain data contexts, and theoretical stability results may not hold in practice. In this dissertation, we develop an algorithm that can reduce the computational burden of computing persistent homology on point cloud data. We define, implement, and empirically test a computationally tractable simplicial complex construction, which we name Delaunay-Rips (DR), for computing persistent homology of Euclidean point cloud data. We demonstrate the practical robustness of DR for persistent homology in comparison with other simplicial complexes in machine learning applications such as predicting sleep state from patient heart rate. To justify the theoretical stability of DR, we prove the stability of the Delaunay triangulation of a point cloud P under perturbations of the points of P. Specifically, we impose a notion of genericity on the points of P to ensure stability. In the final chapter, we contribute to the field of computational biology by taking a data-driven approach to learn topological features of designed proteins from their persistence diagrams. We find correlations between the learned topological features and biochemical features to investigate how protein structure relates to features identified by subject-matter experts.
We train several machine learning models to assess how incorporating topological features alongside biochemical features in training affects performance. Using cover-tree differencing via entropy reduction (CDER), we identify distinguishing regions of the persistence diagrams of stable/unstable proteins. More notably, we find a statistically significant improvement in classification performance (in terms of average precision score) for certain designed secondary structure topologies.
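
The abstract names the Delaunay-Rips construction without spelling it out. As a rough illustration only, the sketch below assumes the reading suggested by the name: keep just the simplices of the Delaunay triangulation, and give each one its Vietoris-Rips filtration value (the length of its longest edge). The function names, the brute-force 2D triangulation, and the four sample points are hypothetical choices made for this example and are not taken from the dissertation.

```python
from itertools import combinations
from math import dist


def circumcircle(a, b, c):
    """Return (center, squared radius) of the circle through a, b, c,
    or None if the three points are (numerically) collinear."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), (ax - ux) ** 2 + (ay - uy) ** 2


def delaunay_triangles(points):
    """Brute-force 2D Delaunay triangulation (illustration only):
    keep each triangle whose circumdisk contains no other input point."""
    n = len(points)
    triangles = []
    for i, j, k in combinations(range(n), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        center, r2 = circle
        others = (m for m in range(n) if m not in (i, j, k))
        if all(dist(points[m], center) ** 2 >= r2 - 1e-9 for m in others):
            triangles.append((i, j, k))
    return triangles


def delaunay_rips_filtration(points):
    """Map each simplex (sorted tuple of vertex indices) to a Rips-style
    filtration value, restricted to simplices of the Delaunay triangulation.
    Assumed rule: vertices enter at 0, edges at their length, and
    triangles at the length of their longest edge."""
    filtration = {(i,): 0.0 for i in range(len(points))}
    for tri in delaunay_triangles(points):
        for u, v in combinations(tri, 2):
            filtration[(u, v)] = dist(points[u], points[v])
        filtration[tri] = max(filtration[e] for e in combinations(tri, 2))
    return filtration


# Four points in general position (no three collinear, no four cocircular).
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (5.0, 4.0)]
filtration = delaunay_rips_filtration(points)
for simplex in sorted(filtration, key=lambda s: (filtration[s], len(s), s)):
    print(simplex, round(filtration[simplex], 3))
```

On these four points the full Vietoris-Rips complex would eventually contain all six edges, whereas the sketch keeps only the five edges of the Delaunay triangulation and omits the long diagonal (0, 3). That reduction in simplex count is the kind of saving the abstract refers to. A practical implementation would replace the brute-force step with a dedicated Delaunay routine and pass the filtration to a persistent homology library.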

Member of

FAU Theses and Dissertations