Data Analysis

Model
Digital Document
Publisher
Florida Atlantic University
Description
The aim of this dissertation is to achieve a thorough understanding of, and develop an algorithmic framework for, a crucial aspect of autonomous and artificial intelligence (AI) systems: data analysis. In the current era of AI and machine learning (ML), "data" holds paramount importance. For effective learning, it is essential to ensure that the training dataset is accurate and comprehensive. Additionally, during system operation, it is vital to identify and address faulty data to prevent potentially catastrophic system failures. Our research in data analysis focuses on creating new mathematical theories and algorithms for outlier-resistant matrix decomposition using L1-norm principal component analysis (PCA). L1-norm PCA has demonstrated robustness against irregular data points and will be pivotal for future AI learning and autonomous system operations.
This dissertation presents a comprehensive exploration of L1-norm techniques and their diverse applications. A summary of our contributions in this manuscript follows. Chapter 1 establishes the foundational mathematical notation and linear algebra concepts critical for the subsequent discussions, along with a review of the complexities of the current state of the art in L1-norm matrix decomposition algorithms. In Chapter 2, we address the L1-norm error decomposition problem by introducing a novel method called "Individual L1-norm-error Principal Component Computation by 3-layer Perceptron" (Perceptron L1 error). Extensive studies demonstrate the efficiency of this greedy L1-norm PC calculator.
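To make the idea of an L1-norm principal component concrete, the following is a minimal sketch and not the dissertation's Perceptron L1 error method. It assumes the projection-maximization form of L1-norm PCA, in which the first component is the unit vector q that maximizes the sum of |<x_i, q>| over the samples x_i. It relies on the known equivalence between that problem and a search over sign vectors b in {+1, -1}^N for the largest ||X b||_2. The function names and the sample data are hypothetical, and the exhaustive search is only practical for a handful of samples.

```python
import itertools
import math


def l1_principal_component(points):
    """First L1-norm principal component by exhaustive sign search.

    points: list of N samples, each a list of D floats (not all zero).
    Returns a unit-length direction q maximizing sum_i |<x_i, q>|.
    Cost grows as 2**(N - 1), so this is for illustration only.
    """
    n = len(points)
    dim = len(points[0])
    best_norm, best_vec = -1.0, None
    # Fix the first sign to +1: b and -b give the same direction up to sign.
    for tail in itertools.product((1.0, -1.0), repeat=n - 1):
        signs = (1.0,) + tail
        # Signed sum of the samples, i.e. X b for the D-by-N data matrix X.
        vec = [sum(s * p[d] for s, p in zip(signs, points)) for d in range(dim)]
        norm = math.sqrt(sum(v * v for v in vec))
        if norm > best_norm:
            best_norm, best_vec = norm, vec
    return [v / best_norm for v in best_vec]


def l1_projection_energy(points, q):
    """Sum of absolute projections of the samples onto direction q."""
    return sum(abs(sum(x * y for x, y in zip(p, q))) for p in points)


# Made-up data: four samples near the line y = x plus one irregular point.
samples = [[-2.0, -1.9], [-1.0, -1.1], [1.0, 0.9], [2.0, 2.1], [0.5, -6.0]]
direction = l1_principal_component(samples)
energy = l1_projection_energy(samples, direction)
```

Because the search is exhaustive, the returned direction is a global maximizer of the L1 projection energy for the given samples; greedy or iterative solvers trade that guarantee for tractable running time.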
Model
Digital Document
Publisher
Florida Atlantic University
Description
Topological Data Analysis (TDA) is a relatively new field of research that utilizes topological notions to extract discriminating features from data. Within TDA, persistent homology (PH) is a robust method to compute multi-dimensional geometric and topological features of a dataset. Because these features are often stable under certain perturbations of the underlying data, are often discriminating, and can be used for visualization of structure in high-dimensional data and in statistical and machine learning modeling, PH has attracted the interest of researchers across scientific disciplines and in many industry applications. However, computational costs may present challenges to effectively using PH in certain data contexts, and theoretical stability results may not hold in practice. In this dissertation, we develop an algorithm that can reduce the computational burden of computing persistent homology on point cloud data. Naming it Delaunay-Rips (DR), we define, implement, and empirically test this computationally tractable simplicial complex construction for computing persistent homology of Euclidean point cloud data. We demonstrate the practical robustness of DR for persistent homology in comparison with other simplicial complexes in machine learning applications such as predicting sleep state from patient heart rate. To justify the theoretical stability of DR, we prove the stability of the Delaunay triangulation of a point cloud P under perturbations of the points of P. Specifically, we impose a notion of genericity on the points of P to ensure stability. In the final chapter, we contribute to the field of computational biology by taking a data-driven approach to learn topological features of designed proteins from their persistence diagrams. We find correlations between the learned topological features and biochemical features to investigate how protein structure relates to features identified by subject-matter experts.
We train several machine learning models to assess the effect of incorporating topological features alongside biochemical features during training. Using cover-tree differencing via entropy reduction (CDER), we identify distinguishing regions of the persistence diagrams of stable/unstable proteins. More notably, we find statistically significant improvement in classification performance (in terms of average precision score) for certain designed secondary structure topologies.
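The Delaunay-Rips construction named in this abstract can be illustrated with a small sketch. The abstract does not spell out the definition, so the code below assumes one natural reading: take the simplices of the Delaunay triangulation and assign each the Vietoris-Rips filtration value, namely the length of its longest edge. It is restricted to planar points in general position (no three collinear, no four cocircular), and it finds Delaunay triangles by a brute-force empty-circumcircle test instead of an efficient triangulation routine. All function names and the sample points are hypothetical.

```python
import itertools
import math


def circumcircle(a, b, c):
    """Return (center, squared radius) of the circle through a, b, c.

    Returns None when the three points are (numerically) collinear.
    """
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), (ux - ax) ** 2 + (uy - ay) ** 2


def delaunay_triangles(points):
    """Brute force: keep triangles whose circumcircle contains no other point."""
    triangles = []
    for i, j, k in itertools.combinations(range(len(points)), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        (ux, uy), r2 = circle
        others = (p for m, p in enumerate(points) if m not in (i, j, k))
        if all((px - ux) ** 2 + (py - uy) ** 2 > r2 for px, py in others):
            triangles.append((i, j, k))
    return triangles


def delaunay_rips_filtration(points):
    """Map each Delaunay simplex (tuple of vertex indices) to its Rips value."""
    filtration = {(i,): 0.0 for i in range(len(points))}
    for tri in delaunay_triangles(points):
        edges = list(itertools.combinations(tri, 2))
        for u, v in edges:
            # Rips weight of an edge: Euclidean distance between its endpoints.
            filtration[(u, v)] = math.dist(points[u], points[v])
        # Rips weight of a triangle: the length of its longest edge.
        filtration[tri] = max(filtration[e] for e in edges)
    return filtration


# Made-up planar point cloud in general position.
cloud = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (5.0, 4.0)]
filtration = delaunay_rips_filtration(cloud)
```

Under this reading, the filtration has the same edge weights as a Rips complex but only as many simplices as the Delaunay triangulation, which is where the computational savings over a full Rips complex would come from. A filtration of this kind could then be passed to a persistent homology library.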