🍩 Database of Original & Non-Theoretical Uses of Topology

  1. TOAST: Topological Algorithm for Singularity Tracking (2022)

    Julius von Rohrscheidt, Bastian Rieck
    Abstract The manifold hypothesis, which assumes that data lie on or close to an unknown manifold of low intrinsic dimensionality, is a staple of modern machine learning research. However, recent work has shown that real-world data exhibit distinct non-manifold structures, which result in singularities that can lead to erroneous conclusions about the data. Detecting such singularities is therefore crucial as a precursor to interpolation and inference tasks. We address detecting singularities by developing (i) persistent local homology, a new topology-driven framework for quantifying the intrinsic dimension of a data set locally, and (ii) Euclidicity, a topology-based multi-scale measure for assessing the 'manifoldness' of individual points. We show that our approach can reliably identify singularities of complex spaces, while also capturing singular structures in real-world data sets.
  2. Topological Data Analysis for Electric Motor Eccentricity Fault Detection (2022)

    Bingnan Wang, Chungwei Lin, Hiroshi Inoue, Makoto Kanemaru
    Abstract In this paper, we develop a topological data analysis (TDA) method for motor current signature analysis (MCSA), and apply it to induction motor eccentricity fault detection. We introduce TDA and present the procedure of extracting topological features from time-domain data that will be represented using persistence diagrams and vectorized Betti sequences. The procedure is applied to induction machine phase current signal analysis, and shown to be highly effective in differentiating signals from different eccentricity levels. With TDA, we are able to use a simple regression model that can predict the fault levels with reasonable accuracy, even for the data of eccentricity levels that are not seen in the training data. The proposed method is model-free, and only requires a small segment of time-domain data to make a prediction. These advantages make it attractive for a wide range of fault detection applications.
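
The "vectorized Betti sequences" mentioned in this abstract can be illustrated with a minimal sketch. It assumes a persistence diagram is given as a list of (birth, death) pairs for a single homology dimension and samples the Betti number on a fixed grid of thresholds. The function name `betti_curve`, the half-open interval convention, and the toy diagram are illustrative assumptions, not details taken from the paper.

```python
def betti_curve(diagram, thresholds):
    """For each threshold t, count the intervals [birth, death) that contain t."""
    return [
        sum(1 for birth, death in diagram if birth <= t < death)
        for t in thresholds
    ]


# Hypothetical persistence diagram for one homology dimension.
diagram = [(0.0, 1.0), (0.2, 0.5), (0.4, 2.0)]
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0, 1.5]

# A fixed-length vector that could be passed to a simple regression model.
curve = betti_curve(diagram, thresholds)
print(curve)
```

Because the threshold grid is fixed, diagrams with different numbers of points all map to vectors of the same length, which is what makes them usable as regression features.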
  3. Quantitative Analysis of Phase Transitions in Two-Dimensional XY Models Using Persistent Homology (2022)

    Nicholas Sale, Jeffrey Giansiracusa, Biagio Lucini
    Abstract We use persistent homology and persistence images as an observable of three different variants of the two-dimensional XY model in order to identify and study their phase transitions. We examine models with the classical XY action, a topological lattice action, and an action with an additional nematic term. In particular, we introduce a new way of computing the persistent homology of lattice spin model configurations and, by considering the fluctuations in the output of logistic regression and k-nearest neighbours models trained on persistence images, we develop a methodology to extract estimates of the critical temperature and the critical exponent of the correlation length. We put particular emphasis on finite-size scaling behaviour and producing estimates with quantifiable error. For each model we successfully identify its phase transition(s) and are able to get an accurate determination of the critical temperatures and critical exponents of the correlation length.
  4. Severe Slugging Flow Identification From Topological Indicators (2022)

    Simone Casolo
    Abstract In this work, topological data analysis is used to identify the onset of severe slug flow in offshore petroleum production systems. Severe slugging is a multiphase flow regime known to be very inefficient and potentially harmful to process equipment and it is characterized by large oscillations in the production fluid pressure. Time series from pressure sensors in subsea oil wells are processed by means of Takens embedding to produce point clouds of data. Embedded sensor data is then analyzed using persistent homology to obtain topological indicators capable of revealing the occurrence of severe slugging in a condition-based monitoring approach. A large dataset of well events consisting of both real and simulated data is used to demonstrate the possibility of automating severe slugging detection from live data via topological data analysis. Methods based on persistence diagrams are shown to accurately identify severe slugging and to classify different flow regimes from pressure signals of producing wells with supervised machine learning.
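
The Takens embedding step described in this abstract turns a scalar time series into a point cloud of delayed coordinates. The sketch below is a generic time-delay embedding in plain Python; the function name `takens_embedding`, the parameter values, and the synthetic sine signal are assumptions for illustration and do not come from the paper.

```python
import math


def takens_embedding(series, dimension, delay):
    """Map a scalar series to points (x[i], x[i + delay], ..., x[i + (dimension - 1) * delay])."""
    n_points = len(series) - (dimension - 1) * delay
    if n_points <= 0:
        raise ValueError("series is too short for this dimension and delay")
    return [
        tuple(series[i + j * delay] for j in range(dimension))
        for i in range(n_points)
    ]


# Synthetic periodic signal with a period of 20 samples (a stand-in for sensor data).
signal = [math.sin(2 * math.pi * k / 20) for k in range(100)]

# A delay of a quarter period sends the sine wave to points on the unit circle.
cloud = takens_embedding(signal, dimension=2, delay=5)
print(len(cloud), cloud[0])
```

A periodic oscillation traces a closed loop in the embedded space, which is the kind of feature that one-dimensional persistent homology can then detect.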
  5. A Topological Machine Learning Pipeline for Classification (2022)

    Francesco Conti, Davide Moroni, Maria A. Pascali
    Abstract In this work, we develop a pipeline that associates Persistence Diagrams to digital data via the most appropriate filtration for the type of data considered. Using a grid search approach, this pipeline determines optimal representation methods and parameters. The development of such a topological pipeline for Machine Learning involves two crucial steps that strongly affect its performance: firstly, digital data must be represented as an algebraic object with a proper associated filtration in order to compute its topological summary, the Persistence Diagram. Secondly, the persistence diagram must be transformed with suitable representation methods in order to be introduced in a Machine Learning algorithm. We assess the performance of our pipeline, and in parallel, we compare the different representation methods on popular benchmark datasets. This work is a first step toward both an easy and ready-to-use pipeline for data classification using persistent homology and Machine Learning, and to understand the theoretical reasons why, given a dataset and a task to be performed, a pair (filtration, topological representation) is better than another.
  6. Exploring the Geometry and Topology of Neural Network Loss Landscapes (2022)

    Stefan Horoi, Jessie Huang, Bastian Rieck, Guillaume Lajoie, Guy Wolf, Smita Krishnaswamy
    Abstract Recent work has established clear links between the generalization performance of trained neural networks and the geometry of their loss landscape near the local minima to which they converge. This suggests that qualitative and quantitative examination of the loss landscape geometry could yield insights about neural network generalization performance during training. To this end, researchers have proposed visualizing the loss landscape through the use of simple dimensionality reduction techniques. However, such visualization methods have been limited by their linear nature and only capture features in one or two dimensions, thus restricting sampling of the loss landscape to lines or planes. Here, we expand and improve upon these in three ways. First, we present a novel “jump and retrain” procedure for sampling relevant portions of the loss landscape. We show that the resulting sampled data holds more meaningful information about the network’s ability to generalize. Next, we show that non-linear dimensionality reduction of the jump and retrain trajectories via PHATE, a trajectory and manifold-preserving method, allows us to visualize differences between networks that are generalizing well vs poorly. Finally, we combine PHATE trajectories with a computational homology characterization to quantify trajectory differences.
  7. Time-Inhomogeneous Diffusion Geometry and Topology (2022)

    Guillaume Huguet, Alexander Tong, Bastian Rieck, Jessie Huang, Manik Kuchroo, Matthew Hirn, Guy Wolf, Smita Krishnaswamy
    Abstract Diffusion condensation is a dynamic process that yields a sequence of multiscale data representations that aim to encode meaningful abstractions. It has proven effective for manifold learning, denoising, clustering, and visualization of high-dimensional data. Diffusion condensation is constructed as a time-inhomogeneous process where each step first computes and then applies a diffusion operator to the data. We theoretically analyze the convergence and evolution of this process from geometric, spectral, and topological perspectives. From a geometric perspective, we obtain convergence bounds based on the smallest transition probability and the radius of the data, whereas from a spectral perspective, our bounds are based on the eigenspectrum of the diffusion kernel. Our spectral results are of particular interest since most of the literature on data diffusion is focused on homogeneous processes. From a topological perspective, we show diffusion condensation generalizes centroid-based hierarchical clustering. We use this perspective to obtain a bound based on the number of data points, independent of their location. To understand the evolution of the data geometry beyond convergence, we use topological data analysis. We show that the condensation process itself defines an intrinsic diffusion homology. We use this intrinsic topology as well as an ambient topology to study how the data changes over diffusion time. We demonstrate both homologies in well-understood toy examples. Our work gives theoretical insights into the convergence of diffusion condensation, and shows that it provides a link between topological and geometric data analysis.
  8. Continuous Indexing of Fibrosis (CIF): Improving the Assessment and Classification of MPN Patients (2022)

    Hosuk Ryou, Korsuk Sirinukunwattana, Alan Aberdeen, Gillian Grindstaff, Bernadette Stolz, Helen Byrne, Heather A. Harrington, Nikolaos Sousos, Anna L. Godfrey, Claire N. Harrison, Bethan Psaila, Adam J. Mead, Gabrielle Rees, Gareth D. Turner, Jens Rittscher, Daniel Royston
    Abstract The detection and grading of fibrosis in myeloproliferative neoplasms (MPN) is an important component of disease classification, prognostication and disease monitoring. However, current fibrosis grading systems are only semi-quantitative and fail to capture sample heterogeneity. To improve the detection, quantitation and representation of reticulin fibrosis, we developed a machine learning (ML) approach using bone marrow trephine (BMT) samples (n = 107) from patients diagnosed with MPN or a reactive / nonneoplastic marrow. The resulting Continuous Indexing of Fibrosis (CIF) enhances the detection and monitoring of fibrosis within BMTs, and aids the discrimination of MPN subtypes. When combined with megakaryocyte feature analysis, CIF discriminates between the frequently challenging differential diagnosis of essential thrombocythemia (ET) and pre-fibrotic myelofibrosis (pre-PMF) with high predictive accuracy [area under the curve = 0.94]. CIF also shows significant promise in the identification of MPN patients at risk of disease progression; analysis of samples from 35 patients diagnosed with ET and enrolled in the Primary Thrombocythemia-1 (PT-1) trial identified features predictive of post-ET myelofibrosis (area under the curve = 0.77). In addition to these clinical applications, automated analysis of fibrosis has clear potential to further refine disease classification boundaries and inform future studies of the micro-environmental factors driving disease initiation and progression in MPN and other stem cell disorders. The image analysis methods used to generate CIF can be readily integrated with those of other key morphological features in MPNs, including megakaryocyte morphology, that lie beyond the scope of conventional histological assessment. 
    Key Points: (1) Machine learning enables an objective and quantitative description of reticulin fibrosis within the bone marrow of patients with myeloproliferative neoplasms (MPN). (2) Automated analysis and Continuous Indexing of Fibrosis (CIF) captures heterogeneity within MPN samples and has utility in refined classification and disease monitoring. (3) Quantitative fibrosis assessment combined with topological data analysis may help to predict patients at increased risk of progression to post-ET myelofibrosis, and assist in the discrimination of ET and pre-fibrotic PMF (pre-PMF).
  9. Genomics Data Analysis via Spectral Shape and Topology (2022)

    Erik J. Amézquita, Farzana Nasrin, Kathleen M. Storey, Masato Yoshizawa
    Abstract Mapper, a topological algorithm, is frequently used as an exploratory tool to build a graphical representation of data. This representation can help to gain a better understanding of the intrinsic shape of high-dimensional genomic data and to retain information that may be lost using standard dimension-reduction algorithms. We propose a novel workflow to process and analyze RNA-seq data from tumor and healthy subjects integrating Mapper and differential gene expression. Precisely, we show that a Gaussian mixture approximation method can be used to produce graphical structures that successfully separate tumor and healthy subjects, and produce two subgroups of tumor subjects. A further analysis using DESeq2, a popular tool for the detection of differentially expressed genes, shows that these two subgroups of tumor cells bear two distinct gene regulations, suggesting two discrete paths for forming lung cancer, which could not be highlighted by other popular clustering methods, including t-SNE. Although Mapper shows promise in analyzing high-dimensional data, building tools to statistically analyze Mapper graphical structures is limited in the existing literature. In this paper, we develop a scoring method using heat kernel signatures that provides an empirical setting for statistical inferences such as hypothesis testing, sensitivity analysis, and correlation analysis.
  10. A Novel Quality Clustering Methodology on Fab-Wide Wafer Map Images in Semiconductor Manufacturing (2022)

    Yuan-Ming Hsu, Xiaodong Jia, Wenzhe Li, Jay Lee
    Abstract In semiconductor manufacturing, clustering the fab-wide wafer map images is of critical importance for practitioners to understand the subclusters of wafer defects, recognize novel clusters or anomalies, and develop fast reactions to quality issues. However, due to the high-mix manufacturing of diversified wafer products of different sizes and technologies, it is difficult to cluster the wafer map images across the fab. This paper addresses this challenge by proposing a novel methodology for fab-wide wafer map data clustering. In the proposed methodology, a well-known deep learning technique, vision transformer with multi-head attention is first trained to convert binary wafer images of different sizes into condensed feature vectors for efficient clustering. Then, the Topological Data Analysis (TDA), which is widely used in biomedical applications, is employed to visualize the data clusters and identify the anomalies. The TDA yields a topological representation of high-dimensional big data as well as its local clusters by creating a graph that shows nodes corresponding to the clusters within the data. The effectiveness of the proposed methodology is demonstrated by clustering the public wafer map dataset WM-811k from the real application which has a total of 811,457 wafer map images. We further demonstrate the potential applicability of topology data analytics in the semiconductor area by visualization.
  11. Filtration Curves for Graph Representation (2021)

    Leslie O'Bray, Bastian Rieck, Karsten Borgwardt
    Abstract The two predominant approaches to graph comparison in recent years are based on (i) enumerating matching subgraphs or (ii) comparing neighborhoods of nodes. In this work, we complement these two perspectives with a third way of representing graphs: using filtration curves from topological data analysis that capture both edge weight information and global graph structure. Filtration curves are highly efficient to compute and lead to expressive representations of graphs, which we demonstrate on graph classification benchmark datasets. Our work opens the door to a new form of graph representation in data mining.
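
A filtration curve of the kind described in this abstract tracks a graph descriptor as edges are added in order of increasing weight. The sketch below uses the number of connected components as the descriptor and a union-find structure to update it. The choice of descriptor, the function name `component_filtration_curve`, and the toy graph are assumptions made for illustration; the abstract does not specify them.

```python
def component_filtration_curve(n_vertices, weighted_edges):
    """Return (weight, n_components) pairs for the edge-weight filtration of a graph.

    weighted_edges holds (weight, u, v) triples; vertices are 0 .. n_vertices - 1.
    """
    parent = list(range(n_vertices))

    def find(v):
        # Walk to the root, halving the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = n_vertices
    curve = []
    for weight, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            components -= 1
        if curve and curve[-1][0] == weight:
            # Several edges share this weight: keep only the final count.
            curve[-1] = (weight, components)
        else:
            curve.append((weight, components))
    return curve


# Hypothetical weighted graph on four vertices.
edges = [(1.0, 0, 1), (2.0, 1, 2), (2.0, 2, 3), (3.0, 0, 3)]
curve = component_filtration_curve(4, edges)
print(curve)
```

Other descriptors (for example, a node-label histogram) could be substituted by updating a different statistic inside the same loop.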
  12. Topological Graph Neural Networks (2021)

    Max Horn, Edward De Brouwer, Michael Moor, Yves Moreau, Bastian Rieck, Karsten Borgwardt
    Abstract Graph neural networks (GNNs) are a powerful architecture for tackling graph learning tasks, yet have been shown to be oblivious to eminent substructures, such as cycles. We present TOGL, a novel layer that incorporates global topological information of a graph using persistent homology. TOGL can be easily integrated into any type of GNN and is strictly more expressive in terms of the Weisfeiler--Lehman test of isomorphism. Augmenting GNNs with our layer leads to beneficial predictive performance, both on synthetic data sets, which can be trivially classified by humans but not by ordinary GNNs, and on real-world data.
  13. Data-Driven and Automatic Surface Texture Analysis Using Persistent Homology (2021)

    Melih C. Yesilli, Firas A. Khasawneh
    Abstract Surface roughness plays an important role in analyzing engineering surfaces. It quantifies the surface topography and can be used to determine whether the resulting surface finish is acceptable or not. Nevertheless, while several existing tools and standards are available for computing surface roughness, these methods rely heavily on user input thus slowing down the analysis and increasing manufacturing costs. Therefore, fast and automatic determination of the roughness level is essential to avoid costs resulting from surfaces with unacceptable finish, and user-intensive analysis. In this study, we propose a Topological Data Analysis (TDA) based approach to classify the roughness level of synthetic surfaces using both their areal images and profiles. We utilize persistent homology from TDA to generate persistence diagrams that encapsulate information on the shape of the surface. We then obtain feature matrices for each surface or profile using Carlsson coordinates, persistence images, and template functions. We compare our results to two widely used methods in the literature: Fast Fourier Transform (FFT) and Gaussian filtering. The results show that our approach yields mean accuracies as high as 97%. We also show that, in contrast to existing surface analysis tools, our TDA-based approach is fully automatable and provides adaptive feature extraction.
  14. Topological Data Analysis of C. Elegans Locomotion and Behavior (2021)

    Ashleigh Thomas, Kathleen Bates, Alex Elchesen, Iryna Hartsock, Hang Lu, Peter Bubenik
    Abstract Video of nematodes/roundworms was analyzed using persistent homology to study locomotion and behavior. In each frame, an organism's body posture was represented by a high-dimensional vector. By concatenating points in fixed-duration segments of this time series, we created a sliding window embedding (sometimes called a time delay embedding) where each point corresponds to a sequence of postures of an organism. Persistent homology on the points in this time series detected behaviors and comparisons of these persistent homology computations detected variation in their corresponding behaviors. We used average persistence landscapes and machine learning techniques to study changes in locomotion and behavior in varying environments.
  15. Topological Regularization for Dense Prediction (2021)

    Deqing Fu, Bradley J. Nelson
    Abstract Dense prediction tasks such as depth perception and semantic segmentation are important applications in computer vision that have a concrete topological description in terms of partitioning an image into connected components or estimating a function with a small number of local extrema corresponding to objects in the image. We develop a form of topological regularization based on persistent homology that can be used in dense prediction tasks with these topological descriptions. Experimental results show that the output topology can also appear in the internal activations of trained neural networks which allows for a novel use of topological regularization to the internal states of neural networks during training, reducing the computational cost of the regularization. We demonstrate that this topological regularization of internal activations leads to improved convergence and test benchmarks on several problems and architectures.
  16. Geometric Feature Performance Under Downsampling for EEG Classification Tasks (2021)

    Bryan Bischof, Eric Bunch
    Abstract We experimentally investigate a collection of feature engineering pipelines for use with a CNN for classifying eyes-open or eyes-closed from electroencephalogram (EEG) time-series from the Bonn dataset. Using the Takens' embedding--a geometric representation of time-series--we construct simplicial complexes from EEG data. We then compare $\epsilon$-series of Betti-numbers and $\epsilon$-series of graph spectra (a novel construction)--two topological invariants of the latent geometry from these complexes--to raw time series of the EEG to fill in a gap in the literature for benchmarking. These methods, inspired by Topological Data Analysis, are used for feature engineering to capture local geometry of the time-series. Additionally, we test these feature pipelines' robustness to downsampling and data reduction. This paper seeks to establish clearer expectations for both time-series classification via geometric features, and how CNNs for time-series respond to data of degraded resolution.
  17. TDAExplore: Quantitative Analysis of Fluorescence Microscopy Images Through Topology-Based Machine Learning (2021)

    Parker Edwards, Kristen Skruber, Nikola Milićević, James B. Heidings, Tracy-Ann Read, Peter Bubenik, Eric A. Vitriol
    Abstract Recent advances in machine learning have greatly enhanced automatic methods to extract information from fluorescence microscopy data. However, current machine-learning-based models can require hundreds to thousands of images to train, and the most readily accessible models classify images without describing which parts of an image contributed to classification. Here, we introduce TDAExplore, a machine learning image analysis pipeline based on topological data analysis. It can classify different types of cellular perturbations after training with only 20–30 high-resolution images and performs robustly on images from multiple subjects and microscopy modes. Using only images and whole-image labels for training, TDAExplore provides quantitative, spatial information, characterizing which image regions contribute to classification. Computational requirements to train TDAExplore models are modest and a standard PC can perform training with minimal user input. TDAExplore is therefore an accessible, powerful option for obtaining quantitative information about imaging data in a wide variety of applications.
  18. Determining Structural Properties of Artificial Neural Networks Using Algebraic Topology (2021)

    David P. Fernández, Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marta Villegas
    Abstract Artificial Neural Networks (ANNs) are widely used for approximating complex functions. The process that is usually followed to define the most appropriate architecture for an ANN given a specific function is mostly empirical. Once this architecture has been defined, weights are usually optimized according to the error function. On the other hand, we observe that ANNs can be represented as graphs and their topological 'fingerprints' can be obtained using Persistent Homology (PH). In this paper, we describe a proposal focused on designing more principled architecture search procedures. To do this, different architectures for solving problems related to a heterogeneous set of datasets have been analyzed. The results of the evaluation corroborate that PH effectively characterizes the ANN invariants: when ANN density (layers and neurons) or sample feeding order is the only difference, PH topological invariants appear; in the opposite direction in different sub-problems (i.e. different labels), PH varies. This approach based on topological analysis helps towards the goal of designing more principled architecture search procedures and having a better understanding of ANNs.
  19. Persistent Homology Based Graph Convolution Network for Fine-Grained 3D Shape Segmentation (2021)

    Chi-Chong Wong, Chi-Man Vong
    Abstract Fine-grained 3D segmentation is an important task in 3D object understanding, especially in applications such as intelligent manufacturing or parts analysis for 3D objects. However, many challenges involved in such problems are yet to be solved, such as i) interpreting the complex structures located in different regions for 3D objects; ii) capturing fine-grained structures with sufficient topology correctness. Current deep learning and graph machine learning methods fail to tackle such challenges and thus provide inferior performance in fine-grained 3D analysis. In this work, methods in topological data analysis are incorporated with a geometric deep learning model for the task of fine-grained segmentation for 3D objects. We propose a novel neural network model called Persistent Homology based Graph Convolution Network (PHGCN), which i) integrates persistent homology into graph convolution network to capture multi-scale structural information that can accurately represent complex structures for 3D objects; ii) applies a novel Persistence Diagram Loss ($\mathcal{L}_{PD}$) that provides sufficient topology correctness for segmentation over the fine-grained structures. Extensive experiments on fine-grained 3D segmentation validate the effectiveness of the proposed PHGCN model and show significant improvements over current state-of-the-art methods.
  20. The Shape of Cancer Relapse: Topological Data Analysis Predicts Recurrence in Paediatric Acute Lymphoblastic Leukaemia (2021)

    Salvador Chulián, Bernadette J. Stolz, Álvaro Martínez-Rubio, Cristina B. Goñi, Juan F. Gutiérrez, Teresa C. Velázquez, Águeda M. Quintana, Manuel R. Orellana, Ana C. Robleda, José L. Soler, Alfredo M. Puras, María V. Sánchez, María Rosa, Víctor M. Pérez-García, Helen Byrne
    Abstract Acute Lymphoblastic Leukaemia (ALL) is the most frequent paediatric cancer. Modern therapies have improved survival rates, but approximately 15-20% of patients relapse. At present, patients’ risk of relapse is assessed by projecting high-dimensional flow cytometry data onto a subset of biomarkers and manually estimating the shape of this reduced data. Here, we apply methods from topological data analysis (TDA), which quantify shape in data via features such as connected components and loops, to pre-treatment ALL datasets with known outcomes. We combine these fully unsupervised analyses with machine learning to identify features in the pre-treatment data that are prognostic for risk of relapse. We find significant topological differences between relapsing and non-relapsing patients and confirm the predictive power of CD10, CD20, CD38, and CD45. Further, we are able to use the TDA descriptors to predict patients who relapsed. We propose three prognostic pipelines that readily extend to other haematological malignancies. Teaser: Topology reveals features in flow cytometry data which predict relapse of patients with acute lymphoblastic leukemia.
  21. Quantification of the Immune Content in Neuroblastoma: Deep Learning and Topological Data Analysis in Digital Pathology (2021)

    Nicole Bussola, Bruno Papa, Ombretta Melaiu, Aurora Castellano, Doriana Fruci, Giuseppe Jurman
    Abstract We introduce here a novel machine learning (ML) framework to address the issue of the quantitative assessment of the immune content in neuroblastoma (NB) specimens. First, the EUNet, a U-Net with an EfficientNet encoder, is trained to detect lymphocytes on tissue digital slides stained with the CD3 T-cell marker. The training set consists of 3782 images extracted from an original collection of 54 whole slide images (WSIs), manually annotated for a total of 73,751 lymphocytes. Resampling strategies, data augmentation, and transfer learning approaches are adopted to warrant reproducibility and to reduce the risk of overfitting and selection bias. Topological data analysis (TDA) is then used to define activation maps from different layers of the neural network at different stages of the training process, described by persistence diagrams (PD) and Betti curves. TDA is further integrated with the uniform manifold approximation and projection (UMAP) dimensionality reduction and the hierarchical density-based spatial clustering of applications with noise (HDBSCAN) algorithm for clustering, by the deep features, the relevant subgroups and structures, across different levels of the neural network. Finally, the recent TwoNN approach is leveraged to study the variation of the intrinsic dimensionality of the U-Net model. As the main task, the proposed pipeline is employed to evaluate the density of lymphocytes over the whole tissue area of the WSIs. The model achieves good results with mean absolute error 3.1 on test set, showing significant agreement between densities estimated by our EUNet model and by trained pathologists, thus indicating the potentialities of a promising new strategy in the quantification of the immune content in NB specimens. 
    Moreover, the UMAP algorithm unveiled interesting patterns compatible with pathological characteristics, also highlighting novel insights into the dynamics of the intrinsic dataset dimensionality at different stages of the training process. All the experiments were run on the Microsoft Azure cloud platform.
  22. Steinhaus Filtration and Stable Paths in the Mapper (2020)

    Dustin L. Arendt, Matthew Broussard, Bala Krishnamoorthy, Nathaniel Saul
    Abstract Two central concepts from topological data analysis are persistence and the Mapper construction. Persistence employs a sequence of objects built on data called a filtration. A Mapper produces insightful summaries of data, and has found widespread applications in diverse areas. We define a new filtration called the cover filtration built from a single cover based on a generalized Steinhaus distance, which is a generalization of Jaccard distance. We prove a stability result: the cover filtrations of two covers are $\alpha/m$ interleaved, where $\alpha$ is a bound on bottleneck distance between covers and $m$ is the size of the smallest set in either cover. We also show our construction is equivalent to the Čech filtration under certain settings, and the Vietoris-Rips filtration completely determines the cover filtration in all cases. We then develop a theory for stable paths within this filtration. Unlike standard results on stability in topological persistence, our definition of path stability aligns exactly with the above result on stability of cover filtration. We demonstrate how our framework can be employed in a variety of applications where a metric is not obvious but a cover is readily available. First we present a new model for recommendation systems using cover filtration. For an explicit example, stable paths identified on a movies data set represent sequences of movies constituting gentle transitions from one genre to another. As a second application in explainable machine learning, we apply the Mapper for model induction, providing explanations in the form of paths between subpopulations. Stable paths in the Mapper from a supervised machine learning model trained on the FashionMNIST data set provide improved explanations of relationships between subpopulations of images.
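
The Jaccard distance that this abstract generalizes is easy to state in code. The sketch below computes it for two finite sets and then shows one natural multi-set extension, one minus the size of the common intersection divided by the size of the union. Treating that extension as the paper's "generalized Steinhaus distance" is an assumption here, as are the function names and the toy cover elements.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two finite sets: 1 - |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention: two empty sets are at distance 0
    return 1.0 - len(a & b) / len(union)


def multiset_steinhaus_distance(sets, measure=len):
    """Assumed multi-set extension: 1 - measure(intersection) / measure(union)."""
    sets = [set(s) for s in sets]
    union = set().union(*sets)
    if not union:
        return 0.0
    intersection = set.intersection(*sets)
    return 1.0 - measure(intersection) / measure(union)


# Hypothetical cover elements, given as sets of data point indices.
cover = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
print(jaccard_distance(cover[0], cover[1]))
print(multiset_steinhaus_distance(cover))
```

The `measure` argument defaults to counting elements; passing a different set function would weight points unequally, which is the sense in which a Steinhaus-type distance generalizes the Jaccard case.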
  23. Topological Data Analysis in Text Classification: Extracting Features With Additive Information (2020)

    Shafie Gholizadeh, Ketki Savle, Armin Seyeditabari, Wlodek Zadrozny
    Abstract While the strength of Topological Data Analysis has been explored in many studies on high dimensional numeric data, it is still a challenging task to apply it to text. As the primary goal in topological data analysis is to define and quantify the shapes in numeric data, defining shapes in the text is much more challenging, even though the geometries of vector spaces and conceptual spaces are clearly relevant for information retrieval and semantics. In this paper, we examine two different methods of extraction of topological features from text, using as the underlying representations of words the two most popular methods, namely word embeddings and TF-IDF vectors. To extract topological features from the word embedding space, we interpret the embedding of a text document as high dimensional time series, and we analyze the topology of the underlying graph where the vertices correspond to different embedding dimensions. For topological data analysis with the TF-IDF representations, we analyze the topology of the graph whose vertices come from the TF-IDF vectors of different blocks in the textual document. In both cases, we apply homological persistence to reveal the geometric structures under different distance resolutions. Our results show that these topological features carry some exclusive information that is not captured by conventional text mining methods. In our experiments we observe adding topological features to the conventional features in ensemble models improves the classification results (up to 5%). On the other hand, as expected, topological features by themselves may not be sufficient for effective classification. It is an open problem to see whether TDA features from word embeddings might be sufficient, as they seem to perform within a range of a few points from top results obtained with a linear support vector classifier.
  24. Cell Complex Neural Networks (2020)

    Mustafa Hajij, Kyle Istvan, Ghada Zamzami
    Abstract Cell complexes are topological spaces constructed from simple blocks called cells. They generalize graphs, simplicial complexes, and polyhedral complexes that form important domains for practical applications. We propose a general, combinatorial, and unifying construction for performing neural network-type computations on cell complexes. Furthermore, we introduce inter-cellular message passing schemes, message passing schemes on cell complexes that take the topology of the underlying space into account. In particular, our method generalizes many of the most popular types of graph neural networks.
  25. Interpretable Phase Detection and Classification With Persistent Homology (2020)

    Alex Cole, Gregory J. Loges, Gary Shiu
    Abstract We apply persistent homology to the task of discovering and characterizing phase transitions, using lattice spin models from statistical physics for working examples. Persistence images provide a useful representation of the homological data for conducting statistical tasks. To identify the phase transitions, a simple logistic regression on these images is sufficient for the models we consider, and interpretable order parameters are then read from the weights of the regression. Magnetization, frustration and vortex-antivortex structure are identified as relevant features for characterizing phase transitions.
  26. Simplicial Neural Networks (2020)

    Stefania Ebli, Michaël Defferrard, Gard Spreemann
    Abstract We present simplicial neural networks (SNNs), a generalization of graph neural networks to data that live on a class of topological spaces called simplicial complexes. These are natural multi-dimensional extensions of graphs that encode not only pairwise relationships but also higher-order interactions between vertices - allowing us to consider richer data, including vector fields and $n$-fold collaboration networks. We define an appropriate notion of convolution that we leverage to construct the desired convolutional neural networks. We test the SNNs on the task of imputing missing data on coauthorship complexes.
  27. Can Neural Networks Learn Persistent Homology Features? (2020)

    Guido Montúfar, Nina Otter, Yuguang Wang
    Abstract Topological data analysis uses tools from topology -- the mathematical area that studies shapes -- to create representations of data. In particular, in persistent homology, one studies one-parameter families of spaces associated with data, and persistence diagrams describe the lifetime of topological invariants, such as connected components or holes, across the one-parameter family. In many applications, one is interested in working with features associated with persistence diagrams rather than the diagrams themselves. In our work, we explore the possibility of learning several types of features extracted from persistence diagrams using neural networks.
  28. Graph Filtration Learning (2020)

    Christoph Hofer, Florian Graf, Bastian Rieck, Marc Niethammer, Roland Kwitt
    Abstract We propose an approach to learning with graph-structured data in the problem domain of graph classification. In particular, we present a novel type of readout operation to aggregate node features into a graph-level representation. To this end, we leverage persistent homology computed via a real-valued, learnable, filter function. We establish the theoretical foundation for differentiating through the persistent homology computation. Empirically, we show that this type of readout operation compares favorably to previous techniques, especially when the graph connectivity structure is informative for the learning problem.
  29. Topological Autoencoders (2020)

    Michael Moor, Max Horn, Bastian Rieck, Karsten Borgwardt
    Abstract We propose a novel approach for preserving topological structures of the input space in latent representations of autoencoders. Using persistent homology, a technique from topological data analysis, we calculate topological signatures of both the input and latent space to derive a topological loss term. Under weak theoretical assumptions, we construct this loss in a differentiable manner, such that the encoding learns to retain multi-scale connectivity information. We show that our approach is theoretically well-founded and that it exhibits favourable latent representations on a synthetic manifold as well as on real-world image data sets, while preserving low reconstruction errors.
  30. Topologically Densified Distributions (2020)

    Christoph Hofer, Florian Graf, Marc Niethammer, Roland Kwitt
    Abstract We study regularization in the context of small sample-size learning with over-parametrized neural networks. Specifically, we shift focus from architectural properties, such as norms on the network weights, to properties of the internal representations before a linear classifier. Specifically, we impose a topological constraint on samples drawn from the probability measure induced in that space. This provably leads to mass concentration effects around the representations of training instances, i.e., a property beneficial for generalization. By leveraging previous work to impose topological constraints in a neural network setting, we provide empirical evidence (across various vision benchmarks) to support our claim for better generalization.
  31. A Topological Framework for Deep Learning (2020)

    Mustafa Hajij, Kyle Istvan
    Abstract We utilize classical facts from topology to show that the classification problem in machine learning is always solvable under very mild conditions. Furthermore, we show that a softmax classification network acts on an input topological space by a finite sequence of topological moves to achieve the classification task. Moreover, given a training dataset, we show how topological formalism can be used to suggest the appropriate architectural choices for neural networks designed to be trained as classifiers on the data. Finally, we show how the architecture of a neural network cannot be chosen independently from the shape of the underlying data. To demonstrate these results, we provide example datasets and show how they are acted upon by neural nets from this topological perspective.
  32. Topological Machine Learning for Multivariate Time Series (2020)

    Chengyuan Wu, Carol A. Hargreaves
    Abstract We develop a framework for analyzing multivariate time series using topological data analysis (TDA) methods. The proposed methodology involves converting the multivariate time series to point cloud data, calculating Wasserstein distances between the persistence diagrams and using the $k$-nearest neighbors algorithm ($k$-NN) for supervised machine learning. Two methods (symmetry-breaking and anchor points) are also introduced to enable TDA to better analyze data with heterogeneous features that are sensitive to translation, rotation, or choice of coordinates. We apply our methods to room occupancy detection based on 5 time-dependent variables (temperature, humidity, light, CO2 and humidity ratio). Experimental results show that topological methods are effective in predicting room occupancy during a time window. We also apply our methods to an Activity Recognition dataset and obtain good results.
  33. A Novel Method of Extracting Topological Features From Word Embeddings (2020)

    Shafie Gholizadeh, Armin Seyeditabari, Wlodek Zadrozny
    Abstract In recent years, topological data analysis has been utilized for a wide range of problems to deal with high dimensional noisy data. While text representations are often high dimensional and noisy, there are only a few works on the application of topological data analysis in natural language processing. In this paper, we introduce a novel algorithm to extract topological features from word embedding representation of text that can be used for text classification. Working on word embeddings, topological data analysis can interpret the embedding high-dimensional space and discover the relations among different embedding dimensions. We will use persistent homology, the most commonly used tool from topological data analysis, for our experiment. Examining our topological algorithm on long textual documents, we will show our defined topological features may outperform conventional text mining features.
  34. Generalized Penalty for Circular Coordinate Representation (2020)

    Hengrui Luo, Alice Patania, Jisu Kim, Mikael Vejdemo-Johansson
    Abstract Topological Data Analysis (TDA) provides novel approaches that allow us to analyze the geometrical shapes and topological structures of a dataset. As one important application, TDA can be used for data visualization and dimension reduction. We follow the framework of circular coordinate representation, which allows us to perform dimension reduction and visualization for high-dimensional datasets on a torus using persistent cohomology. In this paper, we propose a method to adapt the circular coordinate framework to take into account sparsity in high-dimensional applications. We use a generalized penalty function instead of an $L_2$ penalty in the traditional circular coordinate algorithm. We provide simulation experiments and real data analysis to support our claim that circular coordinates with generalized penalty will accommodate the sparsity in high-dimensional datasets under different sampling schemes while preserving the topological structures.
  35. Contagion Dynamics for Manifold Learning (2020)

    Barbara I. Mahler
    Abstract Contagion maps exploit activation times in threshold contagions to assign vectors in high-dimensional Euclidean space to the nodes of a network. A point cloud that is the image of a contagion map reflects both the structure underlying the network and the spreading behaviour of the contagion on it. Intuitively, such a point cloud exhibits features of the network's underlying structure if the contagion spreads along that structure, an observation which suggests contagion maps as a viable manifold-learning technique. We test contagion maps as a manifold-learning tool on a number of different real-world and synthetic data sets, and we compare their performance to that of Isomap, one of the most well-known manifold-learning algorithms. We find that, under certain conditions, contagion maps are able to reliably detect underlying manifold structure in noisy data, while Isomap fails due to noise-induced error. This consolidates contagion maps as a technique for manifold learning.
  36. Persistent Homology Advances Interpretable Machine Learning for Nanoporous Materials (2020)

    Aditi S. Krishnapriyan, Joseph Montoya, Jens Hummelshøj, Dmitriy Morozov
    Abstract Machine learning for nanoporous materials design and discovery has emerged as a promising alternative to more time-consuming experiments and simulations. The challenge with this approach is the selection of features that enable universal and interpretable materials representations across multiple prediction tasks. We use persistent homology to construct holistic representations of the materials structure. We show that these representations can also be augmented with other generic features such as word embeddings from natural language processing to capture chemical information. We demonstrate our approach on multiple metal-organic framework datasets by predicting a variety of gas adsorption targets. Our results show considerable improvement in both accuracy and transferability across targets compared to models constructed from commonly used manually curated features. Persistent homology features allow us to locate the pores that correlate best to adsorption at different pressures, contributing to understanding atomic level structure-property relationships for materials design.
  37. Quantitative and Interpretable Order Parameters for Phase Transitions From Persistent Homology (2020)

    Alex Cole, Gregory J. Loges, Gary Shiu
    Abstract We apply modern methods in computational topology to the task of discovering and characterizing phase transitions. As illustrations, we apply our method to four two-dimensional lattice spin models: the Ising, square ice, XY, and fully-frustrated XY models. In particular, we use persistent homology, which computes the births and deaths of individual topological features as a coarse-graining scale or sublevel threshold is increased, to summarize multiscale and high-point correlations in a spin configuration. We employ vector representations of this information called persistence images to formulate and perform the statistical task of distinguishing phases. For the models we consider, a simple logistic regression on these images is sufficient to identify the phase transition. Interpretable order parameters are then read from the weights of the regression. This method suffices to identify magnetization, frustration, and vortex-antivortex structure as relevant features for phase transitions in our models. We also define "persistence" critical exponents and study how they are related to those critical exponents usually considered.
  38. Topological Echoes of Primordial Physics in the Universe at Large Scales (2020)

    Alex Cole, Matteo Biagetti, Gary Shiu
    Abstract We present a pipeline for characterizing and constraining initial conditions in cosmology via persistent homology. The cosmological observable of interest is the cosmic web of large scale structure, and the initial conditions in question are non-Gaussianities (NG) of primordial density perturbations. We compute persistence diagrams and derived statistics for simulations of dark matter halos with Gaussian and non-Gaussian initial conditions. For computational reasons and to make contact with experimental observations, our pipeline computes persistence in sub-boxes of full simulations and simulations are subsampled to uniform halo number. We use simulations with large NG ($f_{\rm NL}^{\rm loc}=250$) as templates for identifying data with mild NG ($f_{\rm NL}^{\rm loc}=10$), and running the pipeline on several cubic volumes of size $40~(\textrm{Gpc}/h)^3$, we detect $f_{\rm NL}^{\rm loc}=10$ at $97.5\%$ confidence on $\sim 85\%$ of the volumes for our best single statistic. Throughout we benefit from the interpretability of topological features as input for statistical inference, which allows us to make contact with previous first-principles calculations and make new predictions.
  39. Capturing Dynamics of Time-Varying Data via Topology (2020)

    Lu Xian, Henry Adams, Chad M. Topaz, Lori Ziegelmeier
    Abstract One approach to understanding complex data is to study its shape through the lens of algebraic topology. While the early development of topological data analysis focused primarily on static data, in recent years, theoretical and applied studies have turned to data that varies in time. A time-varying collection of metric spaces as formed, for example, by a moving school of fish or flock of birds, can contain a vast amount of information. There is often a need to simplify or summarize the dynamic behavior. We provide an introduction to topological summaries of time-varying metric spaces including vineyards [17], crocker plots [52], and multiparameter rank functions [34]. We then introduce a new tool to summarize time-varying metric spaces: a crocker stack. Crocker stacks are convenient for visualization, amenable to machine learning, and satisfy a desirable stability property which we prove. We demonstrate the utility of crocker stacks for a parameter identification task involving an influential model of biological aggregations [54]. Altogether, we aim to bring the broader applied mathematics community up-to-date on topological summaries of time-varying metric spaces.
  40. PersGNN: Applying Topological Data Analysis and Geometric Deep Learning to Structure-Based Protein Function Prediction (2020)

    Nicolas Swenson, Aditi S. Krishnapriyan, Aydin Buluc, Dmitriy Morozov, Katherine Yelick
    Abstract Understanding protein structure-function relationships is a key challenge in computational biology, with applications across the biotechnology and pharmaceutical industries. While it is known that protein structure directly impacts protein function, many functional prediction tasks use only protein sequence. In this work, we isolate protein structure to make functional annotations for proteins in the Protein Data Bank in order to study the expressiveness of different structure-based prediction schemes. We present PersGNN - an end-to-end trainable deep learning model that combines graph representation learning with topological data analysis to capture a complex set of both local and global structural features. While variations of these techniques have been successfully applied to proteins before, we demonstrate that our hybridized approach, PersGNN, outperforms either method on its own as well as a baseline neural network that learns from the same information. PersGNN achieves a 9.3% boost in area under the precision recall curve (AUPR) compared to the best individual model, as well as high F1 scores across different gene ontology categories, indicating the transferability of this approach.
  41. Uncovering the Topology of Time-Varying fMRI Data Using Cubical Persistence (2020)

    Bastian Rieck, Tristan Yates, Christian Bock, Karsten Borgwardt, Guy Wolf, Nicholas Turk-Browne, Smita Krishnaswamy
    Abstract Functional magnetic resonance imaging (fMRI) is a crucial technology for gaining insights into cognitive processes in humans. Data amassed from fMRI measurements result in volumetric data sets that vary over time. However, analysing such data presents a challenge due to the large degree of noise and person-to-person variation in how information is represented in the brain. To address this challenge, we present a novel topological approach that encodes each time point in an fMRI data set as a persistence diagram of topological features, i.e. high-dimensional voids present in the data. This representation naturally does not rely on voxel-by-voxel correspondence and is robust to noise. We show that these time-varying persistence diagrams can be clustered to find meaningful groupings between participants, and that they are also useful in studying within-subject brain state trajectories of subjects performing a particular task. Here, we apply both clustering and trajectory analysis techniques to a group of participants watching the movie 'Partly Cloudy'. We observe significant differences in both brain state trajectories and overall topological activity between adults and children watching the same movie.
  42. PI-Net: A Deep Learning Approach to Extract Topological Persistence Images (2020)

    Anirudh Som, Hongjun Choi, Karthikeyan N. Ramamurthy, Matthew Buman, Pavan Turaga
    Abstract Topological features such as persistence diagrams and their functional approximations like persistence images (PIs) have been showing substantial promise for machine learning and computer vision applications. This is greatly attributed to the robustness topological representations provide against different types of physical nuisance variables seen in real-world data, such as view-point, illumination, and more. However, key bottlenecks to their large scale adoption are computational expenditure and difficulty incorporating them in a differentiable architecture. We take an important step in this paper to mitigate these bottlenecks by proposing a novel one-step approach to generate PIs directly from the input data. We design two separate convolutional neural network architectures, one designed to take in multi-variate time series signals as input and another that accepts multi-channel images as input. We call these networks Signal PI-Net and Image PI-Net respectively. To the best of our knowledge, we are the first to propose the use of deep learning for computing topological features directly from data. We explore the use of the proposed PI-Net architectures on two applications: human activity recognition using tri-axial accelerometer sensor data and image classification. We demonstrate the ease of fusion of PIs in supervised deep learning architectures and speed up of several orders of magnitude for extracting PIs from data. Our code is available at https://github.com/anirudhsom/PI-Net.
  43. Prediction in Cancer Genomics Using Topological Signatures and Machine Learning (2020)

    Georgina Gonzalez, Arina Ushakova, Radmila Sazdanovic, Javier Arsuaga
    Abstract Copy Number Aberrations, gains and losses of genomic regions, are a hallmark of cancer and can be experimentally detected using microarray comparative genomic hybridization (aCGH). In previous works, we developed a topology based method to analyze aCGH data whose output are regions of the genome where copy number is altered in patients with a predetermined cancer phenotype. We call this method Topological Analysis of array CGH (TAaCGH). Here we combine TAaCGH with machine learning techniques to build classifiers using copy number aberrations. We chose logistic regression on two different binary phenotypes related to breast cancer to illustrate this approach. The first case consists of patients with over-expression of the ERBB2 gene. Over-expression of ERBB2 is commonly regulated by a copy number gain in chromosome arm 17q. TAaCGH found the region 17q11-q22 associated with the phenotype and using logistic regression we reduced this region to 17q12-q21.31 correctly classifying 78% of the ERBB2 positive individuals (sensitivity) in a validation data set. We also analyzed over-expression in Estrogen Receptor (ER), a second phenotype commonly observed in breast cancer patients and found that the region 5p14.3-12 together with six full arms were associated with the phenotype. Our method identified 4p, 6p and 16q as the strongest predictors correctly classifying 76% of ER positives in our validation data set. However, for this set there was a significant increase in the false positive rate (specificity). We suggest that topological and machine learning methods can be combined for prediction of phenotypes using genetic data.
  44. Topological Descriptors Help Predict Guest Adsorption in Nanoporous Materials (2020)

    Aditi S. Krishnapriyan, Maciej Haranczyk, Dmitriy Morozov
    Abstract Machine learning has emerged as an attractive alternative to experiments and simulations for predicting material properties. Usually, such an approach relies on specific domain knowledge for feature design: each learning target requires careful selection of features that an expert recognizes as important for the specific task. The major drawback of this approach is that computation of only a few structural features has been implemented so far, and it is difficult to tell a priori which features are important for a particular application. The latter problem has been empirically observed for predictors of guest uptake in nanoporous materials: local and global porosity features become dominant descriptors at low and high pressures, respectively. We investigate a feature representation of materials using tools from topological data analysis. Specifically, we use persistent homology to describe the geometry of nanoporous materials at various scales. We combine our topological descriptor with traditional structural features and investigate the relative importance of each to the prediction tasks. We demonstrate an application of this feature representation by predicting methane adsorption in zeolites, for pressures in the range of 1-200 bar. Our results not only show a considerable improvement compared to the baseline, but they also highlight that topological features capture information complementary to the structural features: this is especially important for the adsorption at low pressure, a task particularly difficult for the traditional features. Furthermore, by investigation of the importance of individual topological features in the adsorption model, we are able to pinpoint the location of the pores that correlate best to adsorption at different pressures, contributing to our atom-level understanding of structure-property relationships.
  45. Mapping Firms' Locations in Technological Space: A Topological Analysis of Patent Statistics (2020)

    Emerson G. Escolar, Yasuaki Hiraoka, Mitsuru Igami, Yasin Ozcan
    Abstract Where do firms innovate? Mapping their locations in technological space is difficult, because it is high dimensional and unstructured. We address this issue by using a method in computational topology called the Mapper algorithm, which combines local clustering with global reconstruction. We apply this method to a panel of 333 major firms’ patent portfolios in 1976–2005 across 430 technological areas. Results suggest the Mapper graph captures salient patterns in firms’ patenting histories, and our measures of their uniqueness (the length of “flares”) are correlated with firms’ financial performances in a statistically and economically significant manner. We then compare this approach with a widely used clustering method by Jaffe (1989) to highlight additional findings.
  46. Topological Analysis Reveals State Transitions in Human Gut and Marine Bacterial Communities (2020)

    William K. Chang, David VanInsberghe, Libusha Kelly
    Abstract Microbiome dynamics influence the health and functioning of human physiology and the environment and are driven in part by interactions between large numbers of microbial taxa, making large-scale prediction and modeling a challenge. Here, using topological data analysis, we identify states and dynamical features relevant to macroscopic processes. We show that gut disease processes and marine geochemical events are associated with transitions between community states, defined as topological features of the data density. We find a reproducible two-state succession during recovery from cholera in the gut microbiomes of multiple patients, evidence of dynamic stability in the gut microbiome of a healthy human after experiencing diarrhea during travel, and periodic state transitions in a marine Prochlorococcus community driven by water column cycling. Our approach bridges small-scale fluctuations in microbiome composition and large-scale changes in phenotype without details of underlying mechanisms, and provides an assessment of microbiome stability and its relation to human and environmental health.
  47. Topological Data Analysis of Single-Cell Hi-C Contact Maps (2020)

    Mathieu Carrière, Raúl Rabadán
    Abstract Due to recent breakthroughs in high-throughput sequencing, it is now possible to use chromosome conformation capture (CCC) to understand the three dimensional conformation of DNA at the whole genome level, and to characterize it with the so-called contact maps. This is very useful since many biological processes are correlated with DNA folding, such as DNA transcription. However, the methods for the analysis of such conformations are still lacking mathematical guarantees and statistical power. To handle this issue, we propose to use the Mapper, which is a standard tool of Topological Data Analysis (TDA) that allows one to efficiently encode the inherent continuity and topology of underlying biological processes in data, in the form of a graph with various features such as branches and loops. In this article, we show how recent statistical techniques developed in TDA for the Mapper algorithm can be extended and leveraged to formally define and statistically quantify the presence of topological structures coming from biological phenomena, such as the cell cycle, in datasets of CCC contact maps.
  48. Identification of Relevant Genetic Alterations in Cancer Using Topological Data Analysis (2020)

    Raúl Rabadán, Yamina Mohamedi, Udi Rubin, Tim Chu, Adam N. Alghalith, Oliver Elliott, Luis Arnés, Santiago Cal, Álvaro J. Obaya, Arnold J. Levine, Pablo G. Cámara
    Abstract Large-scale cancer genomic studies enable the systematic identification of mutations that lead to the genesis and progression of tumors, uncovering the underlying molecular mechanisms and potential therapies. While some such mutations are recurrently found in many tumors, many others exist solely within a few samples, precluding detection by conventional recurrence-based statistical approaches. Integrated analysis of somatic mutations and RNA expression data across 12 tumor types reveals that mutations of cancer genes are usually accompanied by substantial changes in expression. We use topological data analysis to leverage this observation and uncover 38 elusive candidate cancer-associated genes, including inactivating mutations of the metalloproteinase ADAMTS12 in lung adenocarcinoma. We show that ADAMTS12−/− mice have a five-fold increase in the susceptibility to develop lung tumors, confirming the role of ADAMTS12 as a tumor suppressor gene. Our results demonstrate that data integration through topological techniques can increase our ability to identify previously unreported cancer-related alterations. Rare cancer mutations are often missed using recurrence-based statistical approaches, but are usually accompanied by changes in expression. Here the authors leverage this information to uncover several elusive candidate cancer-associated genes using topological data analysis.
  49. A Topological Data Analysis Based Classification Method for Multiple Measurements (2019)

    Henri Riihimäki, Wojciech Chachólski, Jakob Theorell, Jan Hillert, Ryan Ramanujam
    Abstract Background: Machine learning models for repeated measurements are limited. Using topological data analysis (TDA), we present a classifier for repeated measurements which samples from the data space and builds a network graph based on the data topology. When applying this to two case studies, accuracy exceeds alternative models with additional benefits such as reporting data subsets with high purity along with feature values. Results: For 300 examples of 3 tree species, the accuracy reached 80% after 30 datapoints, which was improved to 90% after increased sampling to 400 datapoints. Using data from 100 examples of each of 6 point processes, the classifier achieved 96.8% accuracy. In both datasets, the TDA classifier outperformed an alternative model. Conclusions: This algorithm and software can be beneficial for repeated measurement data common in biological sciences, as both an accurate classifier and a feature selection tool.
  50. Text Classification via Network Topology: A Case Study on the Holy Quran (2019)

    Mehmet E. Aktas, Esra Akbas
    Abstract Due to the growth in the number of texts and documents available online, machine learning based text classification systems have recently become more popular. Feature extraction, converting unstructured text into a structured feature space, is one of the essential tasks for text classification. In this paper, we propose a novel feature extraction approach for text classification using the network representation of text, network topology, and machine learning techniques. We present experimental results on classifying the Holy Quran chapters based on the place each chapter was revealed to illustrate the effectiveness of the approach.
  51. Topological Machine Learning With Persistence Indicator Functions (2019)

    Bastian Rieck, Filip Sadlo, Heike Leitte
    Abstract Techniques from computational topology, in particular persistent homology, are becoming increasingly relevant for data analysis. Their stable metrics permit the use of many distance-based data analysis methods, such as multidimensional scaling, while providing a firm theoretical ground. Many modern machine learning algorithms, however, are based on kernels. This paper presents persistence indicator functions (PIFs), which summarize persistence diagrams, i.e., feature descriptors in topological data analysis. PIFs can be calculated and compared in linear time and have many beneficial properties, such as the availability of a kernel-based similarity measure. We demonstrate their usage in common data analysis scenarios, such as confidence set estimation and classification of complex structured data.
  52. Hyperparameter Optimization of Topological Features for Machine Learning Applications (2019)

    Francis Motta, Christopher Tralie, Rossella Bedini, Fabiano Bini, Gilberto Bini, Hamed Eramian, Marcio Gameiro, Steve Haase, Hugh Haddox, John Harer, Nick Leiby, Franco Marinozzi, Scott Novotney, Gabe Rocklin, Jed Singer, Devin Strickland, Matt Vaughn
    Abstract This paper describes a general pipeline for generating optimal vector representations of topological features of data for use with machine learning algorithms. This pipeline can be viewed as a costly black-box function defined over a complex configuration space, each point of which specifies both how features are generated and how predictive models are trained on those features. We propose using state-of-the-art Bayesian optimization algorithms to inform the choice of topological vectorization hyperparameters while simultaneously choosing learning model parameters. We demonstrate the need for and effectiveness of this pipeline using two difficult biological learning problems, and illustrate the nontrivial interactions between topological feature generation and learning model hyperparameters.
  53. Persistent Homology Machine Learning for Fingerprint Classification (2019)

    N. Giansiracusa, R. Giansiracusa, C. Moon
    Abstract The fingerprint classification problem is to sort fingerprints into predetermined groups, such as arch, loop, and whorl. It was asserted in the literature that minutiae points, which are commonly used for fingerprint matching, are not useful for classification. We show that, to the contrary, near state-of-the-art classification accuracy rates can be achieved when applying topological data analysis (TDA) to 3-dimensional point clouds of oriented minutiae points. We also apply TDA to fingerprint ink-roll images, which yields a lower accuracy rate but still shows promise; moreover, combining the two approaches outperforms each one individually. These methods use supervised learning applied to persistent homology and allow us to explore feature selection on barcodes, an important topic at the interface between TDA and machine learning. We test our classification algorithms on the NIST fingerprint database SD-27.
  54. Fast and Accurate Tumor Segmentation of Histology Images Using Persistent Homology and Deep Convolutional Features (2019)

    Talha Qaiser, Yee-Wah Tsang, Daiki Taniyama, Naoya Sakamoto, Kazuaki Nakane, David Epstein, Nasir Rajpoot
    Abstract Tumor segmentation in whole-slide images of histology slides is an important step towards computer-assisted diagnosis. In this work, we propose a tumor segmentation framework based on the novel concept of persistent homology profiles (PHPs). For a given image patch, the homology profiles are derived by efficient computation of persistent homology, which is an algebraic tool from homology theory. We propose an efficient way of computing topological persistence of an image, alternative to simplicial homology. The PHPs are devised to distinguish tumor regions from their normal counterparts by modeling the atypical characteristics of tumor nuclei. We propose two variants of our method for tumor segmentation: one that targets speed without compromising accuracy and the other that targets higher accuracy. The fast version is based on a selection of exemplar image patches from a convolutional neural network (CNN) and patch classification by quantifying the divergence between the PHPs of exemplars and the input image patch. Detailed comparative evaluation shows that the proposed algorithm is significantly faster than competing algorithms while achieving comparable results. The accurate version combines the PHPs and high-level CNN features and employs a multi-stage ensemble strategy for image patch labeling. Experimental results demonstrate that the combination of PHPs and CNN features outperforms competing algorithms. This study is performed on two independently collected colorectal datasets containing adenoma, adenocarcinoma, signet, and healthy cases. Collectively, the accurate tumor segmentation produces the highest average patch-level F1-score, as compared with competing algorithms, on malignant and healthy cases from both the datasets. Overall the proposed framework highlights the utility of persistent homology for histopathology image analysis.
  55. Analyzing Collective Motion With Machine Learning and Topology (2019)

    Dhananjay Bhaskar, Angelika Manhart, Jesse Milzman, John T. Nardini, Kathleen M. Storey, Chad M. Topaz, Lori Ziegelmeier
    Abstract We use topological data analysis and machine learning to study a seminal model of collective motion in biology [M. R. D’Orsogna et al., Phys. Rev. Lett. 96, 104302 (2006)]. This model describes agents interacting nonlinearly via attractive-repulsive social forces and gives rise to collective behaviors such as flocking and milling. To classify the emergent collective motion in a large library of numerical simulations and to recover model parameters from the simulation data, we apply machine learning techniques to two different types of input. First, we input time series of order parameters traditionally used in studies of collective motion. Second, we input measures based on topology that summarize the time-varying persistent homology of simulation data over multiple scales. This topological approach does not require prior knowledge of the expected patterns. For both unsupervised and supervised machine learning methods, the topological approach outperforms the one that is based on traditional order parameters.
  56. Hepatic Tumor Classification Using Texture and Topology Analysis of Non-Contrast-Enhanced Three-Dimensional T1-Weighted MR Images With a Radiomics Approach (2019)

    Asuka Oyama, Yasuaki Hiraoka, Ippei Obayashi, Yusuke Saikawa, Shigeru Furui, Kenshiro Shiraishi, Shinobu Kumagai, Tatsuya Hayashi, Jun’ichi Kotoku
    Abstract The purpose of this study is to evaluate the accuracy for classification of hepatic tumors by characterization of T1-weighted magnetic resonance (MR) images using two radiomics approaches with machine learning models: texture analysis and topological data analysis using persistent homology. This study assessed non-contrast-enhanced fat-suppressed three-dimensional (3D) T1-weighted images of 150 hepatic tumors. The lesions included 50 hepatocellular carcinomas (HCCs), 50 metastatic tumors (MTs), and 50 hepatic hemangiomas (HHs) found respectively in 37, 23, and 33 patients. For classification, texture features were calculated, and also persistence images of three types (degree 0, degree 1 and degree 2) were obtained for each lesion from the 3D MR imaging data. We used three classification models. In the classification of HCC and MT (resp. HCC and HH, HH and MT), we obtained accuracy of 92% (resp. 90%, 73%) by texture analysis, and the highest accuracy of 85% (resp. 84%, 74%) when degree 1 (resp. degree 1, degree 2) persistence images were used. Our methods using texture analysis or topological data analysis allow for classification of the three hepatic tumors with considerable accuracy, and thus might be useful when applied for computer-aided diagnosis with MR images.
  57. Topological Data Analysis for Genomics and Evolution: Topology in Biology (2019)

    Raul Rabadan, Andrew J. Blumberg
    Abstract Biology has entered the age of Big Data. A technical revolution has transformed the field, and extracting meaningful information from large biological data sets is now a central methodological challenge. Algebraic topology is a well-established branch of pure mathematics that studies qualitative descriptors of the shape of geometric objects. It aims to reduce comparisons of shape to a comparison of algebraic invariants, such as numbers, which are typically easier to work with. Topological data analysis is a rapidly developing subfield that leverages the tools of algebraic topology to provide robust multiscale analysis of data sets. This book introduces the central ideas and techniques of topological data analysis and its specific applications to biology, including the evolution of viruses, bacteria and humans, genomics of cancer, and single cell characterization of developmental processes. Bridging two disciplines, the book is for researchers and graduate students in genomics and evolutionary biology as well as mathematicians interested in applied topology.
  58. Molecular Phenotyping Using Networks, Diffusion, and Topology: Soft Tissue Sarcoma (2019)

    James C. Mathews, Maryam Pouryahya, Caroline Moosmüller, Yannis G. Kevrekidis, Joseph O. Deasy, Allen Tannenbaum
    Abstract Many biological datasets are high-dimensional yet manifest an underlying order. In this paper, we describe an unsupervised data analysis methodology that operates in the setting of a multivariate dataset and a network which expresses influence between the variables of the given set. The technique involves network geometry employing the Wasserstein distance, global spectral analysis in the form of diffusion maps, and topological data analysis using the Mapper algorithm. The prototypical application is to gene expression profiles obtained from RNA-Seq experiments on a collection of tissue samples, considering only genes whose protein products participate in a known pathway or network of interest. Employing the technique, we discern several coherent states or signatures displayed by the gene expression profiles of the sarcomas in the Cancer Genome Atlas along the TP53 (p53) signaling network. The signatures substantially recover the leiomyosarcoma, dedifferentiated liposarcoma (DDLPS), and synovial sarcoma histological subtype diagnoses, and they also include a new signature defined by activation and inactivation of about a dozen genes, including activation of serine endopeptidase inhibitor SERPINE1 and inactivation of TP53-family tumor suppressor gene TP73.
  59. Topological Gene Expression Networks Recapitulate Brain Anatomy and Function (2019)

    Alice Patania, Pierluigi Selvaggi, Mattia Veronese, Ottavia Dipasquale, Paul Expert, Giovanni Petri
    Abstract Understanding how gene expression translates to and affects human behavior is one of the ultimate goals of neuroscience. In this paper, we present a pipeline based on Mapper, a topological simplification tool, to analyze gene co-expression data. We first validate the method by reproducing key results from the literature on the Allen Human Brain Atlas and the correlations between resting-state fMRI and gene co-expression maps. We then analyze a dopamine-related gene set and find that co-expression networks produced by Mapper return a structure that matches the well-known anatomy of the dopaminergic pathway. Our results suggest that network based descriptions can be a powerful tool to explore the relationships between genetic pathways and their association with brain function and its perturbation due to illness and/or pharmacological challenges. In this paper, we described a gene co-expression analysis pipeline that produces networks that we show to be closely related to both brain function and neurotransmitter pathways. Our results suggest that this pipeline could be developed into a platform enabling the exploration of the effects of physiological and pathological alterations to specific gene sets, including profiling drug effects.
  60. Representability of Algebraic Topology for Biomolecules in Machine Learning Based Scoring and Virtual Screening (2018)

    Zixuan Cang, Lin Mu, Guo-Wei Wei
    Abstract This work introduces a number of algebraic topology approaches, including multi-component persistent homology, multi-level persistent homology, and electrostatic persistence for the representation, characterization, and description of small molecules and biomolecular complexes. In contrast to the conventional persistent homology, multi-component persistent homology retains critical chemical and biological information during the topological simplification of biomolecular geometric complexity. Multi-level persistent homology enables a tailored topological description of inter- and/or intra-molecular interactions of interest. Electrostatic persistence incorporates partial charge information into topological invariants. These topological methods are paired with Wasserstein distance to characterize similarities between molecules and are further integrated with a variety of machine learning algorithms, including k-nearest neighbors, ensemble of trees, and deep convolutional neural networks, to manifest their descriptive and predictive powers for protein-ligand binding analysis and virtual screening of small molecules. Extensive numerical experiments involving 4,414 protein-ligand complexes from the PDBBind database and 128,374 ligand-target and decoy-target pairs in the DUD database are performed to test respectively the scoring power and the discriminatory power of the proposed topological learning strategies. It is demonstrated that the present topological learning outperforms other existing methods in protein-ligand binding affinity prediction and ligand-decoy discrimination.
  61. Chatter Classification in Turning Using Machine Learning and Topological Data Analysis (2018)

    Firas A. Khasawneh, Elizabeth Munch, Jose A. Perea
    Abstract Chatter identification and detection in machining processes has been an active area of research in the past two decades. Part of the challenge in studying chatter is that machining equations that describe its occurrence are often nonlinear delay differential equations. The majority of the available tools for chatter identification rely on defining a metric that captures the characteristics of chatter, and a threshold that signals its occurrence. The difficulty in choosing these parameters can be somewhat alleviated by utilizing machine learning techniques. However, even with a successful classification algorithm, the transferability of typical machine learning methods from one data set to another remains very limited. In this paper we combine supervised machine learning with Topological Data Analysis (TDA) to obtain a descriptor of the process which can detect chatter. The features we use are derived from the persistence diagram of an attractor reconstructed from the time series via Takens embedding. We test the approach using deterministic and stochastic turning models, where the stochasticity is introduced via the cutting coefficient term. Our results show a 97% successful classification rate on the deterministic model labeled by the stability diagram obtained using the spectral element method. The features gleaned from the deterministic model are then utilized for characterization of chatter in a stochastic turning model where there are very limited analysis methods. Funding: This material is based upon work supported by the National Science Foundation under grant nos. CMMI-1759823 and DMS-1759824 with PI FAK, and CMMI-1800466 and DMS-1800446 with PI EM. JAP acknowledges the support of the NSF under grant DMS-1622301 and DARPA under grant HR0011-16-2-003.
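    Note: the attractor reconstruction mentioned in this abstract relies on a Takens (time-delay) embedding. The sketch below shows only that standard step, under assumptions: the function name `takens_embedding` and the example values for `dimension` and `delay` are hypothetical and are not taken from the paper. The resulting point cloud is what a persistent homology library would then take as input.

```python
def takens_embedding(series, dimension, delay):
    """Return delay vectors (x[i], x[i + delay], ..., x[i + (dimension - 1) * delay]).

    `series` is a sequence of scalar samples; `dimension` and `delay`
    are positive integers chosen by the analyst.
    """
    if dimension < 1 or delay < 1:
        raise ValueError("dimension and delay must be positive integers")
    n_points = len(series) - (dimension - 1) * delay
    if n_points <= 0:
        raise ValueError("series is too short for this dimension and delay")
    return [
        tuple(series[i + k * delay] for k in range(dimension))
        for i in range(n_points)
    ]


if __name__ == "__main__":
    signal = [0, 1, 2, 3, 4, 5]
    for point in takens_embedding(signal, dimension=3, delay=2):
        print(point)
```

    For a periodic signal the embedded points trace a closed curve, which is the kind of structure a degree-1 persistence diagram can pick up.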
  62. Towards a New Approach to Reveal Dynamical Organization of the Brain Using Topological Data Analysis (2018)

    Manish Saggar, Olaf Sporns, Javier Gonzalez-Castillo, Peter A. Bandettini, Gunnar Carlsson, Gary Glover, Allan L. Reiss
    Abstract Approaches describing how the brain changes to accomplish cognitive tasks tend to rely on collapsed data. Here, the authors present a new approach that maintains high dimensionality and use it to describe individual differences in how brain activity is represented and organized across different cognitive tasks.
  63. Improved Understanding of Aqueous Solubility Modeling Through Topological Data Analysis (2018)

    Mariam Pirashvili, Lee Steinberg, Francisco B. Guillamon, Mahesan Niranjan, Jeremy G. Frey, Jacek Brodzki
    Abstract Topological data analysis is a family of recent mathematical techniques seeking to understand the ‘shape’ of data, and has been used to understand the structure of the descriptor space produced from a standard chemical informatics software from the point of view of solubility. We have used the mapper algorithm, a TDA method that creates low-dimensional representations of data, to create a network visualization of the solubility space. While descriptors with clear chemical implications are prominent features in this space, reflecting their importance to the chemical properties, an unexpected and interesting correlation between chlorine content and rings and their implication for solubility prediction is revealed. A parallel representation of the chemical space was generated using persistent homology applied to molecular graphs. Links between this chemical space and the descriptor space were shown to be in agreement with chemical heuristics. The use of persistent homology on molecular graphs, extended by the use of norms on the associated persistence landscapes, allows the conversion of discrete shape descriptors to continuous ones, and a perspective of the application of these descriptors to quantitative structure property relations is presented.
  64. MRI and Biomechanics Multidimensional Data Analysis Reveals R2-R1ρ as an Early Predictor of Cartilage Lesion Progression in Knee Osteoarthritis (2018)

    Valentina Pedoia, Jenny Haefeli, Kazuhito Morioka, Hsiang-Ling Teng, Lorenzo Nardo, Richard B. Souza, Adam R. Ferguson, Sharmila Majumdar
    Abstract PURPOSE: To couple quantitative compositional MRI, gait analysis, and machine learning multidimensional data analysis to study osteoarthritis (OA). OA is a multifactorial disorder accompanied by biochemical and morphological changes in the articular cartilage, modulated by skeletal biomechanics and gait. While we can now acquire detailed information about the knee joint structure and function, we are not yet able to leverage the multifactorial factors for diagnosis and disease management of knee OA. MATERIALS AND METHODS: We mapped 178 subjects in a multidimensional space integrating: demographic, clinical information, gait kinematics and kinetics, cartilage compositional T1ρ and T2 and R2-R1ρ (1/T2 - 1/T1ρ) acquired at 3T and whole-organ magnetic resonance imaging score morphological grading. Topological data analysis (TDA) and Kolmogorov-Smirnov test were adopted for data integration, analysis, and hypothesis generation. Regression models were used for hypothesis testing. RESULTS: The results of the TDA showed a network composed of three main patient subpopulations, thus potentially identifying new phenotypes. T2 and T1ρ values (T2 lateral femur P = 1.45 × 10^-8, T1ρ medial tibia P = 1.05 × 10^-5), the presence of femoral cartilage defects (P = 0.0013), lesions in the meniscus body (P = 0.0035), and race (P = 2.44 × 10^-4) were key markers in the subpopulation classification. Within one of the subpopulations we observed an association between the composite metric R2-R1ρ and the longitudinal progression of cartilage lesions. CONCLUSION: The analysis presented demonstrates some of the complex multitissue biochemical and biomechanical interactions that define joint degeneration and OA using a multidimensional approach, and potentially indicates that R2-R1ρ may be an imaging biomarker for early OA. LEVEL OF EVIDENCE: 3 Technical Efficacy: Stage 2 J. Magn. Reson. Imaging 2018;47:78-90.
  65. Using Multidimensional Topological Data Analysis to Identify Traits of Hip Osteoarthritis (2018)

    Jasmine Rossi‐deVries, Valentina Pedoia, Michael A. Samaan, Adam R. Ferguson, Richard B. Souza, Sharmila Majumdar
    Abstract Background Osteoarthritis (OA) is a multifaceted disease with many variables affecting diagnosis and progression. Topological data analysis (TDA) is a state-of-the-art big data analytics tool that can combine all variables into multidimensional space. TDA is used to simultaneously analyze imaging and gait analysis techniques. Purpose To identify biochemical and biomechanical biomarkers able to classify different disease progression phenotypes in subjects with and without radiographic signs of hip OA. Study Type Longitudinal study for comparison of progressive and nonprogressive subjects. Population In all, 102 subjects with and without radiographic signs of hip osteoarthritis. Field Strength/Sequence 3T, SPGR 3D MAPSS T1ρ/T2, intermediate-weighted fat-suppressed fast spin-echo (FSE). Assessment Multidimensional data analysis including cartilage composition, bone shape, Kellgren–Lawrence (KL) classification of osteoarthritis, scoring hip osteoarthritis with MRI (SHOMRI), hip disability and osteoarthritis outcome score (HOOS). Statistical Tests Analysis done using TDA, Kolmogorov–Smirnov (KS) testing, and Benjamini-Hochberg to rank P-value results to correct for multiple comparisons. Results Subjects in the later stages of the disease had an increased SHOMRI score (P < 0.0001), increased KL (P = 0.0012), and older age (P < 0.0001). Subjects in the healthier group showed intact cartilage and less pain. Subjects found between these two groups had a range of symptoms. Analysis of this subgroup identified knee biomechanics (P < 0.0001) as an initial marker of the disease that is noticeable before the morphological progression and degeneration. Further analysis of an OA subgroup with femoroacetabular impingement (FAI) showed anterior labral tears to be the most significant marker (P = 0.0017) between those FAI subjects with and without OA symptoms. Data Conclusion The data-driven analysis obtained with TDA proposes new phenotypes of these subjects that partially overlap with the radiographic-based classical disease status classification and also shows the potential for further examination of an early onset biomechanical intervention. Level of Evidence: 2 Technical Efficacy: Stage 2 J. Magn. Reson. Imaging 2018;48:1046–1058.
  66. Persistence Images: A Stable Vector Representation of Persistent Homology (2017)

    Henry Adams, Tegan Emerson, Michael Kirby, Rachel Neville, Chris Peterson, Patrick Shipman, Sofya Chepushtanova, Eric Hanson, Francis Motta, Lori Ziegelmeier
    Abstract Many data sets can be viewed as a noisy sampling of an underlying space, and tools from topological data analysis can characterize this structure for the purpose of knowledge discovery. One such tool is persistent homology, which provides a multiscale description of the homological features within a data set. A useful representation of this homological information is a persistence diagram (PD). Efforts have been made to map PDs into spaces with additional structure valuable to machine learning tasks. We convert a PD to a finite-dimensional vector representation which we call a persistence image (PI), and prove the stability of this transformation with respect to small perturbations in the inputs. The discriminatory power of PIs is compared against existing methods, showing significant performance gains. We explore the use of PIs with vector-based machine learning tools, such as linear sparse support vector machines, which identify features containing discriminating topological information. Finally, high accuracy inference of parameter values from the dynamic output of a discrete dynamical system (the linked twist map) and a partial differential equation (the anisotropic Kuramoto-Sivashinsky equation) provide a novel application of the discriminatory power of PIs.
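    Note: a rough sketch of how a persistence diagram can be turned into a fixed-length vector in the spirit of a persistence image. It is a simplification under stated assumptions, not the authors' code: each point is moved to (birth, persistence) coordinates, weighted linearly by its persistence so that points on the diagonal contribute nothing, and smoothed with an isotropic Gaussian that is evaluated at pixel centers rather than integrated exactly over each pixel. The function name `persistence_image` and all parameter names are hypothetical.

```python
import math


def persistence_image(diagram, resolution, birth_range, pers_range, sigma):
    """Return a flattened resolution x resolution image (row-major list).

    Rows index persistence, columns index birth. Each pixel value
    approximates the integral of the weighted Gaussian sum over the
    pixel by (value at pixel center) * (pixel area).
    """
    b_lo, b_hi = birth_range
    p_lo, p_hi = pers_range
    db = (b_hi - b_lo) / resolution
    dp = (p_hi - p_lo) / resolution
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    image = [[0.0] * resolution for _ in range(resolution)]
    for birth, death in diagram:
        pers = death - birth
        weight = pers  # assumed linear weight; zero on the diagonal
        for row in range(resolution):
            y = p_lo + (row + 0.5) * dp
            for col in range(resolution):
                x = b_lo + (col + 0.5) * db
                sq_dist = (x - birth) ** 2 + (y - pers) ** 2
                gauss = norm * math.exp(-sq_dist / (2.0 * sigma ** 2))
                image[row][col] += weight * gauss * db * dp
    return [value for row in image for value in row]


if __name__ == "__main__":
    vec = persistence_image([(0.25, 1.0)], 4, (0.0, 2.0), (0.0, 2.0), 0.5)
    print(len(vec), vec.index(max(vec)))
```

    The flattened vector can be passed to any vector-based learner; the grid ranges, resolution, and `sigma` are tuning choices.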
  67. Identification of Topological Network Modules in Perturbed Protein Interaction Networks (2017)

    Mihaela E. Sardiu, Joshua M. Gilmore, Brad Groppe, Laurence Florens, Michael P. Washburn
    Abstract Biological networks consist of functional modules; however, detecting and characterizing such modules in networks remains challenging. Perturbing networks is one strategy for identifying modules. Here we used an advanced mathematical approach named topological data analysis (TDA) to interrogate two perturbed networks. In one, we disrupted the S. cerevisiae INO80 protein interaction network by isolating complexes after protein complex components were deleted from the genome. In the second, we reanalyzed previously published data demonstrating the disruption of the human Sin3 network with a histone deacetylase inhibitor. Here we show that disrupted networks contained topological network modules (TNMs) with shared properties that mapped onto distinct locations in networks. We define TNMs as proteins that occupy close network positions depending on their coordinates in a topological space. TNMs provide new insight into networks by capturing proteins from different categories including proteins within a complex, proteins with shared biological functions, and proteins disrupted across networks.
  68. Single-Cell Topological RNA-Seq Analysis Reveals Insights Into Cellular Differentiation and Development (2017)

    Abbas H. Rizvi, Pablo G. Camara, Elena K. Kandror, Thomas J. Roberts, Ira Schieren, Tom Maniatis, Raul Rabadan
    Abstract Transcriptional programs control cellular lineage commitment and differentiation during development. Understanding cell fate has been advanced by studying single-cell RNA-seq, but is limited by the assumptions of current analytic methods regarding the structure of data. We present single-cell topological data analysis (scTDA), an algorithm for topology-based computational analyses to study temporal, unbiased transcriptional regulation. Compared to other methods, scTDA is a non-linear, model-independent, unsupervised statistical framework that can characterize transient cellular states. We applied scTDA to the analysis of murine embryonic stem cell (mESC) differentiation in vitro in response to inducers of motor neuron differentiation. scTDA resolved asynchrony and continuity in cellular identity over time, and identified four transient states (pluripotent, precursor, progenitor, and fully differentiated cells) based on changes in stage-dependent combinations of transcription factors, RNA-binding proteins and long non-coding RNAs. scTDA can be applied to study asynchronous cellular responses to either developmental cues or environmental perturbations.
  69. Uncovering Precision Phenotype-Biomarker Associations in Traumatic Brain Injury Using Topological Data Analysis (2017)

    Jessica L. Nielson, Shelly R. Cooper, John K. Yue, Marco D. Sorani, Tomoo Inoue, Esther L. Yuh, Pratik Mukherjee, Tanya C. Petrossian, Jesse Paquette, Pek Y. Lum, Gunnar E. Carlsson, Mary J. Vassar, Hester F. Lingsma, Wayne A. Gordon, Alex B. Valadka, David O. Okonkwo, Geoffrey T. Manley, Adam R. Ferguson, TRACK-TBI Investigators
    Abstract Background Traumatic brain injury (TBI) is a complex disorder that is traditionally stratified based on clinical signs and symptoms. Recent imaging and molecular biomarker innovations provide unprecedented opportunities for improved TBI precision medicine, incorporating patho-anatomical and molecular mechanisms. Complete integration of these diverse data for TBI diagnosis and patient stratification remains an unmet challenge. Methods and findings The Transforming Research and Clinical Knowledge in Traumatic Brain Injury (TRACK-TBI) Pilot multicenter study enrolled 586 acute TBI patients and collected diverse common data elements (TBI-CDEs) across the study population, including imaging, genetics, and clinical outcomes. We then applied topology-based data-driven discovery to identify natural subgroups of patients, based on the TBI-CDEs collected. Our hypothesis was two-fold: 1) A machine learning tool known as topological data analysis (TDA) would reveal data-driven patterns in patient outcomes to identify candidate biomarkers of recovery, and 2) TDA-identified biomarkers would significantly predict patient outcome recovery after TBI using more traditional methods of univariate statistical tests. TDA algorithms organized and mapped the data of TBI patients in multidimensional space, identifying a subset of mild TBI patients with a specific multivariate phenotype associated with unfavorable outcome at 3 and 6 months after injury. Further analyses revealed that this patient subset had high rates of post-traumatic stress disorder (PTSD), and enrichment in several distinct genetic polymorphisms associated with cellular responses to stress and DNA damage (PARP1), and in striatal dopamine processing (ANKK1, COMT, DRD2). Conclusions TDA identified a unique diagnostic subgroup of patients with unfavorable outcome after mild TBI that were significantly predicted by the presence of specific genetic polymorphisms. Machine learning methods such as TDA may provide a robust method for patient stratification and treatment planning targeting identified biomarkers in future clinical trials in TBI patients. Trial Registration ClinicalTrials.gov Identifier NCT01565551
  70. The Classification of Endoscopy Images With Persistent Homology (2016)

    Olga Dunaeva, Herbert Edelsbrunner, Anton Lukyanov, Michael Machin, Daria Malkova, Roman Kuvaev, Sergey Kashin
    Abstract Aiming at the automatic diagnosis of tumors using narrow band imaging (NBI) magnifying endoscopic (ME) images of the stomach, we combine methods from image processing, topology, geometry, and machine learning to classify patterns into three classes: oval, tubular and irregular. Training the algorithm on a small number of images of each type, we achieve a high rate of correct classifications. The analysis of the learning algorithm reveals that a handful of geometric and topological features are responsible for the overwhelming majority of decisions.
  71. Omics-Based Strategies in Precision Medicine: Toward a Paradigm Shift in Inborn Errors of Metabolism Investigations (2016)

    Abdellah Tebani, Carlos Afonso, Stéphane Marret, Soumeya Bekri
    Abstract The rise of technologies that simultaneously measure thousands of data points represents the heart of systems biology. These technologies have had a huge impact on the discovery of next-generation diagnostics, biomarkers, and drugs in the precision medicine era. Systems biology aims to achieve systemic exploration of complex interactions in biological systems. Driven by high-throughput omics technologies and the computational surge, it enables multi-scale and insightful overviews of cells, organisms, and populations. Precision medicine capitalizes on these conceptual and technological advancements and stands on two main pillars: data generation and data modeling. High-throughput omics technologies allow the retrieval of comprehensive and holistic biological information, whereas computational capabilities enable high-dimensional data modeling and, therefore, accessible and user-friendly visualization. Furthermore, bioinformatics has enabled comprehensive multi-omics and clinical data integration for insightful interpretation. Despite their promise, the translation of these technologies into clinically actionable tools has been slow. In this review, we present state-of-the-art multi-omics data analysis strategies in a clinical context. The challenges of omics-based biomarker translation are discussed. Perspectives regarding the use of multi-omics approaches for inborn errors of metabolism (IEM) are presented by introducing a new paradigm shift in addressing IEM investigations in the post-genomic era.
  72. Visualizing Emergent Identity of Assemblages in the Consumer Internet of Things: A Topological Data Analysis Approach (2016)

    Thomas Novak, Donna L. Hoffman
    Abstract The identity of a consumer Internet of Things (IoT) assemblage emerges through a historical process of ongoing interactions among consumers, smart devices, and digital information. Topological Data Analysis (TDA), consistent with mathematical aspects of assemblage theory, is used to visualize the underlying possibility space from which individual IoT assemblages emerge.
  73. Toward Automated Prediction of Manufacturing Productivity Based on Feature Selection Using Topological Data Analysis (2016)

    Wei Guo, Ashis G. Banerjee
    Abstract In this paper, we extend the application of topological data analysis (TDA) to the field of manufacturing for the first time to the best of our knowledge. We apply a particular TDA method, known as the Mapper algorithm, on a benchmark chemical processing data set. The algorithm yields a topological network that captures the intrinsic clusters and connections among the clusters present in the high-dimensional data set, which are difficult to detect using traditional methods. We select key process variables or features that impact the final product yield by analyzing the shape of this network. We then use three prediction models to evaluate the impact of the selected features. Results show that the models achieve the same level of high prediction accuracy as with all the process variables, thereby, providing a way to carry out process monitoring and control in a more cost-effective manner.
  74. Tracking Resilience to Infections by Mapping Disease Space (2016)

    Brenda Y. Torres, Jose H. Oliveira, Ann T. Tate, Poonam Rath, Katherine Cumnock, David S. Schneider
    Abstract Infected hosts differ in their responses to pathogens; some hosts are resilient and recover their original health, whereas others follow a divergent path and die. To quantitate these differences, we propose mapping the routes infected individuals take through “disease space.” We find that when plotting physiological parameters against each other, many pairs have hysteretic relationships that identify the current location of the host and predict the future route of the infection. These maps can readily be constructed from experimental longitudinal data, and we provide two methods to generate the maps from the cross-sectional data that is commonly gathered in field trials. We hypothesize that resilient hosts tend to take small loops through disease space, whereas nonresilient individuals take large loops. We support this hypothesis with experimental data in mice infected with Plasmodium chabaudi, finding that dying mice trace a large arc in red blood cells (RBCs) by reticulocyte space as compared to surviving mice. We find that human malaria patients who are heterozygous for sickle cell hemoglobin occupy a small area of RBCs by reticulocyte space, suggesting this approach can be used to distinguish resilience in human populations. This technique should be broadly useful in describing the in-host dynamics of infections in both model hosts and patients at both population and individual levels.
  75. WDR76 Co-Localizes With Heterochromatin Related Proteins and Rapidly Responds to DNA Damage (2016)

    Joshua M. Gilmore, Mihaela E. Sardiu, Brad D. Groppe, Janet L. Thornton, Xingyu Liu, Gerald Dayebgadoh, Charles A. Banks, Brian D. Slaughter, Jay R. Unruh, Jerry L. Workman, Laurence Florens, Michael P. Washburn
    Abstract Proteins that respond to DNA damage play critical roles in normal and diseased states in human biology. Studies have suggested that the S. cerevisiae protein CMR1/YDL156w is associated with histones and is possibly associated with DNA repair and replication processes. Through a quantitative proteomic analysis of affinity purifications here we show that the human homologue of this protein, WDR76, shares multiple protein associations with the histones H2A, H2B, and H4. Furthermore, our quantitative proteomic analysis of WDR76 associated proteins demonstrated links to proteins in the DNA damage response like PARP1 and XRCC5 and heterochromatin related proteins like CBX1, CBX3, and CBX5. Co-immunoprecipitation studies validated these interactions. Next, quantitative imaging studies demonstrated that WDR76 was recruited to laser induced DNA damage immediately after induction, and we compared the recruitment of WDR76 to laser induced DNA damage to known DNA damage proteins like PARP1, XRCC5, and RPA1. In addition, WDR76 co-localizes to puncta with the heterochromatin proteins CBX1 and CBX5, which are also recruited to DNA damage but much less intensely than WDR76. This work demonstrates the chromatin and DNA damage protein associations of WDR76 and demonstrates the rapid response of WDR76 to laser induced DNA damage.
  76. Topological Data Analysis: A Promising Big Data Exploration Tool in Biology, Analytical Chemistry and Physical Chemistry (2016)

    Marc Offroy, Ludovic Duponchel
    Abstract An important feature of experimental science is that data of various kinds is being produced at an unprecedented rate. This is mainly due to the development of new instrumental concepts and experimental methodologies. It is also clear that the nature of acquired data is significantly different. Indeed, in every area of science, data take the form of ever bigger tables, where all but a few of the columns (i.e. variables) turn out to be irrelevant to the questions of interest, and further, we do not necessarily know which coordinates are the interesting ones. Big data in our lab of biology, analytical chemistry or physical chemistry is a future that might be closer than any of us suppose. It is in this sense that new tools have to be developed in order to explore and valorize such data sets. Topological data analysis (TDA) is one of these. It was developed recently by topologists who discovered that topological concepts could be useful for data analysis. The main objective of this paper is to answer the question of why topology is well suited for the analysis of big data sets in many areas and even more efficient than conventional data analysis methods. Raman analysis of single bacteria should provide a good opportunity to demonstrate the potential of TDA for the exploration of various spectroscopic data sets considering different experimental conditions (with high noise level, with/without spectral preprocessing, with wavelength shift, with different spectral resolution, with missing data).
  77. Topological Data Analysis for Discovery in Preclinical Spinal Cord Injury and Traumatic Brain Injury (2015)

    Jessica L. Nielson, Jesse Paquette, Aiwen W. Liu, Cristian F. Guandique, C. A. Tovar, Tomoo Inoue, Karen-Amanda Irvine, John C. Gensel, Jennifer Kloke, Tanya C. Petrossian, Pek Y. Lum, Gunnar E. Carlsson, Geoffrey T. Manley, Wise Young, Michael S. Beattie, Jacqueline C. Bresnahan, Adam R. Ferguson
    Abstract Data-driven discovery in complex neurological disorders has potential to extract meaningful knowledge from large, heterogeneous datasets. Here the authors apply topological data analysis to assess therapeutic effects in preclinical traumatic brain injury and spinal cord injury research studies.
  78. Conserved Abundance and Topological Features in Chromatin-Remodeling Protein Interaction Networks (2015)

    Mihaela E. Sardiu, Joshua M. Gilmore, Brad D. Groppe, Damir Herman, Sreenivasa R. Ramisetty, Yong Cai, Jingji Jin, Ronald C. Conaway, Joan W. Conaway, Laurence Florens, Michael P. Washburn
    Abstract The study of conserved protein interaction networks seeks to better understand the evolution and regulation of protein interactions. Here, we present a quantitative proteomic analysis of 18 orthologous baits from three distinct chromatin-remodeling complexes in Saccharomyces cerevisiae and Homo sapiens. We demonstrate that abundance levels of orthologous proteins correlate strongly between the two organisms and both networks have highly similar topologies. We therefore used the protein abundances in one species to cross-predict missing protein abundance levels in the other species. Lastly, we identified a novel conserved low-abundance subnetwork further demonstrating the value of quantitative analysis of networks.
  79. Topic Detection in Twitter Using Topology Data Analysis (2015)

    Pablo Torres-Tramón, Hugo Hromic, Bahareh R. Heravi
    Abstract The massive volume of content generated by social media greatly exceeds human capacity to manually process this data in order to identify topics of interest. As a solution, various automated topic detection approaches have been proposed, most of which are based on document clustering and burst detection. These approaches normally represent textual features in standard n-dimensional Euclidean metric spaces. However, in these cases, directly filtering noisy documents is challenging for topic detection. Instead we propose Topol, a topic detection method based on Topology Data Analysis (TDA) that transforms the Euclidean feature space into a topological space where the shapes of noisy irrelevant documents are much easier to distinguish from topically-relevant documents. This topological space is organised in a network according to the connectivity of the points, i.e. the documents, and by only filtering based on the size of the connected components we obtain competitive results compared to other state of the art topic detection methods.
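    The filtering step described in this abstract (connect nearby documents, then discard small connected components) can be sketched as follows. This is a minimal illustration, not the Topol implementation: the fixed-radius neighborhood graph, the `eps` and `min_size` parameters, and the toy 2-D points standing in for document feature vectors are all assumptions made for the example.

```python
from collections import Counter
from math import dist


def component_labels(points, eps):
    """Label each point with the root of its connected component
    in the graph that links points at distance <= eps."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= eps:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri
    return [find(i) for i in range(len(points))]


def filter_small_components(points, eps, min_size):
    """Keep only points whose connected component has at least min_size members."""
    labels = component_labels(points, eps)
    sizes = Counter(labels)
    return [p for p, lab in zip(points, labels) if sizes[lab] >= min_size]


# Hypothetical data: one dense group of three, one isolated point, one pair.
docs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (9.0, 9.0), (9.1, 9.0)]
kept = filter_small_components(docs, eps=0.2, min_size=3)
print(kept)
```

    With these toy values only the three-point group survives; the isolated point and the pair are treated as noise.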
  80. Topographical Transcriptome Mapping of the Mouse Medial Ganglionic Eminence by Spatially Resolved RNA-seq (2014)

    Sabrina Zechel, Pawel Zajac, Peter Lönnerberg, Carlos F. Ibáñez, Sten Linnarsson
    Abstract Cortical interneurons originating from the medial ganglionic eminence, MGE, are among the most diverse cells within the CNS. Different pools of proliferating progenitor cells are thought to exist in the ventricular zone of the MGE, but whether the underlying subventricular and mantle regions of the MGE are spatially patterned has not yet been addressed. Here, we combined laser-capture microdissection and multiplex RNA-sequencing to map the transcriptome of MGE cells at a spatial resolution of 50 μm.
  81. Topological Pattern Recognition for Point Cloud Data (2014)

    Gunnar Carlsson
    Abstract In this paper we discuss the adaptation of the methods of homology from algebraic topology to the problem of pattern recognition in point cloud data sets. The method is referred to as persistent homology, and has numerous applications to scientific problems. We discuss the definition and computation of homology in the standard setting of simplicial complexes and topological spaces, then show how one can obtain useful signatures, called barcodes, from finite metric spaces, thought of as sampled from a continuous object. We present several different cases where persistent homology is used, to illustrate the different ways in which the method can be applied.
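    As a small illustration of how a barcode can be obtained from a finite metric space, the sketch below computes only the degree-0 (connected component) bars. It is not taken from the paper: it assumes Euclidean points, assumes an edge between two points appears at the scale equal to their distance, and the function name `h0_barcode` is hypothetical. Every component is born at scale 0 and dies at the scale where it merges into another; one bar never dies.

```python
from itertools import combinations
from math import dist, inf


def h0_barcode(points):
    """Return degree-0 persistence bars (birth, death) for a point cloud."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process candidate edges from shortest to longest (Kruskal-style).
    edges = sorted(
        (dist(points[i], points[j]), i, j) for i, j in combinations(range(n), 2)
    )
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            bars.append((0.0, d))  # one component dies at scale d
    if n:
        bars.append((0.0, inf))  # the last surviving component
    return bars


# Hypothetical cloud: two tight pairs that are far apart.
cloud = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
bars = h0_barcode(cloud)
print(bars)
```

    The two short bars reflect points merging inside each pair, and the long finite bar ending at 9 reflects the gap between the two pairs, which is the kind of signature a barcode is meant to expose.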
  82. Reconceiving the Hippocampal Map as a Topological Template (2014)

    Yuri Dabaghian, Vicky L. Brandt, Loren M. Frank
    Abstract The role of the hippocampus in spatial cognition is incontrovertible yet controversial. Place cells, initially thought to be location-specifiers, turn out to respond promiscuously to a wide range of stimuli. Here we test the idea, which we have recently demonstrated in a computational model, that the hippocampal place cells may ultimately be interested in a space's topological qualities (its connectivity) more than its geometry (distances and angles); such higher-order functioning would be more consistent with other known hippocampal functions. We recorded place cell activity in rats exploring morphing linear tracks that allowed us to dissociate the geometry of the track from its topology. The resulting place fields preserved the relative sequence of places visited along the track but did not vary with the metrical features of the track or the direction of the rat's movement. These results suggest a reinterpretation of previous studies and new directions for future experiments.
  83. CD8 T-Cell Reactivity to Islet Antigens Is Unique to Type 1 While CD4 T-Cell Reactivity Exists in Both Type 1 and Type 2 Diabetes (2014)

    Ghanashyam Sarikonda, Jeremy Pettus, Sonal Phatak, Sowbarnika Sachithanantham, Jacqueline F. Miller, Johnna D. Wesley, Eithon Cadag, Ji Chae, Lakshmi Ganesan, Ronna Mallios, Steve Edelman, Bjoern Peters, Matthias von Herrath
    Abstract Previous cross-sectional analyses demonstrated that CD8+ and CD4+ T-cell reactivity to islet-specific antigens was more prevalent in T1D subjects than in healthy donors (HD). Here, we examined T1D-associated epitope-specific CD4+ T-cell cytokine production and autoreactive CD8+ T-cell frequency on a monthly basis for one year in 10 HD, 33 subjects with T1D, and 15 subjects with T2D. Autoreactive CD4+ T-cells from both T1D and T2D subjects produced more IFN-γ when stimulated than cells from HD. In contrast, higher frequencies of islet antigen-specific CD8+ T-cells were detected only in T1D. These observations support the hypothesis that general beta-cell stress drives autoreactive CD4+ T-cell activity while islet over-expression of MHC class I commonly seen in T1D mediates amplification of CD8+ T-cells and more rapid beta-cell loss. In conclusion, CD4+ T-cell autoreactivity appears to be present in both T1D and T2D while autoreactive CD8+ T-cells are unique to T1D. Thus, autoreactive CD8+ cells may serve as a more T1D-specific biomarker.
  84. Topological Methods Reveal High and Low Functioning Neuro-Phenotypes Within Fragile X Syndrome (2014)

    David Romano, Monica Nicolau, Eve-Marie Quintin, Paul K. Mazaika, Amy A. Lightbody, Heather C. Hazlett, Joseph Piven, Gunnar Carlsson, Allan L. Reiss
    Abstract Fragile X syndrome (FXS), due to mutations of the FMR1 gene, is the most common known inherited cause of developmental disability as well as the most common single-gene risk factor for autism. Our goal was to examine variation in brain structure in FXS with topological data analysis (TDA), and to assess how such variation is associated with measures of IQ and autism-related behaviors. To this end, we analyzed imaging and behavioral data from young boys (n = 52; aged 1.57–4.15 years) diagnosed with FXS. Application of topological methods to structural MRI data revealed two large subgroups within the study population. Comparison of these subgroups showed significant between-subgroup neuroanatomical differences similar to those previously reported to distinguish children with FXS from typically developing controls (e.g., enlarged caudate). In addition to neuroanatomy, the groups showed significant differences in IQ and autism severity scores. These results suggest that despite arising from a single gene mutation, FXS may encompass two biologically and clinically separable phenotypes. In addition, these findings underscore the potential of TDA as a powerful tool in the search for biological phenotypes of neuropsychiatric disorders. Hum Brain Mapp 35:4904–4915, 2014. © 2014 Wiley Periodicals, Inc.
  85. Topological Data Analysis of Escherichia Coli O157:H7 and Non-O157 Survival in Soils (2014)

    Abasiofiok M. Ibekwe, Jincai Ma, David E. Crowley, Ching-Hong Yang, Alexis M. Johnson, Tanya C. Petrossian, Pek Y. Lum
    Abstract Shiga toxin-producing E. coli O157:H7 and non-O157 have been implicated in many foodborne illnesses caused by the consumption of contaminated fresh produce. However, data on their persistence in soils are limited due to the complexity in datasets generated from different environmental variables and bacterial taxa. There is a continuing need to distinguish the various environmental variables and different bacterial groups to understand the relationships among these factors and the pathogen survival. Using an approach called Topological Data Analysis (TDA), we reconstructed the relationship structure of E. coli O157 and non-O157 survival in 32 soils (16 organic and 16 conventionally managed soils) from California (CA) and Arizona (AZ) with a multi-resolution output. In our study, we took a community approach based on the total soil microbiome to study community-level survival, examining the network of the community as a whole and the relationship between its topology and biological processes. TDA produces a geometric representation of complex data sets. Network analysis showed that Shiga toxin negative strain E. coli O157:H7 4554 survived significantly longer in comparison to E. coli O157:H7 EDL933, while the survival time of E. coli O157:NM was comparable to that of E. coli O157:H7 strain 933 in all of the tested soils. Two non-O157 strains, E. coli O26:H11 and E. coli O103:H2, survived much longer than E. coli O91:H21 and the three strains of E. coli O157. We show that there are complex interactions between E. coli strain survival, microbial community structures, and soil parameters.
  86. Extracting Insights From the Shape of Complex Data Using Topology (2013)

    P. Y. Lum, G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan, J. Carlsson, G. Carlsson
    Abstract This paper applies topological methods to study complex high dimensional data sets by extracting shapes (patterns) and obtaining insights about them. Our method combines the best features of existing standard methodologies such as principal component and cluster analyses to provide a geometric representation of complex data sets. Through this hybrid method, we often find subgroups in data sets that traditional methodologies fail to find. Our method also permits the analysis of individual data sets as well as the analysis of relationships between related data sets. We illustrate the use of our method by applying it to three very different kinds of data, namely gene expression from breast tumors, voting data from the United States House of Representatives and player performance data from the NBA, in each case finding stratifications of the data which are more refined than those produced by standard methods.
  87. Topology Based Data Analysis Identifies a Subgroup of Breast Cancers With a Unique Mutational Profile and Excellent Survival (2011)

    Monica Nicolau, Arnold J. Levine, Gunnar Carlsson
    Abstract High-throughput biological data, whether generated as sequencing, transcriptional microarrays, proteomic, or other means, continues to require analytic methods that address its high dimensional aspects. Because the computational part of data analysis ultimately identifies shape characteristics in the organization of data sets, the mathematics of shape recognition in high dimensions continues to be a crucial part of data analysis. This article introduces a method that extracts information from high-throughput microarray data and, by using topology, provides greater depth of information than current analytic techniques. The method, termed Progression Analysis of Disease (PAD), first identifies robust aspects of cluster analysis, then goes deeper to find a multitude of biologically meaningful shape characteristics in these data. Additionally, because PAD incorporates a visualization tool, it provides a simple picture or graph that can be used to further explore these data. Although PAD can be applied to a wide range of high-throughput data types, it is used here as an example to analyze breast cancer transcriptional data. This identified a unique subgroup of Estrogen Receptor-positive (ER+) breast cancers that express high levels of c-MYB and low levels of innate inflammatory genes. These patients exhibit 100% survival and no metastasis. No supervised step beyond distinction between tumor and healthy patients was used to identify this subtype. The group has a clear and distinct, statistically significant molecular signature, it highlights coherent biology but is invisible to cluster methods, and does not fit into the accepted classification of Luminal A/B, Normal-like subtypes of ER+ breast cancers. We denote the group as c-MYB+ breast cancer.
  88. Structural Insight Into RNA Hairpin Folding Intermediates (2008)

    Gregory R. Bowman, Xuhui Huang, Yuan Yao, Jian Sun, Gunnar Carlsson, Leonidas J. Guibas, Vijay S. Pande
    Abstract Hairpins are a ubiquitous secondary structure motif in RNA molecules. Despite their simple structure, there is some debate over whether they fold in a two-state or multi-state manner. We have studied the folding of a small tetraloop hairpin using a serial version of replica exchange molecular dynamics on a distributed computing environment. On the basis of these simulations, we have identified a number of intermediates that are consistent with experimental results. We also find that folding is not simply the reverse of high-temperature unfolding and suggest that this may be a general feature of biomolecular folding.
  89. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition (2007)

    Gurjeet Singh, Facundo Mémoli, Gunnar Carlsson
    Abstract We present a computational method for extracting simple descriptions of high dimensional data sets in the form of simplicial complexes. Our method, called Mapper, is based on the idea of partial clustering of the data guided by a set of functions defined on the data. The proposed method is not dependent on any particular clustering algorithm, i.e. any clustering algorithm may be used with Mapper. We implement this method and present a few sample applications in which simple descriptions of the data present important information about its structure.
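    The idea of partial clustering guided by a function on the data can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a single real-valued filter, a cover of equal-length overlapping intervals, and a basic single-linkage clusterer with a fixed threshold `eps` (the abstract notes that any clustering algorithm may be used). All function and parameter names are hypothetical, and only the 1-skeleton (nodes and edges) of the simplicial complex is built.

```python
from itertools import combinations
from math import cos, dist, pi, sin


def single_linkage(indices, points, eps):
    """Partition indices into clusters by chaining points at distance <= eps."""
    clusters = []
    unvisited = set(indices)
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in unvisited if dist(points[i], points[j]) <= eps}
            unvisited -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper(points, filter_fn, n_intervals, overlap, eps):
    """Return (nodes, edges): nodes are clusters of point indices,
    edges join clusters that share at least one point."""
    values = [filter_fn(p) for p in points]
    lo, hi = min(values), max(values)
    length = (hi - lo) / n_intervals
    pad = overlap * length  # how far each interval extends into its neighbors
    nodes = []
    for k in range(n_intervals):
        a = lo + k * length - pad
        b = lo + (k + 1) * length + pad
        members = [i for i, v in enumerate(values) if a <= v <= b]
        # Cluster only the points whose filter value falls in this interval.
        nodes.extend(single_linkage(members, points, eps))
    edges = {
        (u, v)
        for u, v in combinations(range(len(nodes)), 2)
        if nodes[u] & nodes[v]
    }
    return nodes, edges


# Hypothetical data: 40 evenly spaced points on the unit circle,
# filtered by their x-coordinate.
circle = [(cos(2 * pi * k / 40), sin(2 * pi * k / 40)) for k in range(40)]
nodes, edges = mapper(circle, lambda p: p[0], n_intervals=4, overlap=0.25, eps=0.3)
print(len(nodes), len(edges))
```

    On this toy circle the output graph is itself a single loop, so the simple description retains the one-dimensional hole in the data. The result depends on the choice of filter, cover, and clustering threshold, all of which are assumptions here.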