Causal Network Inference of High-Throughput Data with Structural Equation Models

Tarantino, Barbara

Network-based approaches to human disease have numerous biological and therapeutic applications. A deeper understanding of the role of cellular interconnection in disease progression may lead to the identification of disease genes and disease pathways, which in turn could lead to better therapeutic targets. With the introduction of high-throughput sequencing (HTS) in molecular biology and medicine, scalable statistical solutions for modeling complex biological systems have become critically important. To test new hypotheses and deepen our understanding of physiological processes and diseases, vast amounts of newly collected heterogeneous data must be integrated with current knowledge, a challenge driven by the growing number of platforms and possible experimental scenarios. Although network theory provides a framework for describing biological systems and exploring their hidden features, existing algorithms still show low reproducibility and robustness, depend on user-defined configuration, and are difficult to interpret. This thesis is divided into seven chapters: an introduction, a conclusion, and five independent sections that report the related studies. Chapter 1 proposes the R package SEMgraph, which combines network analysis and causal inference within the framework of Structural Equation Modeling (SEM). It offers a fully automated framework for managing complex biological systems as multivariate networks, ensuring adaptability and accuracy through data-driven model construction and perturbation evaluation, and producing results that are easy to interpret in terms of causal relationships between system components. SEMgraph provides several algorithms for the analysis of high-dimensional networks. In particular, Chapter 2 introduces SEMgsa(), a topology-based algorithm developed within the SEM framework. 
It uses pathway perturbation statistics and topological information to reveal biological information. In comparisons with other approaches, SEMgsa() outperforms current software tools and is highly sensitive to disease-specific pathways. Chapter 3 introduces SEMtree(), a tree-based structure learning approach built on SEM. Starting from interactome and gene expression data, it recovers a tree-based structure. Compared with other methods, SEMtree() captures biologically significant sub-networks with straightforward directed path visualization, effective perturbation extraction, and good classifier performance. Chapter 4 covers SEMbap(), a two-stage deconfounding method incorporated into the SEM framework and based on the Bow-free Acyclic Paths (BAP) search. It handles unobserved confounding factors so that the biological signals of interest can be quantified correctly. Compared with previous approaches, the BAP search algorithm accurately detects hidden confounding while limiting the false positive rate and attaining acceptable fitting and perturbation metrics. Finally, Chapter 5 presents the SEMdag() algorithm, a two-stage order-based search that uses either a prior knowledge-based or a data-driven approach, under the assumption of a linear SEM with equal-variance error terms. Compared with existing methods in the literature, our methodology shows low computational burden and high classification performance in out-of-sample disease predictions.
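
To illustrate why the equal-variance assumption makes an order-based search possible, the following minimal Python sketch recovers a topological ordering from data alone. It is not the SEMdag() implementation, and it covers only an ordering stage, not the subsequent edge selection. The function names `equal_variance_order` and `simulate_chain` are hypothetical, and the procedure shown is the generic top-down rule for linear SEMs with equal error variances: repeatedly pick the variable with the smallest conditional variance given the variables already ordered.

```python
import numpy as np


def equal_variance_order(X):
    """Return a variable ordering for data X (n samples x p variables).

    Assumes a linear SEM with equal error variances (illustrative rule,
    not the SEMdag() code): a variable whose parents are all already
    ordered has the smallest conditional variance among the remaining ones.
    """
    S = np.cov(X, rowvar=False)          # sample covariance matrix
    remaining = list(range(S.shape[0]))
    order = []
    while remaining:
        if order:
            S_oo = S[np.ix_(order, order)]
            cond_var = []
            for j in remaining:
                s_jo = S[j, order]
                # Var(X_j | X_order) = S_jj - S_jo S_oo^{-1} S_oj
                cond_var.append(S[j, j] - s_jo @ np.linalg.solve(S_oo, s_jo))
        else:
            # First step: marginal variances identify a source node
            cond_var = [S[j, j] for j in remaining]
        best = remaining[int(np.argmin(cond_var))]
        order.append(best)
        remaining.remove(best)
    return order


def simulate_chain(n=5000, seed=0):
    """Simulate x0 -> x1 -> x2 with unit-variance errors (toy data)."""
    rng = np.random.default_rng(seed)
    e = rng.normal(size=(n, 3))
    x0 = e[:, 0]
    x1 = 0.9 * x0 + e[:, 1]
    x2 = 0.9 * x1 + e[:, 2]
    # Columns are shuffled on purpose: column 0 = x2, column 1 = x0, column 2 = x1
    return np.column_stack([x2, x0, x1])


if __name__ == "__main__":
    print(equal_variance_order(simulate_chain()))
```

On the toy chain, the recovered ordering lists the column holding x0 first, then x1, then x2. In a two-stage scheme, a second step would then select parents for each variable among its predecessors in this ordering, for example by penalized regression.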