HELPER: A BIOINFORMATICS PLATFORM FOR CUSTOMIZING NGS PIPELINES

Urtis, Mario

Next generation sequencing (NGS) technologies have revolutionized genetics and medicine, strongly influencing the diagnosis of hereditary diseases. The large number of applications, in both diagnostics and research, has created the need to adapt the analysis of the data these technologies produce in order to optimize the clinical pathway for many human diseases. The analysis is implemented as a series of consecutive transformations of the genetic data (a pipeline) using bioinformatics tools and software. The performance of a given tool often depends on the type of input data, so integrating software suited to each data type is a critical step for the quality of the information produced. Furthermore, using bioinformatics tools, configuring them, designing robust pipelines, and developing new analysis solutions are complex tasks that require coding skills and knowledge of the wide range of existing tools. In this context, bioinformaticians have taken on a key role within genetics laboratories, because they combine the ability to develop computer systems with knowledge of the target biological systems and related applications; these tailored “in-house” activities help adapt the analyses to each specific question or objective. Laboratories that outsource their analyses or rely on commercial software, which applies the same rules and systems to all genes without distinction, often find it difficult to optimize the analytical workflow. Hence the growing need for simple and fast tools that can support professionals with limited computer skills in designing customized pipelines and using them to analyze NGS data. The Helper platform was developed during the PhD course carried out at the Center for Cardiovascular Genetic Diseases of the San Matteo Hospital in Pavia. 
Helper was created to simplify the design and adaptation of bioinformatics pipelines for the analysis of NGS data derived from targeted sequencing applications. It provides a simple graphical interface intended to make developing bioinformatics analytical processes accessible even to professionals without coding knowledge. Helper allows the user to select: the steps to carry out (or to skip) in the analysis workflow; the tools and software to use in each selected step; and the arguments with which to configure each tool. Helper also allows the designed pipelines to be run to analyze NGS data, and it can be modified according to the sequencing experiment from which the samples derive and to the organization of the samples. Helper can be used on a workstation or on an ordinary PC, and it remains compatible with the analysis times of genetics laboratories even on hardware with low computational capacity. In the genetic analysis workflow, Helper belongs to the stage that translates raw NGS data into a set of variants useful for interpreting the genetic test. Finally, the thesis addresses two fundamental questions for genetic diagnosis. The first concerns the complex issue of classifying the variants identified by bioinformatics analysis. Variant classification is difficult because uniform and robust rules shared by all genes are hard to find. In this thesis, a classification system is proposed for variants of the DES gene that takes into account the specific characteristics of the gene encoding the desmin protein. The second question concerns the identification of the genes responsible for a specific phenotype, which is necessary for optimizing the diagnostic test and for patient management. 
In this context, hereditary breast and ovarian tumors are investigated by studying the results of the analysis of the genetic database developed at San Matteo to monitor the genetic basis of oncological diseases.

NGS technologies have revolutionized the world of genetics and medicine, strongly influencing the diagnosis of hereditary diseases. The large number of applications, in both diagnostics and research, has created the need to adapt the analysis of the data produced by these technologies in order to optimize the response to specific problems. The analysis process is implemented through consecutive transformations of the genetic data (a pipeline) using a large number of bioinformatics tools and software. The performance of the different tools often depends on the type of input data, and integrating software suited to the different data types has become a critical step for the quality of the information produced. Moreover, using the tools, configuring them, designing robust pipelines, and developing new analysis solutions are complex processes that require coding skills and knowledge of the broad bioinformatics landscape. Laboratories without staff specialized in bioinformatics applications may find it difficult to optimize the analytical workflow, which is often entrusted to commercial software that applies the same rules and systems to all genes indiscriminately. The Helper platform was developed during the PhD research program carried out at the Center for Cardiovascular Genetic Diseases of the San Matteo Hospital in Pavia. Helper was created for the simplified design and adaptation of bioinformatics pipelines dedicated to the analysis of NGS data derived from targeted sequencing applications. Helper has a simple graphical interface aimed at making the development of bioinformatics analytical processes easier even for those without particular knowledge of code development. 
With Helper, users can choose which steps to perform in the analysis workflow and which to skip, which tools and software to use in each selected step, and which arguments to set for the tools used. Helper also makes it possible to run the designed pipelines and carry out the analysis of NGS data, and it can be modified according to the sequencing experiment from which the samples derive and according to the type and organization of the samples. Helper can be used on a workstation or on an ordinary PC, proving compatible with the analysis times of genetics laboratories even with low computational capacity. In the genetic analysis workflow, Helper is dedicated to what is defined as secondary analysis, which transforms raw NGS data into a set of variants useful for interpreting the genetic test. The thesis work also set out to introduce two fundamental issues for genetic diagnosis. The first is the problem of classifying the pathogenicity of the variants identified by bioinformatics analysis. Variant classification is a delicate process because of the difficulty of finding uniform and robust rules that apply to all gene defects. This thesis proposes a classification system for variants of the DES gene that takes into account the specific characteristics of the gene encoding the desmin protein. The second is the identification of the genes responsible for a given phenotype, which is necessary for optimizing the diagnostic test and for managing patients. In this context, the problem of hereditary breast and ovarian tumors is explored in depth by studying the results of the analysis of the genetic database developed at San Matteo for monitoring the genetic causes of oncological diseases.
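
The three choices the abstract describes (which steps to run or skip, which tool to use in each step, and which arguments to pass) can be pictured as a small data structure. The following is only an illustrative sketch and does not reproduce Helper's actual implementation, which the abstract does not detail; the class `Step`, the function `build_commands`, and the tool names and flags (`aligner_tool`, `dedup_tool`, `caller_tool`, `--min-depth`, and so on) are hypothetical placeholders, not real programs or options.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One stage of a hypothetical pipeline: a name, a tool, and its arguments."""
    name: str
    tool: str                                  # placeholder tool name, not a real program
    args: dict = field(default_factory=dict)   # flag -> value; "{sample}" is filled in later
    enabled: bool = True                       # a disabled step is skipped, not deleted


def build_commands(steps, sample):
    """Return one command (a list of tokens) for each enabled step, in order."""
    commands = []
    for step in steps:
        if not step.enabled:
            continue
        tokens = [step.tool]
        for flag, value in step.args.items():
            tokens += [flag, str(value).format(sample=sample)]
        commands.append(tokens)
    return commands


# Hypothetical three-step pipeline in which the middle step is switched off.
pipeline = [
    Step("alignment", "aligner_tool", {"--threads": 4, "--input": "{sample}.fastq"}),
    Step("duplicate_marking", "dedup_tool", {"--input": "{sample}.bam"}, enabled=False),
    Step("variant_calling", "caller_tool", {"--input": "{sample}.bam", "--min-depth": 20}),
]

for command in build_commands(pipeline, sample="sample01"):
    print(" ".join(command))
```

Keeping a skipped step in the list with `enabled=False`, rather than deleting it, is one possible design choice: the same pipeline definition can then be adjusted for a different sequencing experiment by toggling steps or editing arguments without rewriting it.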