Electronic health care data are currently stored in structures that are designed to preserve information consistently but do not make it easily accessible for further applications. Moreover, the collection of these data is often fragmented according to the department of origin. As a consequence, a comprehensive overview of the data is lacking, even though it could itself be an unexplored source of information. This thesis presents the application of i2b2 as a platform for aggregating data from various hospital databases and, at the same time, for making them available to research applications. Two techniques are presented to exploit the data stored in the i2b2 data warehouse. First, an ontology-based Natural Language Processing (NLP) pipeline is used to extract information from pathological anatomy reports; second, a CareFlow Mining (CFM) algorithm is used to embed the information extracted by NLP in the discovery of patterns of care. This work was carried out in collaboration with the Oncology ward of the Hospital ASST Papa Giovanni XXIII in Bergamo. The i2b2 platform was installed within the hospital in collaboration with Biomeris s.r.l., a spin-off company of the University of Pavia. i2b2 was populated with data taken from different databases, and a novel set of software solutions was implemented to expand the i2b2 functionalities and to support oncology research. The i2b2 population procedure is repeated automatically every week so that the information stays up to date. This procedure rests on a careful study and implementation of ETL (Extract, Transform, Load) procedures, using the Mirth Connect tool to manipulate hospital data via SQL queries. Alongside the main i2b2 project, which includes all patients in the hospital, a second, vertical Oncology project was developed for oncology patients only.
This vertical project imports all the oncology patients' data available in the main i2b2 project. The main additional source for the vertical solution was the internal database of the oncology ward, Oncosys. In addition, new data originating from NLP were included. Furthermore, a study was carried out for the implementation of an i2b2 plugin dedicated to exploring the care patterns of cancer patients. A previously implemented NLP pipeline was adapted to extract information from the pathological anatomy reports, based on an ontology of breast cancer. Taking advantage of the i2b2 database augmented by NLP, it was also possible to run a careflow mining (CFM) algorithm, which highlights the temporal relationships between events and identifies different clinical patterns in patients. An i2b2 plugin was designed to embed CFM as a tool for clinicians and researchers during data exploration. The work responds to the need of clinicians for easy access to the history of patient treatments and to the need of researchers to define cohorts of patients to support cancer research.
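As a rough illustration of the weekly ETL step described in the abstract, the sketch below extracts rows with an SQL query, normalises them, and loads them so that a repeated run does not duplicate data. It is only a sketch under stated assumptions: Python's `sqlite3` stands in for the hospital databases and the warehouse, Mirth Connect is not used, and the table and column names (`lab_exam`, `fact`, `patient_id`, `exam_code`, `exam_date`) are hypothetical and do not reproduce the real i2b2 schema.

```python
import sqlite3


def weekly_load(source: sqlite3.Connection, warehouse: sqlite3.Connection) -> int:
    """Copy rows from a source system into a simplified fact table.

    Returns the number of rows actually inserted by this run.
    """
    # Simplified stand-in for a warehouse fact table (hypothetical schema).
    warehouse.execute(
        """
        CREATE TABLE IF NOT EXISTS fact (
            patient_num INTEGER NOT NULL,
            concept_cd  TEXT    NOT NULL,
            start_date  TEXT    NOT NULL,
            PRIMARY KEY (patient_num, concept_cd, start_date)
        )
        """
    )

    # Extract: read the raw rows with a plain SQL query.
    rows = source.execute(
        "SELECT patient_id, exam_code, exam_date FROM lab_exam"
    ).fetchall()

    # Transform: normalise local codes and add a source prefix.
    facts = [
        (patient_id, f"LAB:{exam_code.strip().upper()}", exam_date)
        for patient_id, exam_code, exam_date in rows
    ]

    # Load: INSERT OR IGNORE skips rows already present, so a rerun is harmless.
    before = warehouse.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
    warehouse.executemany("INSERT OR IGNORE INTO fact VALUES (?, ?, ?)", facts)
    warehouse.commit()
    after = warehouse.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
    return after - before
```

Making the load idempotent through a primary key is one simple way to support a job that is repeated every week; the procedure actually used in the thesis may handle updates differently.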
Nowadays, electronic hospital data are kept in data structures that are often organized to preserve information but do not make it easily usable for further applications. Moreover, within these data there is a sharp division of information, often based on the department of origin. An integrated view of the data is therefore missing, one that could prove to be an unexplored source of information. This thesis presents the application of i2b2 as a data platform suited to aggregating data from the various hospital databases and, at the same time, to their potential use in research applications. Two techniques are also presented for exploiting the information in the data previously stored in the i2b2 data warehouse. In particular, the thesis shows an ontology-based Natural Language Processing (NLP) technique for extracting information from pathological anatomy reports, and the use of a Care Flow Mining (CFM) method to make the information derived from patients' patterns of care accessible. This work is based on the collaboration with the Oncology ward of the ASST Papa Giovanni XXIII hospital in Bergamo, where the i2b2 platform was installed. The thesis activity presented here arises from several collaborations. Together with Biomeris, a spin-off of the University of Pavia, the installation of the i2b2 database within the hospital, its subsequent population with data drawn from the information systems, and the later studies to expand the functionalities useful for oncology research were managed. The loading procedure is repeated automatically every week to keep the information constantly updated, and it is the result of a careful study and implementation of ETL (Extract, Transform, Load) procedures using the Mirth Connect tool to manipulate hospital data via SQL queries.
In parallel with the completion of the main database population, a project based entirely on oncology patients was started, made possible by the flexibility of the i2b2 structure. While keeping all the data already imported from the various hospital databases for these patients, the project was enriched with data from the internal database of the oncology ward, Oncosys. In addition to this information, the input of new data originating from NLP was tested on this project, and a project was developed for the implementation of an i2b2 plugin dedicated to exploring the care patterns of oncology patients. A previous NLP pipeline was adapted to extract information from Pathological Anatomy reports following a taxonomy created ad hoc for the breast cancer case. The opportunity also arose to develop an i2b2 plugin using an existing Careflow Mining algorithm, derived from the study and approach typical of process mining. This method creates an acyclic graph that groups the most frequent patterns of events undergone by patients. By exploiting the possibility of interacting with the i2b2 database, log files are generated from which the temporal relationships between events can be highlighted and different clinical patterns can be identified in patients who go through different events. The work responds to the need of clinicians for easier access to the history of patient treatments and to the need of researchers to define patient cohorts in order to deepen oncology research.
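The idea of grouping the most frequent event patterns into an acyclic graph can be illustrated with a small prefix-tree sketch. This is not the CFM algorithm used in the thesis, only a simplified stand-in: the function name `mine_careflows`, the `(patient, timestamp, event)` log layout, and the `min_support` threshold are assumptions made for the example. Each patient's events are ordered in time, and every prefix of a patient history shared by at least `min_support` patients is kept as a node.

```python
from collections import defaultdict


def mine_careflows(event_log, min_support):
    """Return frequent event-sequence prefixes and their patient counts.

    event_log   -- iterable of (patient_id, iso_timestamp, event_name)
    min_support -- minimum number of patients that must share a prefix
    """
    # Group events by patient; ISO timestamps sort chronologically as strings.
    histories = defaultdict(list)
    for patient, timestamp, event in event_log:
        histories[patient].append((timestamp, event))
    sequences = [
        tuple(event for _, event in sorted(events))
        for events in histories.values()
    ]

    # Count how many patients share each prefix of their history.
    support = defaultdict(int)
    for sequence in sequences:
        for length in range(1, len(sequence) + 1):
            support[sequence[:length]] += 1

    # A prefix is never more frequent than its parent prefix, so the kept
    # prefixes form a tree (an acyclic graph) rooted at the first events.
    return {prefix: n for prefix, n in support.items() if n >= min_support}


if __name__ == "__main__":
    log = [
        (1, "2019-02-01", "chemotherapy"),
        (1, "2019-01-10", "surgery"),
        (2, "2019-03-05", "surgery"),
        (2, "2019-04-01", "radiotherapy"),
        (3, "2019-01-20", "surgery"),
        (3, "2019-02-15", "chemotherapy"),
    ]
    for prefix, count in sorted(mine_careflows(log, min_support=2).items()):
        print(" -> ".join(prefix), count)
```

In this toy log, all three patients start with surgery and two of them continue with chemotherapy, so only those two paths survive a threshold of 2; the radiotherapy branch is followed by a single patient and is dropped.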
An IT infrastructure to support oncological research empowered by NLP and temporal data mining
CHIUDINELLI, LORENZO
2020-01-30
File | Description | Size | Format
---|---|---|---
Tesi Chiudinelli.pdf (Open Access from 11/08/2021) | PhD thesis | 5.18 MB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.