Information Extraction from Medical Reports in the Italian Language for Clinical Timelines Reconstruction
VIANI, NATALIA
2018-01-26
Abstract
Electronic health records are a valuable source of information for both patient care and biomedical research. Despite the efforts put into collecting structured data, much of this information is available only as free text. For this reason, developing systems that automatically extract relevant information from clinical narratives is essential, as is summarizing all the data related to a single patient. In the field of clinical information extraction, several systems have been developed, especially for the analysis of texts written in English. However, research on non-English languages is still limited.

In this research activity, information extraction techniques and summarization methods were applied to the analysis of medical reports written in Italian. For this language, shared resources for clinical information extraction are not easily available. In this work, a corpus of molecular cardiology reports was used as the main dataset for method development. Moreover, to enable the design and evaluation of different approaches, a subset of this corpus was annotated by manually identifying the information to be extracted from the texts.

To access the knowledge contained in textual medical reports, a first step involves the identification of clinical events. In the natural language processing community, this task is often addressed with supervised methods. In this research activity, two different approaches were used to perform event extraction. First, a simple yet effective approach based on dictionary lookup was used. Second, an application of recurrent neural networks was investigated.

In clinical texts, events are often mentioned together with relevant attributes that have to be extracted to characterize the event itself. In this thesis, an ontology-driven approach was used to identify events’ attributes in the cardiology reports. In particular, a domain-specific ontology was manually developed, including all the relevant events with their associated attributes. A hospital database, which stores most of the information written in the reports, served as the gold standard for the evaluation phase.

To correctly reconstruct patients’ clinical histories, it is also necessary to assign a specific time to each event extracted from the text. To this end, the identification of temporal expressions is a mandatory first step. In this research activity, two existing rule-based systems for temporal information extraction were adapted to the analysis of clinical narratives.

To process each document, the three steps described above (event, attribute, and temporal expression extraction) were combined into a pipeline. For each event and temporal expression identified in the text, the pipeline also extracts a few properties of interest. Among these properties, the temporal relation between each event and the document creation time (DocTimeRel) is computed. On the basis of this relation, each event is further linked to a reference time by applying a set of hand-crafted rules.

Besides processing single medical reports, the system developed in this research activity can summarize multiple documents referring to the same patient. In this case, the information extraction pipeline is first run on all the documents belonging to that patient. Then, the system builds and visualizes a timeline of all the extracted events, exploiting the DocTimeRel information and the event-time links.

As regards the system’s evaluation, the overall information extraction pipeline performed well on the Italian cardiology corpus considered. In addition, the possibility of adapting the attribute extraction step to the analysis of another language was assessed, with promising results.
In a similar way, the developed ontology was adapted to the analysis of another clinical domain, leading to a well-performing system.

File | Size | Format |
---|---|---|
PhdThesisNV_revised_final.pdf (open access; description: doctoral thesis) | 4.51 MB | Adobe PDF |
Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.