Annotation process, guidelines and text corpus of small non-coding RNA molecules: The MiNCor for microRNA annotations
Sammartino J. C.;
2016-01-01
Abstract
MicroRNAs are small non-coding molecules that act as post-transcriptional regulators of gene expression in a wide spectrum of biological states. Most information about microRNAs is embedded in unstructured data (text files), which requires specific text mining techniques for its retrieval and analysis. These techniques are generally based on supervised (or semi-supervised) learning methods, which require collections of carefully annotated and categorised training data. In this study we propose a comprehensive, granular protocol for the annotation of non-coding RNA molecules, focusing primarily on microRNA mentions. This protocol was used to construct a manually annotated corpus of microRNA mentions (MiNCor Gold), a large semi-automatically generated silver standard corpus of microRNA mentions (MiNCor Silver), and a large microRNA name dictionary. The efficiency of these standards was then evaluated using a named entity recognition (NER) system, in comparison with another microRNA mention standard freely available online. The NER system trained with our silver corpus performed better, with higher precision (96.67% vs. 94.00%) and recall (97.57% vs. 95.00%) on their test data, and on ours as well (precision 89.26% vs. 88.97%; recall 90.03% vs. 86.74%). The corpora and guidelines are freely downloadable at http://zope.bioinfo.cnio.es/mincor/minacor.tar.gz.
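For readers unfamiliar with the precision and recall figures quoted above, the following is a minimal sketch of how such scores can be computed for a set of predicted mentions against a reference annotation. The abstract does not state the matching criterion used in the study, so exact span matching is an assumption here, and the function name `precision_recall` and the toy mention spans are hypothetical illustrations rather than part of the MiNCor corpora or tooling.

```python
def precision_recall(gold, predicted):
    """Exact-match precision and recall over (doc_id, start, end) mention spans.

    Assumption: a predicted mention counts as correct only if its document
    and character offsets match a gold mention exactly.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: four reference mentions, five system predictions,
# three of which match a reference mention exactly.
gold_mentions = {
    ("doc1", 0, 7), ("doc1", 30, 40), ("doc2", 5, 12), ("doc2", 50, 58),
}
predicted_mentions = {
    ("doc1", 0, 7), ("doc1", 30, 40), ("doc2", 5, 12),
    ("doc2", 70, 75), ("doc3", 1, 4),
}

precision, recall = precision_recall(gold_mentions, predicted_mentions)
print(f"precision={precision:.2%} recall={recall:.2%}")
# prints: precision=60.00% recall=75.00%
```

Precision is the share of predicted mentions that are correct, and recall is the share of reference mentions that were found; a more lenient criterion (for example, overlapping spans) would generally yield higher values than the exact matching assumed in this sketch.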