Mining Geometrical Motifs Co-occurrences in the CMS Dataset
Marco Ferretti;Mirto Musci
2018-01-01
Abstract
Precise and efficient retrieval of structural motifs is a task of great interest in proteomics. Geometrical approaches to motif identification allow the retrieval of unknown motifs in unfamiliar proteins that may be missed by widespread topological algorithms. In particular, the Cross Motif Search (CMS) algorithm analyzes pairs of proteins and retrieves every group of secondary structure elements that is similar between the two proteins. These similarities are candidate structural motifs. When extended to large datasets, the exhaustive approach of CMS generates a huge volume of data. Mining the output of CMS means identifying the most significant candidate motifs proposed by the algorithm in order to determine their biological significance. In the literature, effective data mining on a CMS dataset is an unsolved problem. In this paper, we propose a heuristic approach based on what we call protein “co-occurrences” to guide data mining on the CMS dataset. Preliminary results show that the proposed implementation is computationally efficient and selects only a small subset of significant motifs.