Amino acids and their properties are variably distributed in proteins and different compositions determine all protein features, ranging from solubility to stability and functionality. Gini index, a tool to estimate distribution uniformity, is widely used in macroeconomics and has numerous statistical applications. Here, Gini index is used to analyze the distribution of hydrophobicity in proteins and to compare hydrophobicity distribution in globular and intrinsically disordered proteins. Based on the analysis of carefully selected high-quality data sets of proteins extracted from the Protein Data Bank (http://www.rcsb.org) and from the DisProt database (http://www.disprot.org/), it is observed that hydrophobicity is distributed in a more diverse way in intrinsically disordered proteins than in folded and soluble globular proteins. This correlates with the observation that the amino acid composition deviates from the uniformity (estimate with the Shannon and the Gini–Simpson indices) more in intrinsically disordered proteins than in globular and soluble proteins. Although statistical tools tike the Gini index have received little attention in molecular biology, these results show that they allow one to estimate sequence diversity and that they are useful to delineate trends that can hardly be described, otherwise, in simple and concise ways.

Hydrophobicity diversity in globular and nonglobular proteins measured with the gini index

Carugo, Oliviero
2017-01-01

Abstract

Amino acids and their properties are variably distributed in proteins and different compositions determine all protein features, ranging from solubility to stability and functionality. Gini index, a tool to estimate distribution uniformity, is widely used in macroeconomics and has numerous statistical applications. Here, Gini index is used to analyze the distribution of hydrophobicity in proteins and to compare hydrophobicity distribution in globular and intrinsically disordered proteins. Based on the analysis of carefully selected high-quality data sets of proteins extracted from the Protein Data Bank (http://www.rcsb.org) and from the DisProt database (http://www.disprot.org/), it is observed that hydrophobicity is distributed in a more diverse way in intrinsically disordered proteins than in folded and soluble globular proteins. This correlates with the observation that the amino acid composition deviates from the uniformity (estimate with the Shannon and the Gini–Simpson indices) more in intrinsically disordered proteins than in globular and soluble proteins. Although statistical tools tike the Gini index have received little attention in molecular biology, these results show that they allow one to estimate sequence diversity and that they are useful to delineate trends that can hardly be described, otherwise, in simple and concise ways.
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11571/1218594
Citazioni
  • ???jsp.display-item.citation.pmc??? 0
  • Scopus 5
  • ???jsp.display-item.citation.isi??? 4
social impact