The Chromosome-centric Human Proteome Project (C-HPP), an initiative of the Human Proteome Organization (HUPO), seeks to identify all proteins encoded by genes as a bridge between the worlds of genomics and proteomics. However, researchers have not yet identified some predicted gene products. Choong et al. (2015) examined possible informatics issues contributing to this problem through systematic analysis of an “ideal” shotgun proteomics experiment [1]. The researchers conducted their inquiry using in silico modeling in place of real-life mass spectrometric characterization.
The team first determined the extent of the problem, calculating the number of proteins described as missing from analysis of neXtProt, the human protein knowledge database that researchers use for identification following mass spectrometry. They then referred to neXtProt and other databases, including UniProt, to examine peptides arising from in silico digestion using a number of proteases, either alone or in combination. These enzymes included standard mass spectrometric proteomics reagents such as trypsin and Lys-C. The team then examined the in silico digests, taking note of sequence homologies, length and uniqueness, in an effort to understand why predicted gene products should evade identification.
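To make the idea of in silico digestion concrete, here is a minimal sketch in Python. It is not the pipeline Choong et al. used. The cleavage rules are simplified assumptions (trypsin cuts after K or R unless the next residue is P; Lys-C cuts after K), the function names are hypothetical, and the toy sequence is invented. Combining proteases is modeled simply as the union of their cleavage sites.

```python
# Simplified cleavage rules, assumed for illustration only:
# protease -> (residues cleaved after, whether a following proline blocks the cut)
RULES = {
    "trypsin": ("KR", True),
    "lys-c": ("K", False),
}


def cleavage_sites(sequence, protease):
    """Return the set of positions at which the protease would cut."""
    residues, proline_blocks = RULES[protease]
    sites = set()
    for i in range(len(sequence) - 1):
        if sequence[i] in residues:
            if proline_blocks and sequence[i + 1] == "P":
                continue  # cut suppressed by the following proline
            sites.add(i + 1)
    return sites


def digest(sequence, proteases=("trypsin",)):
    """Digest a sequence with one or more proteases (union of cut sites)."""
    sites = set()
    for protease in proteases:
        sites |= cleavage_sites(sequence, protease)
    cuts = [0] + sorted(sites) + [len(sequence)]
    return [sequence[a:b] for a, b in zip(cuts, cuts[1:])]


toy = "MKTAYRALLKPGR"  # invented sequence, not a real protein
print(digest(toy, ("trypsin",)))
print(digest(toy, ("lys-c",)))
print(digest(toy, ("trypsin", "lys-c")))
```

In this toy case each enzyme alone and the two together yield different peptide sets, which is why the choice of protease changes which peptides are available for identification.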
For unambiguous identification, Choong et al. note that proteins should produce at least one unique peptide following digestion, rather than sharing peptide identities with other proteins. Analysis of the protein and peptide knowledge bases showed that 53% of peptides belong to more than one protein. The team noted that unique peptides tended to possess longer sequences than shared peptides.
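The distinction between unique and shared peptides can be sketched as a simple lookup from each peptide to the set of proteins that produce it. The snippet below is an illustration under stated assumptions, not the authors' method: the protein labels (`P1`, `P2`, `P3`), the peptide strings, and the function names are all hypothetical, and the input is assumed to be an already-digested mapping of protein to peptides.

```python
from collections import defaultdict


def classify_peptides(peptides_by_protein):
    """Split peptides into those seen in one protein and those seen in several."""
    owners = defaultdict(set)
    for protein, peptides in peptides_by_protein.items():
        for peptide in peptides:
            owners[peptide].add(protein)
    unique = {pep for pep, prots in owners.items() if len(prots) == 1}
    shared = set(owners) - unique
    return unique, shared


def proteins_without_unique_peptide(peptides_by_protein):
    """Proteins that could only be inferred from shared peptides."""
    unique, _ = classify_peptides(peptides_by_protein)
    return sorted(
        protein
        for protein, peptides in peptides_by_protein.items()
        if not any(pep in unique for pep in peptides)
    )


# Invented toy data: three hypothetical proteins and their digested peptides.
toy_digest = {
    "P1": ["AAK", "LLGR", "TTPEK"],
    "P2": ["AAK", "LLGR"],
    "P3": ["AAK", "WWYYR"],
}

unique, shared = classify_peptides(toy_digest)
print(sorted(unique))
print(sorted(shared))
print(proteins_without_unique_peptide(toy_digest))
```

Here `P2` produces only peptides that other proteins also produce, so any evidence for it would be ambiguous, which is the situation the paragraph above describes.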
The team explored the protein databases further and identified 20,053 proteins in both UniProt and neXtProt for further analysis. Of these predicted proteins, 19,908 had at least one unique peptide for identification. The remaining 145 proteins yielded only shared peptides, and the team classified 77 of them as missing. Furthermore, Choong et al. noted that 58 could be identified only from shared peptides and were therefore ambiguous.
During the analysis of issues that prevent conclusive protein identification through bioinformatics tools, the research team found three problems that contribute to the status of “missing”:
- Lack of unique peptide identifier through sequence similarity
- Protein location and origin
- Genetic sequence similarity
The researchers noted that missing proteins may not be present in commonly assayed tissues or might be expressed only during development or disease, for example. Furthermore, proteins located within cell membranes are traditionally harder to detect, and this could contribute to their missing status. Looking at the genomic data, Choong et al. note that sequence variants and single nucleotide polymorphisms might also hinder conclusive identification.
Identification is undeniably easier when a unique peptide is present. However, the researchers suggest that identification of missing proteins is still possible but requires a combination of techniques: mass spectrometric proteomics evaluation should be used alongside immunological tools, for example. They also suggest that a combination of protease enzymes could yield more coverage for mass spectrometry. Additionally, the team proposed that considering top-down proteomics for intact protein evaluation and including the FASTA protein database for sequence variance interpretation might improve results. Finally, when faced with ambiguity in proteomics results, scientists should also consider 3-D structural assessment and functional assays for more comprehensive data.
The authors suggest that through understanding the limitations inherent in bioinformatics tools, researchers can plan ahead to maximize coverage for conclusive protein identification.
1. Choong, W.-K., et al. (2015) “Informatics view on the challenges of identifying missing proteins from shotgun proteomics,” Journal of Proteome Research, 14(12) (pp. 5396–5407), doi: 10.1021/acs.jproteome.5b00482.
Post Author: Amanda Maxwell. Amanda is a freelance science writer and digital space explorer with a passionate curiosity for science and technology. She enjoys translating complex theories and subjects creatively into everyday language for all audiences. Equipped with a bachelor’s degree in veterinary medicine and a PhD in protein chemistry/small animal critical care nutrition, she brings clinical experience and practical research oversight into her writing.
The post Where’s the Squeeze? Informatics Bottlenecks in Missing Protein Identification appeared first on Accelerating Proteomics.