Discovery of coding regions in the human genome by integrated proteogenomics analysis workflow


Development of full pI range HiRIEF LC-MS

To increase the potential for discovery of unannotated coding regions in a proteogenomics experiment, in-depth analysis of the proteome is needed. We employed immobilized pH gradient (IPG) strips that cover the full peptide pI range to enable the exploration of the entire tryptic peptidome (Fig. 1a). As a model system, we used the A431 cell line (TMT10-plex labeled peptides of a time course treatment with an EGFR inhibitor, Supplementary Figure 1). The previously established1 pI range 3.7–4.9 allowed the identification of 93,318 unique peptides when searching against the canonical database. The newly employed pI range 3–10 added 75,825 unique peptides to the results, and further addition of pI ranges 6–9 and 6–11, an additional 34,497 unique peptides (Supplementary Figure 2). Since the combination of pI ranges 3.7–4.9 and 3–10 achieved the bulk of the analytical depth, we applied only these two strips in a subsequent analysis of the normal tissue samples (Supplementary Figure 1). A total of 203,640 and 169,471 peptides were identified (peptide level FDR 1%) from the A431 and normal tissues experiments, respectively, corresponding to 10,166 and 10,889 genes (protein-level FDR 1%) (Table 1). Combining peptides identified from the two experiments, we covered half of the fully tryptic human peptides existing in the PeptideAtlas repository (release 2017–01) and added 21,619 human peptides not present in that repository (Fig. 1b).

Fig. 1
Fig. 1

Full pI range (3–10) HiRIEF provides broad peptidome and proteome coverage. a The top panel shows the comparison between experimental and theoretical pI distributions of TMT-labeled peptides from the A431 cell line data set. The six major peaks in the theoretical pI distribution represent groups of peptides with characteristic amino acid compositions. For example, peptides with a higher number of Asp (D) and Glu (E) residues than the total number of Lys (K), Arg (R), and His (H) will have a pI between 3.5 and 5. The middle panel shows the accuracy of pI prediction by the PredpI algorithm1 across the full pI range. The bottom panel shows the experimental pI ranges of the four IPG strips employed in this study. Nominal pH ranges are indicated on the left side with actual pH ranges next to the bars. See Supplementary Figures 4, 6, and 7 for pI fraction resolution, reproducibility and yield. b Overlap of identified fully tryptic human peptides (at protein level FDR 1%) between the A431 cells data set, the normal tissues data set and the public peptide repository PeptideAtlas (release 2017-01)

Table 1 Identification statistics from the standard proteome search

To benchmark our data, we compared these results to those from corresponding tissues (placenta, kidney, liver, tonsil, and testis) in the Wilhelm et al. draft proteome work21 by re-analyzing their raw data with the same search pipeline (MSGF+, Percolator, with peptide and protein-level FDR 1%) as employed for our data set. We identified 56% more peptides (169,471 vs. 108,402) and 24% more genes with protein products (10,889 vs. 8796), and consequently, protein identifications were generally backed up by larger numbers of unique peptides (i.e., 90 vs. 75% of protein identifications supported by at least 2 peptides) (Supplementary Figure 3, which includes MS run time comparison). Additionally, we performed a simple test and verified that not a single peptide from olfactory receptor proteins could be found in either of our data sets22.

The obtained experimental peptide pI data permitted the upgrade of the PredpI algorithm1 with updated pK constants (available in Supplementary Data 1) now valid for the full pI range. Applying PredpI prediction to peptide sequences from an in silico tryptic digestion of the Ensembl75 human protein database enabled the alignment of the theoretical tryptic peptide distribution over the full pI range with the experimental one (Fig. 1a). The low experimental yield of peptides with pI ~8.5 could perhaps be explained by polyacrylamide gel instability in the most alkaline range23. For the experimental distribution, only tightly focused peptides (i.e., present in only one or two consecutive fractions, on average 80% of all peptides in the used IPG ranges, Supplementary Figure 4) were considered. This tryptic-peptide pI distribution appears to be ubiquitous to all organisms (Supplementary Figure 5) and based on the acid-base chemistry of amino acid residues and the similar distribution of amino acid frequencies in all species24. Peptide fractionation was reproducible across samples and IPG ranges (Supplementary Figure 6) and the unique peptide yield of pI fractions was characteristic to each IPG strip used (Supplementary Figure 7).

A three-stage proteogenomics workflow

To tackle the high rate of false discoveries in proteogenomics findings while taking into account the recent recommendations in the field3, 19, we developed IPAW, a proteogenomics workflow to identify and curate novel peptides from undiscovered protein-coding loci as well as variant peptides coded by nsSNPs and mutations (Fig. 2). The workflow contains three major stages: discovery, curation, and validation. To illustrate the discovery stage, we carried out two types of proteogenomics searches, the VarDB concatenated database search (a customized database search strategy applicable for any high-resolution LC-MS/MS data) (type 1 search in Fig. 2) and the 6FT search, which uses peptide pI values to restrict database size (applicable if pI-based fractionation is used)1, 25 (type 2 search in Fig. 2). It is also possible to use any customized databases generated by external tools14,15,16,17. All MS/MS spectra were searched by MS-GF+26, and post processed with Percolator27 under the Galaxy platform using a separate target-decoy strategy. The human VarDB consists of peptides originating from previously annotated nsSNPs, somatic mutations, pseudogenes, and long non-coding RNAs, forming a supplementary set of peptides to the canonical proteome. Customized database searches have the advantage over 6FT searches in that they allow detection of splice-junction-spanning novel peptides and have a relatively smaller search space, while the 6FT searches allow unbiased screening of protein-coding sequences in the whole genome. The discovery stage outputs a list of candidate novel and SAAV peptides with 1% class-specific FDR3.

Fig. 2
Fig. 2

A proteogenomics workflow to discover, curate, and validate novel and SAAV peptides. The pipeline consists of three major stages: discovery, curation, and validation. The discovery stage is performed with MS-GF+ using two database strategies. Type 1 search was performed against a single database consisting of known peptides concatenated with variant peptides. Type 2 search is enabled by HiRIEF peptide fractionation and was performed against pI-restricted databases of tryptic peptides generated from a six-frame translation (6FT) of the human genome. The discovery stage outputs 1% class-specific FDR for novel and SAAV peptides. In the curation stage, candidate SAAV peptides are curated by SpectrumAI. The novelty of candidate novel peptides from the discovery stage is ensured by BLASTP analysis against known protein databases including Uniprot reference proteome (with isoforms), Ensembl human protein database v83, RefSeq and Gencode v24, and the subset of novel peptides with single amino acid substitution are also curated by SpectrumAI. In the validation stage, quality control plots such as delta pI, precursor mass error, and search engine score distribution are made. In addition, curated novel peptides are evaluated for orthogonal data support in, e.g., RNA-seq data, ribosome profiling and CAGE data, conservation and coding potential prediction

At the curation stage, some of the candidate novel peptides are removed. First, peptides are processed through BLASTP28 to remove exact matches to the known protein database, which combined known human proteins from the Uniprot reference proteome (UP000005640), Ensembl 83, Refseq, and Gencode v24. Thereafter, the subset of novel and SAAV peptides with single amino acid changes compared to known peptides are curated by a tool, SpectrumAI, which only keeps those peptides whose MS2 spectrum contains product ions directly flanking both sides of the substituted residue. We then use BLAT29 to remove peptides mapping to multiple locations in the genome. Even though these peptides are as experimentally valid as other peptides, we decided not to prioritize them because of the increased difficulty in assigning orthogonal support and in drawing biological conclusions from proteins with uncertain genomic locations.

In the validation stage, curated novel peptides are assessed by distribution plots of intrinsic properties such as delta pI (pI difference between experimental and predicted pI values), precursor mass error, and match score. Finally, the curated novel peptides are examined in the context of their genomic positions by cross validation against multiple levels of orthogonal data. These levels include: (i) RNA-seq acquired from the same samples, (ii) public domain Ribo-seq for support of alternative translation initiation sites (TIS)30, 31, (iii) public domain CAGE data for evidence of transcription start sites (TSS)32, (iv) vertebrate conservation (PhyloP)33, (v) codon substitution frequency (PhyloCSF)34, which predicts coding potential of DNA sequences based on codon composition and substitution frequency by comparative genomic analysis, and (vi) MS data from the two draft proteome publications21, 35. It should be noted that the validation stage does not use any of these orthogonal data as filtering criteria to discard proteogenomic findings, but only provides a means for users to rank candidate novel peptides based on orthogonal support.

SpectrumAI verifies single amino acid substitutions

Novel and SAAV peptides were output by 1% class-specific FDR3, which is essential but insufficient to guarantee a de facto 1% false discovery rate. In practice, unexpected isobaric modifications or amino acids reshuffles in known peptides could be mistakenly identified as single-substitution peptides, since the identification of peptide sequence is a pattern matching process, where the search engine only guarantees the best sequence match available in the database. Due to this uncertainty, novel peptides with single substitution were discarded in a recent proteogenomics study20. Nonetheless, manual inspection of MS/MS spectra should allow for distinction between correct and incorrect assignments. However, this is a laborious process and therefore inapplicable to large data sets. A previously published tool, CAMV36, allows computer-aided manual inspection on peptide spectra matches. Although it reduces workload of manual validation, it still needs a mass spectrometry expert to make the final call to accept or reject. CAMV is more suitable to validate a limited list of peptides. For peptides with single amino acid changes, we developed a tool, SpectrumAI, which automatically eliminates incorrect peptides by inspecting MS/MS spectra for flanking product ions supporting the amino acid substitution. We demonstrated the ability of SpectrumAI to eliminate incorrect single substitution peptide identifications via (i) analysis of the distribution of precursor mass error, (ii) orthogonal evidence at DNA and RNA level, and (iii) validation using the corresponding synthetic peptides.

In the A431 data set, 4492 SAAV peptides were identified at 1% class-specific FDR. Of these, 1332 have MS/MS ions flanking the substitution and thus passed SpectrumAI curation (Supplementary Data 2). Comparing the distribution of precursor mass errors, curated peptides showed a distribution similar to that of identified known peptides, whereas discarded peptides did not (Fig. 3a). Importantly, our observation that search engine score (SpecEvalue) distributions of curated peptides and discarded peptides were similar and indicates the need for independent evidence in addition to strict FDR control, which itself depends primarily on search engine score (Supplementary Figure 8). Further, we validated the SAAV peptides using genomics and transcriptomics data. About 44% of peptides passing SpectrumAI curation were supported either at RNA or DNA level, whereas only 3% of discarded peptides had such support (Fig. 3b). We compared the frequency of amino acid changes of the 735 curated SAAVs lacking orthogonal support with that of the 426 supported both at DNA and RNA level, and observed that certain types of amino acid changes (the top one being glutamine to glutamic acid) were over-represented (Supplementary Figure 9). A plausible explanation could be that some curated SAAV peptides are chemical artifacts, arising by, e.g., deamidation.

Fig. 3
Fig. 3

SpectrumAI increases identification accuracy of peptides with single amino acid changes. a Precursor mass error distributions of peptides classified as curated and discarded by SpectrumAI. b Curated SAAV peptides have more overlap with missense variants identified at DNA and RNA level. c Mirror plot of an incorrectly identified peptide (that yet had passed discovery stage with class-specific FDR 1%) with a single residue substitution (V > L, at position 8) that was subsequently discarded by SpectrumAI. Annotated MS2 spectrum of the endogenous peptide is shown on top, whereas that of the respective synthetic peptide is inverted and shown on bottom. This incorrect peptide identification detected by SpectrumAI shows mismatching b6 and b7 product ions (highlighted in the synthetic side and missing in the endogenous side), which ought to have flanked the substituted residue, indicating that the endogenous amino acid sequence is incorrect between its sixth and eighth residues

Finally, we selected 30 peptides with single amino acid substitution from the discovery stage (of which 19 passed SpectrumAI curation, whereas 11 did not), and purchased the corresponding synthetic peptides for validation. Manual inspection of the mirror plots (Fig. 3c and Supplementary Data 3, pp. 1–30) confirmed the 19 SpectrumAI curated peptides to be correct, and the 11 discarded ones to be incorrect.

Novel peptides from unannotated protein-coding loci

IPAW was applied to discover unannotated protein-coding loci in the human genome using both the A431 cells data set and the normal tissues data set. The combined result of 6FT and VarDB searches (Supplementary Figure 10), at the discovery stage, yielded 710 and 295 novel peptides (class-specific FDR 1%) in A431 cells and normal tissues, respectively, of which 426 and 155 passed the curation stage (Supplementary Figure 11 and Supplementary Data 4 and 5). Precursor mass error, SpecEvalue and delta pI distributions of these curated peptides were assessed in comparison to those of known canonical peptides (Supplementary Figures 1214). Novel peptides within genomic proximity of 10 kb were grouped and considered to belong to the same locus. Thus, we identified 374 and 140 unknown protein-coding genomic loci in A431 cells and normal tissues, respectively, of which 42 and 13 were supported by two or more peptides (Fig. 4a and Supplementary Figure 15a). Novel peptides were categorized into eight different classes: pseudogene, 5′ un-translated region (UTR), intronic, AltORF, ncRNA, exon extension, intergenic, and 3′ UTR (Fig. 4b and Supplementary Figure 15b) and were found in all chromosomes across the genome (Fig. 4c and Supplementary Figure 15c).

Fig. 4
Fig. 4

Unannotated protein-coding loci found in the A431 cells data set. a The left pie chart shows the number of unannotated protein-coding loci supported by one, two, or more peptides (peptides within 10 kb distance were grouped into one locus); the right pie chart shows the different types of unannotated coding events supported by multiple peptides. b Automatic categorization of novel peptides by annovar using RefSeq gene annotation (see “Methods”). c Manhattan plot of novel peptides, where the y-axis represents the peptide’s posterior error probability (PEP). d Orthogonal data support for novel peptides, including PhyloCSF coding potential, conservation analysis, A431 cell line RNA-seq reads evidence, ribosome profiling, CAGE (up to 500 bp upstream from peptide location), presence of neighboring peptides (within 10 kb), and whether the peptide was identified in the draft proteome data of Kim et al.35 and Wilhelm et al.21. Continuous variables were discretized to binary values 0 or 1 for visualization purposes. 10,000 random genomic loci were used to determine the threshold to call if Ribo-seq or CAGE data were supportive or not (see Supplementary Figure 20). e The conservation score (PhastCons68 score) distribution of pseudogenes and lncRNAs for which peptides were found was compared to that of 1000 randomly selected pseudogenes and lncRNAs. In the box plots, center line corresponds to median, box boundaries correspond to the first and third quartiles (Q1 and Q3), the upper whisker is min(max(x), Q3 + 1.5 × IQR) and lower whisker is max(min(x), Q1–1.5 × IQR)

The largest group of novel peptides identified belonged to pseudogenes, which, given the previous observation of thousands of pseudogenes found expressed at RNA level in cancer37, is maybe to be expected, particularly in cancer cell lines such as A431. There were 30 novel peptides identified from sORFs (the majority being AltORFs) in known coding transcripts after examination of the genomic context. A mass spectrometry study performed by Slavoff et al. identified 90 sORF-encoded peptides in human cells11. Four peptides were common between our list and their study despite the fact that they used a different cell line (human leukemia cell line K562) and workflow. In our present study, regarding protein N-terminal extensions, 45 out of 50 cases used non-AUG near-cognate translation start codons, the majority (30 cases) of which was within a strong Kozak motif (Supplementary Data 4). A non-AUG TIS within a strong (or occasionally moderate) but not optimal Kozak sequence may be characteristic of protein N-terminal extensions. The usage of non-AUG start codons is corroborated by the work of Fritsch et al., a ribosomal footprinting-based study, which reported that only 1% of protein N-terminal extensions used canonical AUG start codons30. Novel peptides belonging to AltORFs and protein N-terminal extensions showed stronger support from orthogonal data compared to novel peptides from other groups (Fig. 4d and Supplementary Figure 15d). Although most peptides mapping to pseudogenes in the current study lacked RNA-seq read coverage, external evidence from ribosome profiling and CAGE was more supportive. Moreover, conservation scores of pseudogenes/non-coding RNAs with novel peptides identified were significantly higher than those of 1000 randomly selected pseudogenes/non-coding RNAs (Fig. 4e).

Since signal peptides are usually located in the N termini of proteins, we tested the hypothesis of whether an N-terminal extension could alter protein subcellular localization38 by using the TargetP algorithm (v1.1)39 to predict subcellular localization. Three of the discovered protein N-terminal extensions, in genes CDK16 (identified by 2 peptides), NPLOC4 (2 peptides), and THOP1 (2 peptides), showed a high likelihood of being targeted to the mitochondria while the respective canonical proteins have cytosolic or nuclear location (Supplementary Data 6). This result suggests that, for certain genes, the existence of an upstream TIS in the same reading frame as the canonical AUG start site may be used to submit the subcellular location of the protein products to control at the translation level.Since the references were not cited in numerical order, they have been renumbered in the order of appearance. Please check.I have checked that all the references are cited in the appropriate place in the text.

Other types of events found include protein products of exon extensions and intron retentions. Interesting examples of the latter are the translations of alternatively spliced (or miss-spliced) mRNAs from EGFR (identified by 2 peptides) and AP1S1 (2 peptides). The resulting polypeptide products contain part of the canonical N terminus followed by a completely new intron-derived C-terminal sequence, 181 and 42 amino acids long for EGFR and AP1S1, respectively. It is unclear whether such observations are errors of splicing machinery or a functional cellular response to stress conditions.

Finally, to further validate the novel peptides, 100 synthetic peptides in addition to the 30 mentioned above were purchased and analyzed. Out of these 100 synthetic peptides, 2 failed to generate good fragmentation spectra, 7 demonstrated their endogenous counterparts to be incorrect identifications upon manual inspection, and the remaining 91 novel peptides were successfully validated (Supplementary Data 3, pp. 31–128). The significant number (7 out of 98) of incorrect novel peptides highlights the need for further efforts into curation/validation of findings in the proteogenomics field.

For submission to UniProt, 40 unannotated coding loci with multiple peptides that could be linked to in-frame initiation start codons were included (Supplementary Data 7, with selected examples showing each event type in Fig. 5).

Fig. 5
Fig. 5

Examples of unannotated protein-coding regions discovered. Gray lines indicate introns, black thick lines are UTRs, colored boxes are coding regions (color indicates reading frame). Novel peptides are shown as red boxes unless they are in different reading frames. a Pseudogene TATDN2P1 protein identified with two novel peptides linked in the same open reading frame. b LncRNA ENSG00000267943 protein identified with four novel peptides. c An alternative reading frame protein of the DRAP1 gene was identified with four novel peptides. The color of exons and novel peptides indicates reading frame. Exons and peptides in same colors (darker shade for peptides) are in the same reading frame. d Alternative protein N terminus for gene C1orf122 was identified with two novel peptides. e Two novel peptides serving as evidence for the existence of “retained intron” translation for the EGFR gene. f Extended exon protein variant of gene MPRIP was identified with three novel peptides

TMT-based quantitative proteogenomics analysis

Tissue specificity or regulation upon stimuli can hint at functional roles of the discovered proteins. Some pseudogene proteins (e.g., those with GAPDH gene ancestry) showed broad tissue expression similarly to that of their parental gene (Supplementary Data 8). Others showed protein expression prevalent to particular tissues. For example, pseudogene proteins TATDN2P1 (identified and quantified by 2 peptides, 3 PSMs) and UBE2L5P (2 peptides, 6 PSMs) were specific to testis, whereas their parental gene proteins UBE2L3 and TATDN2 were broadly expressed on all measured tissues (Fig. 6 and Supplementary Data 8), suggesting independent functions for these two “pseudogenes” in testis. RNA-seq reads for both these pseudogenes support their testis specificity (Fig. 6d and Supplementary Data 5). Additionally, several novel peptides with placenta specificity were found from two lncRNAs, i.e., TINCR (also known as PLAC2 – placenta specific 2, which was identified by 1 peptide, 1 PSM) and CTD-2620I22.3 (ENSG00000267943, 4 peptides, 5 PSMs) (Figs. 5b and 6a, and Supplementary Data 8), both also showing placenta specificity at transcript level (Supplementary Figure 16). TINCR is thought to regulate differentiation in epidermal tissue, and our present result suggests that such molecular function is not restricted to the RNA product of the gene but may extend also to the protein product.

Fig. 6
Fig. 6

Quantitative analysis of novel peptides identified in the normal tissues data set. a Novel peptide tissue expression. Pearson correlation and complete linkage method was used for clustering. Row Z-scores are shown in the heat map. b TMT-based tissue quantification of the pseudogene TATDN2P1 peptides points to testis specificity. The three dots in the TMT ratio plots indicate quantification of three individual PSMs, with the center bar as the mean and error bars as standard deviation. c TMT-based tissue quantification of TATDN2 peptides indicates broad tissue expression (quantification values from three PSMs). d RNA-seq read counts of TATDN2P1 in different tissues confirms testis specificity

Upon EGFR inhibition of A431 cells, some of the discovered pseudogene proteins, e.g., RBBP4P1 (identified and quantified with 3 peptides, 8 PSMs), followed the same quantitative pattern as the parent gene proteins (Supplementary Data 8). There were also examples of pseudogenes, such as HSPA8P1 (3 peptides, 10 PSMs) and ANAPC1P1 (1 peptide, 2 PSMs, out of only two unique theoretical tryptic peptides differing from the canonical parent ANAPC1), showing very distinct quantitative patterns from their parent genes, suggesting that, in these latter cases, the pseudogenes possess an independent transcriptional control. We investigated whether, in general, protein-level regulation between pseudogenes and the respective parent genes correlated. Out of 242 pseudogene-parent gene pairs (median correlation coefficient 0.26), 20 correlated significantly (test of association, p value <0.01), 16 were positively correlated (mean Pearson correlation 0.88), and 4 were negatively correlated (mean Pearson correlation −0.79) (Supplementary Figure 17).

Some AltORF proteins such as Alt-HNRNPUL2 (identified and quantified with 3 peptides, 7 PSMs) and Alt-DRAP1 (4 peptides, 14 PSMs), resulting from upstream translation were found in the A431 data set to be downregulated 24 h after treatment, whereas their canonical counterparts showed slight upregulation (HNRNPUL2) or no clear regulation (DRAP1) (Fig. 5c and Supplementary Data 8). Similarly, N-terminal extended proteins resulting from in-frame upstream TIS such as those from RAB12 (2 peptides, 2 PSMs) and WDR26 (1 peptide, 4 PSMs, from a UTR region which accommodates only one theoretical tryptic peptide of MS-amenable size) showed downregulation after treatment while their respective canonical proteins remained constant. Translational competition between upstream TIS and canonical TISs has been observed previously, in accordance to the “leaky scanning” model of translation38. Here, we observed four cases of usage of upstream TIS decreased by the EGFR inhibition treatment.

Source link


Please enter your comment!
Please enter your name here