Development of full pI range HiRIEF LC-MS
To increase the potential for discovery of unannotated coding regions in a proteogenomics experiment, in-depth analysis of the proteome is needed. We employed immobilized pH gradient (IPG) strips that cover the full peptide pI range to enable the exploration of the entire tryptic peptidome (Fig. 1a). As a model system, we used the A431 cell line (TMT10-plex labeled peptides of a time course treatment with an EGFR inhibitor, Supplementary Figure 1). The previously established1 pI range 3.7–4.9 allowed the identification of 93,318 unique peptides when searching against the canonical database. The newly employed pI range 3–10 added 75,825 unique peptides to the results, and further addition of pI ranges 6–9 and 6–11, an additional 34,497 unique peptides (Supplementary Figure 2). Since the combination of pI ranges 3.7–4.9 and 3–10 achieved the bulk of the analytical depth, we applied only these two strips in a subsequent analysis of the normal tissue samples (Supplementary Figure 1). A total of 203,640 and 169,471 peptides were identified (peptide level FDR 1%) from the A431 and normal tissues experiments, respectively, corresponding to 10,166 and 10,889 genes (protein-level FDR 1%) (Table 1). Combining peptides identified from the two experiments, we covered half of the fully tryptic human peptides existing in the PeptideAtlas repository (release 2017–01) and added 21,619 human peptides not present in that repository (Fig. 1b).
To benchmark our data, we compared these results to those from corresponding tissues (placenta, kidney, liver, tonsil, and testis) in the Wilhelm et al. draft proteome work21 by re-analyzing their raw data with the same search pipeline (MSGF+, Percolator, with peptide and protein-level FDR 1%) as employed for our data set. We identified 56% more peptides (169,471 vs. 108,402) and 24% more genes with protein products (10,889 vs. 8796), and consequently, protein identifications were generally backed up by larger numbers of unique peptides (i.e., 90 vs. 75% of protein identifications supported by at least 2 peptides) (Supplementary Figure 3, which includes MS run time comparison). Additionally, we performed a simple test and verified that not a single peptide from olfactory receptor proteins could be found in either of our data sets22.
The obtained experimental peptide pI data permitted the upgrade of the PredpI algorithm1 with updated pK constants (available in Supplementary Data 1) now valid for the full pI range. Applying PredpI prediction to peptide sequences from an in silico tryptic digestion of the Ensembl75 human protein database enabled the alignment of the theoretical tryptic peptide distribution over the full pI range with the experimental one (Fig. 1a). The low experimental yield of peptides with pI ~8.5 could perhaps be explained by polyacrylamide gel instability in the most alkaline range23. For the experimental distribution, only tightly focused peptides (i.e., present in only one or two consecutive fractions, on average 80% of all peptides in the used IPG ranges, Supplementary Figure 4) were considered. This tryptic-peptide pI distribution appears to be ubiquitous to all organisms (Supplementary Figure 5) and based on the acid-base chemistry of amino acid residues and the similar distribution of amino acid frequencies in all species24. Peptide fractionation was reproducible across samples and IPG ranges (Supplementary Figure 6) and the unique peptide yield of pI fractions was characteristic to each IPG strip used (Supplementary Figure 7).
A three-stage proteogenomics workflow
To tackle the high rate of false discoveries in proteogenomics findings while taking into account the recent recommendations in the field3, 19, we developed IPAW, a proteogenomics workflow to identify and curate novel peptides from undiscovered protein-coding loci as well as variant peptides coded by nsSNPs and mutations (Fig. 2). The workflow contains three major stages: discovery, curation, and validation. To illustrate the discovery stage, we carried out two types of proteogenomics searches, the VarDB concatenated database search (a customized database search strategy applicable for any high-resolution LC-MS/MS data) (type 1 search in Fig. 2) and the 6FT search, which uses peptide pI values to restrict database size (applicable if pI-based fractionation is used)1, 25 (type 2 search in Fig. 2). It is also possible to use any customized databases generated by external tools14,15,16,17. All MS/MS spectra were searched by MS-GF+26, and post processed with Percolator27 under the Galaxy platform using a separate target-decoy strategy. The human VarDB consists of peptides originating from previously annotated nsSNPs, somatic mutations, pseudogenes, and long non-coding RNAs, forming a supplementary set of peptides to the canonical proteome. Customized database searches have the advantage over 6FT searches in that they allow detection of splice-junction-spanning novel peptides and have a relatively smaller search space, while the 6FT searches allow unbiased screening of protein-coding sequences in the whole genome. The discovery stage outputs a list of candidate novel and SAAV peptides with 1% class-specific FDR3.
At the curation stage, some of the candidate novel peptides are removed. First, peptides are processed through BLASTP28 to remove exact matches to the known protein database, which combined known human proteins from the Uniprot reference proteome (UP000005640), Ensembl 83, Refseq, and Gencode v24. Thereafter, the subset of novel and SAAV peptides with single amino acid changes compared to known peptides are curated by a tool, SpectrumAI, which only keeps those peptides whose MS2 spectrum contains product ions directly flanking both sides of the substituted residue. We then use BLAT29 to remove peptides mapping to multiple locations in the genome. Even though these peptides are as experimentally valid as other peptides, we decided not to prioritize them because of the increased difficulty in assigning orthogonal support and in drawing biological conclusions from proteins with uncertain genomic locations.
In the validation stage, curated novel peptides are assessed by distribution plots of intrinsic properties such as delta pI (pI difference between experimental and predicted pI values), precursor mass error, and match score. Finally, the curated novel peptides are examined in the context of their genomic positions by cross validation against multiple levels of orthogonal data. These levels include: (i) RNA-seq acquired from the same samples, (ii) public domain Ribo-seq for support of alternative translation initiation sites (TIS)30, 31, (iii) public domain CAGE data for evidence of transcription start sites (TSS)32, (iv) vertebrate conservation (PhyloP)33, (v) codon substitution frequency (PhyloCSF)34, which predicts coding potential of DNA sequences based on codon composition and substitution frequency by comparative genomic analysis, and (vi) MS data from the two draft proteome publications21, 35. It should be noted that the validation stage does not use any of these orthogonal data as filtering criteria to discard proteogenomic findings, but only provides a means for users to rank candidate novel peptides based on orthogonal support.
SpectrumAI verifies single amino acid substitutions
Novel and SAAV peptides were output by 1% class-specific FDR3, which is essential but insufficient to guarantee a de facto 1% false discovery rate. In practice, unexpected isobaric modifications or amino acids reshuffles in known peptides could be mistakenly identified as single-substitution peptides, since the identification of peptide sequence is a pattern matching process, where the search engine only guarantees the best sequence match available in the database. Due to this uncertainty, novel peptides with single substitution were discarded in a recent proteogenomics study20. Nonetheless, manual inspection of MS/MS spectra should allow for distinction between correct and incorrect assignments. However, this is a laborious process and therefore inapplicable to large data sets. A previously published tool, CAMV36, allows computer-aided manual inspection on peptide spectra matches. Although it reduces workload of manual validation, it still needs a mass spectrometry expert to make the final call to accept or reject. CAMV is more suitable to validate a limited list of peptides. For peptides with single amino acid changes, we developed a tool, SpectrumAI, which automatically eliminates incorrect peptides by inspecting MS/MS spectra for flanking product ions supporting the amino acid substitution. We demonstrated the ability of SpectrumAI to eliminate incorrect single substitution peptide identifications via (i) analysis of the distribution of precursor mass error, (ii) orthogonal evidence at DNA and RNA level, and (iii) validation using the corresponding synthetic peptides.
In the A431 data set, 4492 SAAV peptides were identified at 1% class-specific FDR. Of these, 1332 have MS/MS ions flanking the substitution and thus passed SpectrumAI curation (Supplementary Data 2). Comparing the distribution of precursor mass errors, curated peptides showed a distribution similar to that of identified known peptides, whereas discarded peptides did not (Fig. 3a). Importantly, our observation that search engine score (SpecEvalue) distributions of curated peptides and discarded peptides were similar and indicates the need for independent evidence in addition to strict FDR control, which itself depends primarily on search engine score (Supplementary Figure 8). Further, we validated the SAAV peptides using genomics and transcriptomics data. About 44% of peptides passing SpectrumAI curation were supported either at RNA or DNA level, whereas only 3% of discarded peptides had such support (Fig. 3b). We compared the frequency of amino acid changes of the 735 curated SAAVs lacking orthogonal support with that of the 426 supported both at DNA and RNA level, and observed that certain types of amino acid changes (the top one being glutamine to glutamic acid) were over-represented (Supplementary Figure 9). A plausible explanation could be that some curated SAAV peptides are chemical artifacts, arising by, e.g., deamidation.
Finally, we selected 30 peptides with single amino acid substitution from the discovery stage (of which 19 passed SpectrumAI curation, whereas 11 did not), and purchased the corresponding synthetic peptides for validation. Manual inspection of the mirror plots (Fig. 3c and Supplementary Data 3, pp. 1–30) confirmed the 19 SpectrumAI curated peptides to be correct, and the 11 discarded ones to be incorrect.
Novel peptides from unannotated protein-coding loci
IPAW was applied to discover unannotated protein-coding loci in the human genome using both the A431 cells data set and the normal tissues data set. The combined result of 6FT and VarDB searches (Supplementary Figure 10), at the discovery stage, yielded 710 and 295 novel peptides (class-specific FDR 1%) in A431 cells and normal tissues, respectively, of which 426 and 155 passed the curation stage (Supplementary Figure 11 and Supplementary Data 4 and 5). Precursor mass error, SpecEvalue and delta pI distributions of these curated peptides were assessed in comparison to those of known canonical peptides (Supplementary Figures 12–14). Novel peptides within genomic proximity of 10 kb were grouped and considered to belong to the same locus. Thus, we identified 374 and 140 unknown protein-coding genomic loci in A431 cells and normal tissues, respectively, of which 42 and 13 were supported by two or more peptides (Fig. 4a and Supplementary Figure 15a). Novel peptides were categorized into eight different classes: pseudogene, 5′ un-translated region (UTR), intronic, AltORF, ncRNA, exon extension, intergenic, and 3′ UTR (Fig. 4b and Supplementary Figure 15b) and were found in all chromosomes across the genome (Fig. 4c and Supplementary Figure 15c).
The largest group of novel peptides identified belonged to pseudogenes, which, given the previous observation of thousands of pseudogenes found expressed at RNA level in cancer37, is maybe to be expected, particularly in cancer cell lines such as A431. There were 30 novel peptides identified from sORFs (the majority being AltORFs) in known coding transcripts after examination of the genomic context. A mass spectrometry study performed by Slavoff et al. identified 90 sORF-encoded peptides in human cells11. Four peptides were common between our list and their study despite the fact that they used a different cell line (human leukemia cell line K562) and workflow. In our present study, regarding protein N-terminal extensions, 45 out of 50 cases used non-AUG near-cognate translation start codons, the majority (30 cases) of which was within a strong Kozak motif (Supplementary Data 4). A non-AUG TIS within a strong (or occasionally moderate) but not optimal Kozak sequence may be characteristic of protein N-terminal extensions. The usage of non-AUG start codons is corroborated by the work of Fritsch et al., a ribosomal footprinting-based study, which reported that only 1% of protein N-terminal extensions used canonical AUG start codons30. Novel peptides belonging to AltORFs and protein N-terminal extensions showed stronger support from orthogonal data compared to novel peptides from other groups (Fig. 4d and Supplementary Figure 15d). Although most peptides mapping to pseudogenes in the current study lacked RNA-seq read coverage, external evidence from ribosome profiling and CAGE was more supportive. Moreover, conservation scores of pseudogenes/non-coding RNAs with novel peptides identified were significantly higher than those of 1000 randomly selected pseudogenes/non-coding RNAs (Fig. 4e).
Since signal peptides are usually located in the N termini of proteins, we tested the hypothesis of whether an N-terminal extension could alter protein subcellular localization38 by using the TargetP algorithm (v1.1)39 to predict subcellular localization. Three of the discovered protein N-terminal extensions, in genes CDK16 (identified by 2 peptides), NPLOC4 (2 peptides), and THOP1 (2 peptides), showed a high likelihood of being targeted to the mitochondria while the respective canonical proteins have cytosolic or nuclear location (Supplementary Data 6). This result suggests that, for certain genes, the existence of an upstream TIS in the same reading frame as the canonical AUG start site may be used to submit the subcellular location of the protein products to control at the translation level.Since the references were not cited in numerical order, they have been renumbered in the order of appearance. Please check.I have checked that all the references are cited in the appropriate place in the text.
Other types of events found include protein products of exon extensions and intron retentions. Interesting examples of the latter are the translations of alternatively spliced (or miss-spliced) mRNAs from EGFR (identified by 2 peptides) and AP1S1 (2 peptides). The resulting polypeptide products contain part of the canonical N terminus followed by a completely new intron-derived C-terminal sequence, 181 and 42 amino acids long for EGFR and AP1S1, respectively. It is unclear whether such observations are errors of splicing machinery or a functional cellular response to stress conditions.
Finally, to further validate the novel peptides, 100 synthetic peptides in addition to the 30 mentioned above were purchased and analyzed. Out of these 100 synthetic peptides, 2 failed to generate good fragmentation spectra, 7 demonstrated their endogenous counterparts to be incorrect identifications upon manual inspection, and the remaining 91 novel peptides were successfully validated (Supplementary Data 3, pp. 31–128). The significant number (7 out of 98) of incorrect novel peptides highlights the need for further efforts into curation/validation of findings in the proteogenomics field.
For submission to UniProt, 40 unannotated coding loci with multiple peptides that could be linked to in-frame initiation start codons were included (Supplementary Data 7, with selected examples showing each event type in Fig. 5).
TMT-based quantitative proteogenomics analysis
Tissue specificity or regulation upon stimuli can hint at functional roles of the discovered proteins. Some pseudogene proteins (e.g., those with GAPDH gene ancestry) showed broad tissue expression similarly to that of their parental gene (Supplementary Data 8). Others showed protein expression prevalent to particular tissues. For example, pseudogene proteins TATDN2P1 (identified and quantified by 2 peptides, 3 PSMs) and UBE2L5P (2 peptides, 6 PSMs) were specific to testis, whereas their parental gene proteins UBE2L3 and TATDN2 were broadly expressed on all measured tissues (Fig. 6 and Supplementary Data 8), suggesting independent functions for these two “pseudogenes” in testis. RNA-seq reads for both these pseudogenes support their testis specificity (Fig. 6d and Supplementary Data 5). Additionally, several novel peptides with placenta specificity were found from two lncRNAs, i.e., TINCR (also known as PLAC2 – placenta specific 2, which was identified by 1 peptide, 1 PSM) and CTD-2620I22.3 (ENSG00000267943, 4 peptides, 5 PSMs) (Figs. 5b and 6a, and Supplementary Data 8), both also showing placenta specificity at transcript level (Supplementary Figure 16). TINCR is thought to regulate differentiation in epidermal tissue, and our present result suggests that such molecular function is not restricted to the RNA product of the gene but may extend also to the protein product.
Upon EGFR inhibition of A431 cells, some of the discovered pseudogene proteins, e.g., RBBP4P1 (identified and quantified with 3 peptides, 8 PSMs), followed the same quantitative pattern as the parent gene proteins (Supplementary Data 8). There were also examples of pseudogenes, such as HSPA8P1 (3 peptides, 10 PSMs) and ANAPC1P1 (1 peptide, 2 PSMs, out of only two unique theoretical tryptic peptides differing from the canonical parent ANAPC1), showing very distinct quantitative patterns from their parent genes, suggesting that, in these latter cases, the pseudogenes possess an independent transcriptional control. We investigated whether, in general, protein-level regulation between pseudogenes and the respective parent genes correlated. Out of 242 pseudogene-parent gene pairs (median correlation coefficient 0.26), 20 correlated significantly (test of association, p value <0.01), 16 were positively correlated (mean Pearson correlation 0.88), and 4 were negatively correlated (mean Pearson correlation −0.79) (Supplementary Figure 17).
Some AltORF proteins such as Alt-HNRNPUL2 (identified and quantified with 3 peptides, 7 PSMs) and Alt-DRAP1 (4 peptides, 14 PSMs), resulting from upstream translation were found in the A431 data set to be downregulated 24 h after treatment, whereas their canonical counterparts showed slight upregulation (HNRNPUL2) or no clear regulation (DRAP1) (Fig. 5c and Supplementary Data 8). Similarly, N-terminal extended proteins resulting from in-frame upstream TIS such as those from RAB12 (2 peptides, 2 PSMs) and WDR26 (1 peptide, 4 PSMs, from a UTR region which accommodates only one theoretical tryptic peptide of MS-amenable size) showed downregulation after treatment while their respective canonical proteins remained constant. Translational competition between upstream TIS and canonical TISs has been observed previously, in accordance to the “leaky scanning” model of translation38. Here, we observed four cases of usage of upstream TIS decreased by the EGFR inhibition treatment.