How to Illuminate the Dark Proteome Using the Multi‐omic OpenProt Resource

Tens of thousands of open reading frames (ORFs) are hidden within genomes. These alternative ORFs, or small ORFs, have eluded annotations because they are either small or within unsuspected locations. They are found in untranslated regions or overlap a known coding sequence in messenger RNA and anywhere in a “non‐coding” RNA. Serendipitous discoveries have highlighted these ORFs’ importance in biological functions and pathways. With their discovery came the need for deeper ORF annotation and large‐scale mining of public repositories to gather supporting experimental evidence. OpenProt, accessible at https://openprot.org/, is the first proteogenomic resource enforcing a polycistronic model of annotation across an exhaustive transcriptome for 10 species. Moreover, OpenProt reports experimental evidence cumulated across a re‐analysis of 114 mass spectrometry and 87 ribosome profiling datasets. The multi‐omics OpenProt resource also includes the identification of predicted functional domains and evaluation of conservation for all predicted ORFs. The OpenProt web server provides two query interfaces and one genome browser. The query interfaces allow for exploration of the coding potential of genes or transcripts of interest as well as custom downloads of all information contained in OpenProt. © 2020 The Authors.


INTRODUCTION
Historically, open reading frames (ORFs) shorter than 100 codons were discarded from genome annotations unless previously characterized, as they were deemed too short to be functional (Cheng et al., 2011). This length criterion, alongside the requirement of an ATG start codon and the restriction of a single coding sequence per transcript, has considerably shaped and limited the exploration of the proteome (Brunet, Levesque, Hunting, Cohen, & Roucou, 2018; Hellens, Brown, Chisnall, Waterhouse, & Macknight, 2016; Olexiouk & Menschaert, 2016; Orr, Mao, Storz, & Qian, 2019). Deeper ORF annotation is key to functional proteomic discoveries and a better understanding of physiological and pathological mechanisms (Ma et al., 2014; Menschaert et al., 2013; Samandi et al., 2017). With the development of ribosome profiling (Ingolia, 2014), a technique detecting ribosome-protected fragments (footprints) originating from translating ribosomes, all translation events across the genome can potentially be captured. This observation led the community to use ribosome profiling to capture a deeper ORF landscape and identify small ORF (sORF) candidates for functional characterization (Andreev et al., 2015a, 2015b; Bazzini et al., 2014; Chen et al., 2020; Menschaert et al., 2013). Several repositories of sORFs have been published, all relying on ribosome profiling data for ORF annotation (Hao et al., 2017; Olexiouk, Van Criekinge, & Menschaert, 2018; Xie et al., 2016).

Table 1. Protein categories in OpenProt
RefProt: Known protein present in current annotations (Ensembl and/or RefSeq) and/or UniProtKB. Accession: ENSP***, NP_*** or XP_***, or UniProt accession.
Isoform: Non-annotated protein with high homology to a known protein from the same gene. Accession: II_***.
AltProt: Non-annotated protein with no significant homology to a known protein from the same gene. Accession: IP_***.

Despite being an undeniable resource for novel ORF identification, the ribosome profiling technique still presents biases that can hinder detection of functional yet non-annotated ORFs, including ORFs in low-abundance transcripts or in repetitive regions (Brar & Weissman, 2015;Brunet et al., 2018;Ingolia, 2014;Ingolia, Ghaemmaghami, Newman, & Weissman, 2009;Raj et al., 2016).
At OpenProt, we computationally predict all possible alternative ORFs (altORFs) from an exhaustive transcriptome. All known transcripts are retrieved from both Ensembl and NCBI RefSeq annotations (O'Leary et al., 2016; Yates et al., 2020), and after in silico translation, all ORFs starting with an ATG and longer than 30 codons are listed. The predicted proteins are then divided into three categories (Table 1): known proteins are called refProts, non-annotated proteins similar to a known protein in the same gene are called novel isoforms, and non-annotated proteins with no significant similarity to a known protein in the same gene are called altProts. Such a computational strategy allows for annotation of an exhaustive set of ORFs, yet it also certainly results in false positives. Thus, OpenProt evaluates protein conservation and mines ribosome profiling and proteomics datasets to cumulate experimental evidence for all predicted proteins (Barrett et al., 2013; Deutsch et al., 2020; Perez-Riverol et al., 2019). All evidence is listed in the OpenProt resource, allowing the user an in-depth review of evidence for any protein supported by OpenProt. OpenProt currently supports 10 species and explores 114 proteomics and 87 ribosome profiling datasets. For a complete overview of the OpenProt resource, including the computational and analytic methods, we refer the user to the original publication. The web server also contains a detailed help section with tutorials and frequently asked questions (https://openprot.org/p/help).
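To make the prediction step concrete, the following is a minimal sketch of an in silico ORF scan over the three forward frames of a transcript. It is not the OpenProt pipeline: the function name `find_orfs`, the choice to report only the first ATG of each open stretch, and the reading of the length threshold as "at least 30 codons, stop codon excluded" are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(transcript, min_codons=30):
    """Return (frame, start, end) for ATG-initiated ORFs of at least min_codons codons.

    frame is 1, 2, or 3, with +1 assigned to the first nucleotide of the
    transcript. start and end are 0-based, end-exclusive transcript
    coordinates, and end includes the stop codon.
    Assumption: only the first ATG of each open stretch is reported.
    """
    seq = transcript.upper()
    orfs = []
    for offset in range(3):
        start = None
        for pos in range(offset, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                n_codons = (pos - start) // 3  # coding codons, stop excluded
                if n_codons >= min_codons:
                    orfs.append((offset + 1, start, pos + 3))
                start = None
    return orfs


# Hypothetical transcript: a 30-codon ORF in frame +3, then a 1-codon ORF.
example = "CC" + "ATG" + "GCT" * 29 + "TAA" + "ATGTAA"
print(find_orfs(example))  # → [(3, 2, 95)]
```

An ORF without an in-frame stop codon before the end of the transcript is silently dropped in this sketch; whether such cases are kept is a design decision the text does not specify.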
Basic Protocol 1 described here guides the novice user in how to explore ORFs using the Search interface. This protocol is designed to provide a rapid view of the coding potential and translation products of genes of interest. Basic Protocol 2 describes how to download custom data from the OpenProt resource. Guidelines for investigation of a specific altORF are provided afterward, alongside discussion on critical parameters.

USING THE SEARCH INTERFACE
This protocol details the use of the Search interface to query specific genes, transcripts, or proteins. The interface is optimized to accommodate many questions that a researcher may have. For example, a researcher may want to know if a specific gene contains novel ORFs with supporting experimental evidence or whether a given transcript may contain several ORFs. The protocol will guide novice users in how to exploit the Search interface of the OpenProt resource. First, the protocol describes how to navigate to the Search interface from the homepage. Then, it details the features available on the interface to tailor the results to any query. Finally, the protocol explains how to investigate a specific ORF of interest.

Necessary Resources
OpenProt is accessible via all major web browsers supporting JavaScript, such as Safari, Firefox, Chrome, or Internet Explorer. All pages can be viewed on mobile phones, but the interfaces have been optimized for display on computers or tablets. The Search interface is designed for exploration of specific genes, transcripts, and/or proteins.

Home
The Home tab navigates to the OpenProt homepage. The page contains general information about the web server, the reasons to use the OpenProt resource, links to detailed tutorials, and an overview of the concept and methods behind OpenProt.

Browse
The Browse tab navigates to a genome browser for each species, with customizable tracks, allowing visualization of all ORFs present in OpenProt.

Search
The Search tab navigates to the OpenProt query interface.

Downloads
The Downloads tab navigates to a query interface for custom downloads.

About
The About tab navigates to a page containing general information about the resource, the developers and funding agencies, and OpenProt publications.

Help
The Help tab navigates to a page containing detailed tutorials and frequently asked questions (FAQs) about OpenProt.

Table 3 (excerpt)
Frame: The +1 frame is assigned to the first nucleotide of the transcript. Available options are 1, 2, or 3.
Order by: A list of supported sorting rules for the table of results. The MS, TE, and predicted Domains scores are always sorted in descending order.

Any combination of these filters is possible, and each is described, with complementary notes, in Table 3. Table 3 also contains an overview and description of the advanced criteria.
5. Use "order by" and "column settings" filters to sort and arrange the table of results (see Table 3).

Exploring the table of results
6. Click on blue box "update search results" (Fig. 2) to view the table of results (Fig. 3).
7. Go to bottom of the table to navigate between the different pages of results.

This table contains general information about the proteins fitting the search criteria specified by the user (Table 4).
The total number of proteins fitting the search criteria is displayed next to the blue "search" box above the table of results (Fig. 3). The OpenProt web server shows 20 protein entries per page.
8. Click "share" button, which appears at the top right of the table, next to the sorting and download options, to display a shareable link to this specific search result.
The link shared above for the example search was generated using this feature (https://openprot.org/p/savedSearch/LCa).
Inspecting a specific protein
9. Click "details" link in the main table of results to navigate to a page dedicated to the queried protein.
Brunet et al.

Current Protocols in Bioinformatics

Protein type
Possible entries are RefProt, Isoform, or AltProt. "AltProt" is written in red.

Protein length
The length of the protein is reported in amino acids (a.a.).
OpenProt annotates all known proteins and any novel protein longer than 30 amino acids.

Experimental evidence: MS
This column reports the mass spectrometry (MS) score for the given protein.
The MS score corresponds to the sum of unique peptides detected per study.

Experimental evidence: TE
This column reports the translation event (TE) score for the given protein.
The TE score corresponds to the sum of studies with at least one significant detection of translation.

Functional prediction: Domains
This column reports the number of predicted functional domains for the given protein.
Prediction of functional domains is done using InterProScan.

Functional prediction: Orthology
This column reports the number of species with at least one ortholog for the given protein, as well as the species concerned.
The species names are abbreviated using the first letters of the species and subspecies, and they are colored based on the identity percentage of the orthologous protein pair (the darker, the higher).

Species
This column indicates the species from which the given protein originates.

Gene
This column indicates the gene from which the given protein originates.
The gene name is retrieved from the annotation (Ensembl and/or NCBI RefSeq).

Transcript accession
This column indicates the accession number of the transcript from which the given protein originates.
The transcript accession is retrieved from the annotation (Ensembl and/or NCBI RefSeq).

Type
This column indicates the type of transcript from which the given protein originates.
Possible entries are ncRNA (non-coding RNA) or mRNA (messenger RNA).

Localization
This column indicates the localization of the given altProt on the transcript relative to the canonical protein associated with this transcript.
Within mRNAs, the localization of altORFs is defined according to the localization of the predicted start codon with respect to that of the refProt. Possible entries are 5′ UTR, CDS (overlapping), and 3′ UTR. For altORFs within ncRNAs, no localization is inferred.

Details
This column contains a link to the page dedicated to the given protein.
This page is detailed in Table 5.
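The Localization rule in the table above can be sketched as a small function. This is an illustration under stated assumptions rather than OpenProt code: the function name, the argument names, and the 0-based, end-exclusive transcript coordinates are all hypothetical.

```python
def altorf_localization(alt_start, cds_start, cds_end, transcript_type="mRNA"):
    """Classify an altORF by the position of its start codon.

    alt_start: first nucleotide of the altORF start codon.
    cds_start, cds_end: span of the refProt coding sequence on the same
    transcript (0-based, end-exclusive; an assumed convention).
    Returns None for ncRNAs, where no localization is inferred.
    """
    if transcript_type != "mRNA":
        return None
    if alt_start < cds_start:
        return "5' UTR"
    if alt_start < cds_end:
        return "CDS (overlapping)"
    return "3' UTR"


print(altorf_localization(10, 100, 400))   # → 5' UTR
print(altorf_localization(150, 100, 400))  # → CDS (overlapping)
print(altorf_localization(450, 100, 400))  # → 3' UTR
```

Note that only the start codon is considered, so an altORF that begins in the 5′ UTR and extends into the CDS is still labeled 5′ UTR under this rule.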
This page provides all the information contained in OpenProt for this protein. The accession number of the protein being investigated is always written at the top left of the page. As an example, we will use the altProt IP_662512 present in the table of results of the aforementioned search (https://openprot.org/p/savedSearch/LCa). The page opens in the "info" tab, which displays an overview of the genomic and transcriptomic information associated with the protein (Fig. 4). The info page is detailed in Table 5.
The OpenProt page dedicated to a protein contains five tabs: the "info" tab (described in Fig. 4 and Table 5), the "mass spectrometry" tab (see step 12), the "translation" tab (see step 13), the "domains" tab (see step 14), and the "conservation" tab (see step 15; all described in Fig. 5). This allows the user to review all the experimental evidence and functional predictions for the queried protein.
10. Use genome browser to visualize the queried protein and the associated transcript.
The peptides detected by mass spectrometry and assigned to the queried protein are displayed in the peptide track (Fig. 4).

Each line corresponds to a different transcript but to the same protein (same amino acid sequence).
Gene - This column contains the name of the gene from which the queried protein originates.
In rare exceptions, Ensembl and NCBI RefSeq annotations might not use the same synonym for the gene name.
Annotation - This column contains the annotation from which the queried protein is derived.
All supported annotations are listed in Table 3.
Genomic coordinates - This column contains the genomic coordinates of the queried protein.
These coordinates do not correspond to the gene or the transcript, but rather to the queried protein mapped back onto the genome.
Strand - This column indicates on which genomic strand the queried protein is encoded.
The strand is retrieved from the annotation.

Mass spectrometry
This tab navigates to a page listing mass spectrometry-based evidence for the queried protein.
The mass spectrometry tab is detailed in Figure 5.

Translation b
This tab navigates to a page listing ribosome profiling-based evidence for the queried protein.
The translation tab is detailed in Figure 5.

Domains
The domains tab is detailed in Figure 5.

Conservation
This tab navigates to a page listing orthologs and paralogs of the queried protein.
The conservation tab is detailed in Figure 5.
a Symbols and abbreviations: #, flag name in Figure 4; TIS, translation initiation sequence.
b PMIDs are 7301588 and 12459250.
c PMID is 25170020.
12. Click on "mass spectrometry" tab to review experimental detection of the queried protein in mass spectrometry-based proteomic datasets.
The number displayed on the tab corresponds to the MS score, defined as the sum of unique peptides detected per study.
13. Click on "translation" tab to review experimental detection of translation of the queried ORF in ribosome profiling datasets.
The number displayed on the tab corresponds to the TE score, defined as the sum of studies with at least one significant detection of ORF translation. This tab provides an overview of all studies and samples in which translation of the queried ORF was detected by re-analysis of ribosome profiling datasets by OpenProt. OpenProt uses an ORF prediction algorithm, PRICE, to analyze ribosome profiling datasets (Erhard et al., 2018). All ribosomal data (elongating and initiating footprints) are combined to estimate the ORF most likely to produce such a set of footprints. Thus, for each detection in a study, OpenProt can assign a confidence to the initiating codon and a p-value to the ORF detection itself (Fig. 5).
14. Click on "domains" tab to review prediction of functional domains for the queried protein using multiple domain annotation databases.
The number displayed on the tab corresponds to the Domains score, defined as the sum of functional domains predicted from the protein sequence. This tab provides an overview of all domains predicted as well as the database in which each domain is described.
15. Click on "conservation" tab to review conservation of the queried protein across species supported by OpenProt.

The number displayed on the tab corresponds to the Conservation score, defined as the sum of all species supported by OpenProt with at least one ortholog. This tab provides an overview of all orthologs and paralogs of the queried protein detected by the OpenProt conservation analysis. Two trees are accessible on the page (Fig. 5).

[Figure 5: annotated views of the evidence tabs]
Mass spectrometry tab: The study column contains three pieces of information: the name (usually its accession number), the link (usually to the public repository), and the publication (PMID number). The peptide column contains the sequences of identified peptides for each study. The match count column contains the number of peptide-spectrum matches for each peptide in each study.
Translation tab: link to the original study; annotation used; genomic coordinates; detected start codon of the detected ORF; transcript type; score for the start codon (the larger the better); score for the start codon relative to the genomic context (the larger the better); confidence of the detected ORF not being attributable to noise; sample names with associated read counts for the detected ORF; total read count for the detected ORF; transcript and protein accession, followed by the overlap between the detected ORF and the one predicted by OpenProt.
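The scores shown on the evidence tabs (steps 12 to 15) are simple counts, which the sketch below reproduces from their stated definitions. The input layouts (dictionaries keyed by study or species) and the function names are assumptions made for illustration; they do not mirror any OpenProt file format or API, and how "significant" is decided for a translation event is left to a precomputed boolean.

```python
def ms_score(peptides_by_study):
    """Sum, over studies, of the number of unique peptides detected in each study."""
    return sum(len(set(peptides)) for peptides in peptides_by_study.values())


def te_score(significant_by_study):
    """Number of studies with at least one significant detection of translation."""
    return sum(1 for flags in significant_by_study.values() if any(flags))


def conservation_score(orthologs_by_species):
    """Number of supported species with at least one ortholog."""
    return sum(1 for orthologs in orthologs_by_species.values() if orthologs)


# Hypothetical evidence for a single protein.
peptides = {"study_A": ["PEPTIDEK", "PEPTIDEK", "ANOTHERR"], "study_B": ["PEPTIDEK"]}
detections = {"ribo_1": [True, False], "ribo_2": [False], "ribo_3": []}
orthologs = {"mouse": ["IP_hypothetical_1"], "rat": [], "chimpanzee": ["IP_hypothetical_2"]}

print(ms_score(peptides))             # → 3
print(te_score(detections))           # → 1
print(conservation_score(orthologs))  # → 2
```

Because the MS score sums unique peptides per study, the same peptide seen in two studies counts twice, as in the example above.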

USING THE DOWNLOADS INTERFACE
This protocol details the use of the Downloads interface to retrieve a large amount of data stored in the OpenProt resource. The interface is optimized to obtain custom downloads for specific research questions. For example, a researcher may want to download a FASTA file containing only the sequences of altProts and novel isoforms with experimental evidence or a BED file containing the genomic coordinates of all proteins predicted by OpenProt. This protocol will guide novice users in how to exploit the Downloads interface of the OpenProt resource. First, the protocol describes how to navigate to the Downloads interface from the homepage. Then, it details the features available on the interface to tailor results to any query. Finally, the protocol explains the different file formats available.

Necessary Resources
See Basic Protocol 1.
Navigating from the homepage to the Downloads interface
1. Navigate to Downloads interface according to Basic Protocol 1, steps 1 and 2.
As explained above in Basic Protocol 1 (step 2), at the top of the OpenProt homepage (https://openprot.org/), the navigation bar contains six tabs: Home, Browse, Search, Downloads, About, and Help (Fig. 1; Table 2).

Exploring the Downloads interface
2. Use query filters to define a search.
The Downloads interface is accessed either directly at https://openprot.org/p/download or through the homepage, as described in step 1. The interface is pictured in Figure 6, where the query filters are indicated by green circles, the summary columns of the table of results by orange squares, and downloadable file options by yellow triangles. This page is designed to allow the user to query specific downloads for optimal use in downstream analyses.
All the filters are described, with additional notes, in Table 6. In contrast to the Search interface (Basic Protocol 1), not all species can be selected at once on the Downloads interface. This limitation is due to the excessive size of the resulting files. Users can either download data for each species individually or write to the developers if the sought-after information is not available. The authors can be contacted using the light blue "contact us" link at the bottom of the page.

3. Click on desired file name to start the download.
For every query, four file formats are available, as described in Table 6. These are designed to optimize any downstream analyses using OpenProt data.
OpenProt is a release-based resource and is continuously developed in accordance with the FAIR Guiding Principles for scientific data management and stewardship (Wilkinson et al., 2016). This ensures continuous availability of all data in OpenProt over time.

GUIDELINES FOR UNDERSTANDING RESULTS
OpenProt is a proteogenomic resource that seeks experimental evidence for predicted novel proteins from non-annotated ORFs. OpenProt is open source: all methods and code are published and freely available (Samandi et al., 2017), and all supported data are freely accessible and downloadable (Basic Protocols 1 and 2). At OpenProt, we predict all possible ORFs longer than 30 codons throughout the annotated transcriptome for 10 species. This approach was chosen to be as inclusive as possible for predictions and to then retrieve experimental evidence for each prediction. Thus, OpenProt is not dependent on a specific experimental bias, but the user has to be aware that false positives are a reality with such a design. Not all predicted proteins in OpenProt are likely to be expressed. Because noise and nonspecific detections vary across experimental datasets and designs, we encourage users to seek experimental detection across multiple datasets to increase confidence in an altProt and/or the existence of a novel isoform.
Broadly, there are two major usages of the OpenProt resource. First, users may be interested in a specific gene or transcript and wonder whether they are capturing its full coding potential. To that end, users should use the OpenProt Search interface (Basic Protocol 1) and investigate each predicted protein in detail (Fig. 5). Second, users may be interested in analyzing their mass spectrometry-based proteomic data with the OpenProt database. Users should use the OpenProt Downloads interface for such a query (Basic Protocol 2). If users wish to tailor their mass spectrometry database to a specific set of transcripts, we encourage them to download the full database and keep only entries of interest based on the transcript accession (TA field in the FASTA header). Users may also use the OpenProt Search interface to query specific transcripts and download the results as a FASTA file. Please note, however, that for computational reasons, such queries are limited to 2000 genes (or transcripts) at a time.
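Tailoring a downloaded FASTA database to a set of transcripts, as suggested above, might look like the following sketch. The text only states that the transcript accession is carried in a TA field of the FASTA header; the pipe-delimited `TA=` layout assumed here, the function names, and the accession `ENST_hypothetical_1` are illustrative assumptions, so the header parsing should be checked against an actual downloaded file.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def transcript_accession(header):
    """Return the value of the TA field, assuming 'key=value' fields split by '|'."""
    for field in header.split("|"):
        if field.startswith("TA="):
            return field[3:]
    return None


def filter_by_transcripts(text, wanted):
    """Keep only the entries whose transcript accession is in `wanted`."""
    wanted = set(wanted)
    return [(h, s) for h, s in parse_fasta(text) if transcript_accession(h) in wanted]


# Hypothetical two-entry database.
database = """>IP_662512|TA=ENST_hypothetical_1
MKTAYIAKQR
QISFVK
>IP_000001|TA=ENST_hypothetical_2
MSEQ
"""
for header, sequence in filter_by_transcripts(database, ["ENST_hypothetical_1"]):
    print(header, sequence)
```

Filtering locally on the full download sidesteps the 2000-gene limit of the Search interface, at the cost of handling a larger file.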
Crucial OpenProt features and considerations heavily depend on the research question behind the query. For any question or additional information on data analysis and interpretation, contact the OpenProt team via the light blue "contact us" button at the bottom of all OpenProt pages (https://groups.google.com/forum/#!forum/openprot).

Background Information
The premises of the OpenProt resource were first published in 2013 (Vanderperre et al., 2013). The former HAltORF database was a mere list of altORFs within the human transcriptome (based on the NCBI RefSeq annotation). Community-driven requests and serendipitous discoveries contributed to the desire and need to develop OpenProt as the first proteogenomic resource to enforce a polycistronic annotation model on both coding RNA (messenger RNA, or mRNA) and non-coding RNA (ncRNA) transcripts (Samandi et al., 2017). The OpenProt resource was first officially released in 2019, contains 10 species, and cumulates experimental evidence using mass spectrometry and ribosome profiling data. Using cutting-edge algorithms for ribosome profiling and mass spectrometry data mining (Erhard et al., 2018; Vaudel, Barsnes, Berven, Sickmann, & Martens, 2011), OpenProt re-analyzed 87 and 114 datasets, respectively. OpenProt not only lists novel proteins with experimental evidence but also allows critical assessment of the evidence by the user. OpenProt is constantly re-analyzing datasets and adding new features, but all data are continuously available thanks to the release-based structure of the resource. Suggestions of new features or additional species or datasets from the community are always welcome and can be submitted via the OpenProt discussion forum (https://groups.google.com/forum/#!forum/openprot).

Critical Parameters and Troubleshooting
We refer the user to the original article for explanation of the mass spectrometry pipeline enforced by OpenProt, yet one needs to acknowledge the stringent 0.001% false discovery rate (FDR). Such an FDR balances the use of a large database that can affect the false positive rate in proteomics analyses (Jeong, Kim, & Bandeira, 2012). Thus, an absence of detection by mass spectrometry in OpenProt does not necessarily mean that the protein does not exist. Such a pipeline will heavily hinder the detection of some proteins. As a guideline, the same standard mass spectrometry analysis filtered at a usual 1% FDR or the stringent 0.001% FDR may only share 40 to 80% of identifications depending on the spectral quality of the dataset (unpub. observ.). Similarly, an absence of detection by ribosome profiling does not mean that there is no evidence of translation. At the moment, OpenProt only incorporates ORFs predicted by the translation analysis pipeline (PRICE) that have a perfect overlap with the ORF predicted by OpenProt. Thus, if a start codon is a non-canonical codon upstream or downstream of the ATG predicted by OpenProt, no translation evidence will be reported by OpenProt. Implementation of such cases will be available in the next OpenProt release. Additionally, one should note that the p-value reported by the PRICE algorithm for each detected ORF is the result of a generalized binomial test (not corrected for multiple comparisons). Hence, the p-value indicates the confidence in the given ORF not being attributable to noise.
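The perfect-overlap requirement can be illustrated with a short sketch showing why a detection with a shifted start codon yields no reported translation evidence. The `Orf` type, the coordinate convention (0-based, end-exclusive transcript positions), and the overlap measure (shared length divided by the combined span) are assumptions chosen for illustration; the way OpenProt itself computes the displayed overlap is not described here.

```python
from typing import NamedTuple


class Orf(NamedTuple):
    transcript: str
    start: int  # 0-based position of the first nucleotide of the start codon
    end: int    # end-exclusive position after the stop codon


def overlap_fraction(predicted, detected):
    """Shared length divided by the combined span of the two ORFs (assumed measure)."""
    if predicted.transcript != detected.transcript:
        return 0.0
    shared = min(predicted.end, detected.end) - max(predicted.start, detected.start)
    span = max(predicted.end, detected.end) - min(predicted.start, detected.start)
    return max(shared, 0) / span


def is_reported(predicted, detected):
    """Translation evidence is reported only when both ORFs coincide exactly."""
    return predicted == detected


# Hypothetical case: the detected ORF starts at a non-canonical codon 30 nt upstream.
predicted = Orf("transcript_X", 120, 420)
detected = Orf("transcript_X", 90, 420)
print(is_reported(predicted, detected))                 # → False
print(round(overlap_fraction(predicted, detected), 3))  # → 0.909
```

In this example the two ORFs share the same stop codon and about 91% of their combined span, yet no translation evidence would be attached to the predicted ORF under the perfect-overlap rule.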
Finally, for each protein, a list of paralogs and orthologs is provided in the "conservation" tab (described in Fig. 5). The user should note that this list is restricted to species currently supported by OpenProt (listed in Table 3). For a more exhaustive list, the user may want to use the BLASTp tool (Madden, Tatusov, & Zhang, 1996) to search a specific protein against a reference database such as the non-redundant NCBI or the UniProtKB protein database (Bateman et al., 2017; O'Leary et al., 2016). This analysis may identify proteins with significant sequence similarity in various species.

[...] supercomputer mp2 from Université de Sherbrooke. Operation of the mp2 supercomputer is funded by the Canada Foundation for Innovation (CFI), le ministère de l'Économie, de la science et de l'innovation du Québec (MESI), and les Fonds de Recherche du Québec.