Protein Sequence Analysis Using the MPI Bioinformatics Toolkit

The MPI Bioinformatics Toolkit (https://toolkit.tuebingen.mpg.de) provides interactive access to a wide range of the best‐performing bioinformatics tools and databases, including the state‐of‐the‐art protein sequence comparison methods HHblits and HHpred. The Toolkit currently includes 35 external and in‐house tools, covering functionalities such as sequence similarity searching, prediction of sequence features, and sequence classification. Due to this breadth of functionality, the tight interconnection of its constituent tools, and its ease of use, the Toolkit has become an important resource for biomedical research and for teaching protein sequence analysis to students in the life sciences. In this article, we provide detailed information on utilizing the three most widely accessed tools within the Toolkit: HHpred for the detection of homologs, HHpred in conjunction with MODELLER for structure prediction and homology modeling, and CLANS for the visualization of relationships in large sequence datasets. © 2020 The Authors.

Over the years, the Toolkit has established itself as an important resource for molecular biology research, mainly due to the sensitive sequence-comparison tools HHblits and HHpred, which, in many instances, can detect homologous relationships that are not readily recognized by other tools. A further strength of the Toolkit lies in the tight interconnection of the tools, allowing the results of one tool to be forwarded as input to others; for instance, the output of a PSI-BLAST search could be forwarded to Clustal to obtain a multiple sequence alignment (MSA) of the identified matches or to MMseqs2 to obtain a reduced set filtered by pairwise sequence identity. Finally, our implementations of some external tools offer enhanced features, such as versions of the NCBI nonredundant (nr) database for PSI-BLAST that are clustered down to 30% (nr30), 50% (nr30), 70% (nr30), or 90% (nr90) sequence identity.
In this article, we describe detailed protocols for the application of the three most frequently used tools. Basic Protocol 1 describes how to use HHpred to search for remote homologs of a protein and make inferences about its domain composition, structure, function, and evolution. The Alternate Protocol describes the pairwise comparison mode of HHpred, which allows two protein sequences or MSAs to be compared with each other. The Support Protocol describes how to build a custom, high-quality MSA starting with a protein sequence and use it as input for HHpred. Basic Protocol 2 describes how to use HHpred in conjunction with MODELLER (Sali & Blundell, 1993) to build a three-dimensional (3D) structural model for a protein sequence of interest. Basic Protocol 3 describes the use of PSI-BLAST in conjunction with CLANS to detect distant homologs of a protein of interest and then visualize the relationships between the detected homologs. To demonstrate these protocols, we use as an example the experimentally uncharacterized FtsZ protein of the Asgard group archaeon Prometheoarchaeum syntrophicum strain MK-D1, which currently represents the closest cultured prokaryotic relative of eukaryotes (Imachi et al., 2020). In most bacteria, many archaea, all chloroplasts, and some mitochondria, with the latter two representing endosymbiosisderived eukaryotic organelles, FtsZ forms filaments that assemble into a ring (Z-ring) at the future site of cell division Margolin, 2005;Szwedziak, Wang, Bharat, Tsim, & Lowe, 2014). Notably, eukaryotic tubulins, which polymerize to form microtubules, a major component of the cytoskeleton, are remotely homologous to FtsZ (Nogales, Downing, Amos, & Lowe, 1998). FtsZ and tubulins are GTPases that comprise an N-terminal GTP-binding domain with a highly conserved GGGTG(T/S)G motif associated with GTP binding and a C-terminal regulatory domain (Erickson, 1998). Strikingly, the pairwise sequence identity between FtsZ and tubulins is lower than 15%. Therefore, most sequence search methods fail to substantiate a homologous relationship between them. We note that the structure, function, and evolution of FtsZ and tubulins have been studied extensively, and that their evolutionary relatedness is also widely accepted (Erickson, 1998;Nogales et al., 1998). However, for instructional purposes, we will assume that the homology between them is unclear. In the following, we show how the Toolkit could be used to investigate the relationship between FtsZ and tubulins.

SEQUENCE SIMILARITY SEARCHING USING HHpred
An almost ubiquitous first step in the characterization of a protein is the identification of functionally and structurally characterized homologs using BLAST (Altschul et al., 1997) or HMMER (Potter et al., 2018). Frequently, however, these search methods fail to detect statistically significant connections to characterized proteins. In many such cases, the more sensitive sequence search method HHpred , which is based on the comparison of profile hidden Markov models (HMMs), is able to establish connections to remotely homologous, characterized proteins. Starting from a single protein sequence, HHpred builds a multiple sequence alignment using HHblits  or PSI-BLAST (Altschul et al., 1997) and annotates the obtained alignment with the predicted secondary structure using PSIPRED (Jones, 1999). Next, this annotated alignment is converted to a profile HMM and compared to each profile HMM in user-selected target databases, which represent proteins of known structure or annotated protein families. Such databases are, for example, the Pfam (El-Gebali et al., 2019), CDD (Lu et al., 2020), and SMART (Letunic & Bork, 2018) domain databases; the SCOPe (Fox et al., 2014) and ECOD (Cheng et al., 2014) structural classification databases; the Protein Data Bank (Berman et al., 2000); and proteomes of several model organisms. Database HMMs are built using three iterations of HHblits over UniRef30 (Mirdita et al., 2017), which is a version of the UniRef sequence database (Suzek et al., 2015) clustered into groups of similar sequences at a length coverage of at least 80% and a maximum pairwise sequence identity of 30%. Like query HMMs, database HMMs include secondary structure information, either predicted by PSIPRED or assigned based on 3D structure by DSSP (Joosten et al., 2011;Kabsch & Sander, 1983). The inclusion of secondary structure information significantly increases the sensitivity of HHpred. The output of HHpred is a list of the closest homologs, with pairwise alignments.

Necessary Resources Hardware
A desktop computer, a laptop, or a tablet with Internet access Gabler et al.

Figure 1
Submission page of HHpred. The submission page of all tools within the Toolkit, including HHpred, contains two tabs: (A) 'Input' and (B) 'Parameters.' In the 'Input' tab, the amino acid sequence of FtsZ from P. syntrophicum in FASTA format (UniProt ID: A0A5B9D775) is shown as an example, and the target database is set to PDB_mmCIF30 (version of July 23, 2020). In the 'Parameters' tab, default values are set for all parameters, except 'Max target hits' (= 500).

Software
An up-to-date, JavaScript-enabled Web browser (preferably Google Chrome, Mozilla Firefox, or Apple Safari)

Input files
A protein sequence (in FASTA format or as plain text) or a multiple protein sequence (MSA) alignment (in FASTA, STOCKHOLM, or CLUSTAL format)

Current Protocols in Bioinformatics
Example'), uploading the input sequence as a file ('Upload File'), and activating the pairwise comparison mode ('Align two sequences/MSAs'). The 'Parameters'tab provides drop-down lists for customizing different input parameters (Fig. 1B). Options to access the help pages, toggle between windowed mode and full-screen mode, enter a custom job identifier, and submit a job are provided on both tabs.
2. Paste the amino acid sequence of your protein of interest (in FASTA format or as plain text) or an MSA (in FASTA, CLUSTAL, or STOCKHOLM format) into the large textbox (Fig. 1A). Alternatively, the input sequence or MSA can be uploaded using the 'Upload File' option. Follow the Support Protocol to build a custom MSA, starting with a protein sequence of interest. In Figure 1A, we use the putative FtsZ protein of the archaeon P. syntrophicum as an example (UniProt ID: A0A5B9D775; NCBI ID: WP_147661771).
3. Select target profile HMM database(s) against which you wish to compare the query protein (Fig. 1A). For our example, we will set 'Max target hits', which controls how many matches will be displayed in the results, to 500, and use default values for all other input parameters (Fig. 1B).
5. Optionally, assign your job a custom identifier by entering one in the 'Custom Job ID' text field (Fig. 1). The identifier should contain at least two characters. If this text field is left empty, an identifier is assigned automatically.
For our example jobs, we will let the Toolkit assign identifiers automatically.
6. Start your search by pressing the 'Submit' button.
Upon submitting a job, a new tab that shows the current status of the job is appended. Also, an entry for the job is added to the job pane located on the left of the screen. This pane provides an overview of all jobs in the current session, allows their sorting by different criteria, gives access to individual jobs, and includes a search box to identify jobs. A job starts immediately or is queued depending on our compute cluster's actual load at the time of submission. If a previously completed job with identical input and parameters is found, an option to reload the results is offered. In the job pane, jobs are color-coded based on their current status: queued jobs are colored gray, running jobs yellow, completed jobs green, failed jobs red, and jobs with an identical copy in our database colored lavender. A new submission with modified parameters and target database(s) can be initiated from a running or completed job by simply switching to the 'Input' or 'Parameters' tabs.
HHpred search results 7. Typical HHpred searches take about 5 min to run through. However, searches involving long input sequences (>600 residues), large input MSAs, higher MSA generation iterations (4 or more), or multiple target databases could take hours to complete.
Gabler et al.

Current Protocols in Bioinformatics
Upon successful completion, the status tab is removed from the job view, and five new tabs are appended: 'Results','Raw Output','Probability Plot','Query Template MSA',and 'Query MSA' (Fig. 2). 8. The 'Results' tab presents information on the detected matches in a user-friendly and interactive manner (Fig. 2).

The 'Hitlist' section (Fig. 2B) provides a tabular listing of matches sorted by their probability of being a true positive (column 'Probability'). It includes columns with information on identifiers ('Hit'), descriptions ('Name'), E-values ('E-value'), raw scores ('Score'), secondary structure scores ('SS'), lengths of the aligned region ('Aligned Cols'), and lengths of the template ('Target Length'). The hits can be sorted by clicking on the column headers or filtered by a keyword using the search box above the table. Clicking on an index number in the leftmost column takes one to the corresponding query-template alignment in the 'Alignments' section.
For the calculation of probability, the raw score ('Score') and the secondary structure score ('SS') are considered. The raw score is computed using the Viterbi HMM-HMM alignment and the secondary structure score from the alignment of secondary structure assignments between query and template, as provided by PSIPRED (3-state) or determined by . The E-value is the average number of false positives (wrong hits) with a score better than the one for the given template in the target database(s). While E-values close to 0 signify a very reliable hit, an E-value of 10 indicates that about 10 wrong hits are expected to be found in the database with a score at least this good. The P-value is the E-value divided by the number of sequences in the database. It is the probability that a wrong hit will score at least this well in a pairwise comparison. Unlike Probability, E-value and P-value are calculated without taking the secondary structure score into account.    indicating that our query protein is homologous to eukaryotic tubulins. The proteins share an overall pairwise sequence identity of only ∼14%, but both possess a highly conserved GTP-binding motif. The search against the ECOD_F70 and Pfam-A databases indicated that our query protein comprises two domains, an N-terminal Tubulin/FtsZ family GT-Pase domain followed by an FtsZ-family C-terminal-like domain (Fig. 3). The search against the proteome of S. cerevisiae yielded as best matches tubulins and Dml1p, a protein involved in the partitioning of mitochondria, all at probability values greater than 97% (Fig. 4).

of 30
Current Protocols in Bioinformatics 9. The 'Raw Output' tab allows visualizing and downloading the raw output file yielded by an HHpred search (Fig. 5A). It is advisable to download and save this file for future reference.
10. The 'Probability Plot' tab displays a cumulative histogram of the hits and can be used to obtain a count of matches with probability values higher or lower than a given value.
11. The 'Query Template MSA' tab provides access to an MSA comprising the query sequence and sequences of all the obtained hits. It provides options to download the complete alignment ('Download MSA') or to forward the alignment to other tools ('Forward Selected'), either completely ('Select All') or only for individually selected sequences.
12. The 'Query MSA' tab provides access to the MSA built by the HHpred server for the query (Fig. 5B). The tab displays the 200 most divergent sequences and allows an MSA of selected or all sequences to be forwarded to other tools ('Forward Selected'). This tab also includes options for downloading this reduced query alignment or the full alignment in A3M format, a space-efficient format that we use internally to store alignments. Alignments in A3M format can be converted to FASTA using the FormatSeq tool offered within our Toolkit (https:// toolkit.tuebingen.mpg. de/ tools/ formatseq).

PAIRWISE SEQUENCE COMPARISON USING HHpred
The pairwise mode of HHpred allows the comparison of two sequences or MSAs. This is particularly useful when you wish to substantiate a homologous relationship between two proteins that you suspect to be homologous, compare proteins that do not exist in our profile HMM databases, or obtain an HMM-HMM based alignment of two distantly related proteins. HHpred builds MSAs for the two input sequences using HHblits or PSI-BLAST, assigns secondary structure using PSIPRED, and converts the annotated MSAs to profile HMMs. In the next step, it compares the computed HMMs and reports an alignment if a match is found that satisfies the cutoffs set in 'Parameters'. For proteins that contain multiple homologous repeats or domains, it typically reports two or more alignments. For detailed information on using HHpred, please refer to Basic Protocol 1.

Necessary Resources
Same as for Basic Protocol 1 2. Click on the switch labeled 'Align two sequences/MSAs', located below the query textbox, to activate the pairwise comparison mode of HHpred. A second sequence input textbox will be shown (Fig. 6).
3. Paste the amino acid sequences of your proteins of interest (in FASTA format or as plain text) or two MSAs (FASTA, CLUSTAL, or STOCKHOLM format) into the two textboxes. Alternatively, upload them using the 'Upload File' option.
4. Customize the input parameters in the 'Parameters' tab. Refer to step 4 of Basic Protocol 1 for more information on this.
For our example, we will use default values for all input parameters.
5. Optionally, assign your job a custom identifier by entering one in the 'Custom Job ID' text field.
6. Start your search by pressing the 'Submit' button.
HHpred search results 7. Typical pairwise comparisons take about 5 to 10 min. However, searches involving long input sequences (>600 residues) could take several hours to complete.

of 30
Current Protocols in Bioinformatics 'Query MSA' (Fig. 7). For a detailed description of these tabs, please refer to steps 7-12 of Basic Protocol 1. Figure 7 shows the outcome of comparing P. syntrophicum FtsZ with human Tubulin alpha-1A. In a comparison performed on September 3, 2020, HHpred matched them with a probability value of 99.85% and an E-value of 3.3e-24. Although the two sequences are very divergent and share a pairwise sequence identity of only 14%, the GTP-binding site is very clearly conserved in both sequences.

BUILDING A CUSTOM MULTIPLE SEQUENCE ALIGNMENT USING PSI-BLAST AND FORWARDING IT AS INPUT TO HHpred
The sensitivity of HHpred searches depends significantly on the quality of MSAs built for the query sequence. For building these MSAs, by default the HHpred server uses three iterations of HHblits over UniRef30 or allows using PSI-BLAST over nr70. Occasionally, however, the query MSAs may not be diverse enough or may be corrupted due to the inclusion of non-homologous sequences, resulting in no statistically significant matches or false positives, respectively. In such cases, using custom-built MSAs as input may significantly increase the sensitivity and reliability of an HHpred search. In the following, we show how to build a custom MSA using PSI-BLAST over a user-selected sequence database, such as the nonredundant protein sequence database (nr), UniProtKB/TrEMBL

Figure 7
Results of comparing P. syntrophicum FtsZ with human Tubulin alpha-1A. HHpred substantiated a homologous relationship between them at a probability value of 99.85% and an Evalue of 3.3e-24. Although the two sequences share a pairwise sequence identity of only 14%, the pairwise alignment shows the clear conservation of the GTP-binding motif.

Necessary Resources
Same as for Basic Protocol 1  2. Paste the amino acid sequence of your protein of interest (in FASTA format or as plain text) or an MSA (in FASTA, CLUSTAL, or STOCKHOLM format) into the large textbox (Fig. 8A). Alternatively, the input sequence or MSA can be uploaded using the 'Upload File' option.
In Figure 8A, we use the amino acid sequence of FtsZ from P. syntrophicum in FASTA format as an example (UniProt ID: A0A5B9D775; NCBI ID: WP_147661771).
3. Select a target protein sequence database in the drop-down list over which you wish to build the alignment (Fig. 8A).

of 30
Current Protocols in Bioinformatics Figure 9 'Results' tab of PSI-BLAST. On September 3, 2020, our search with P. syntrophicum FtsZ over nr_arc70 returned 459 matches (A); 'E-value cutoff for reporting' was set to 1e-10 and 'Max target hits' to 1000. An MSA of these matches can be forwarded as input to HHpred (B).
many closely related sequences, and searches over them predominantly return only such sequences. However, the nr70 database is a good choice for most standard cases, as it reduces redundancy while maximizing diversity. Nonetheless, nr70 might still be too redundant for certain proteins, with PSI-BLAST searches over it only returning closely related sequences. In such cases, we recommend using nr50, because it offers a good compromise between redundancy and diversity.
For our example sequence, we will run the search over the nr_arc70 to build the alignment (Fig. 8A).
Gabler et al.

of 30
Current Protocols in Bioinformatics 4. Customize input parameters in the 'Parameters' tab (Fig. 8B).

Detailed information on each parameter is available in the help pages. To build a deep alignment, one could make the search criteria less stringent by using permissive cutoffs for 'E-value cutoff for reporting' and 'E-value cutoff for inclusion', or use more search iterations ('Number of iterations'). Contrarily, to build a focused alignment, use stringent cutoffs and just a single search iteration.
For our example, we will set 'E-value cutoff for reporting' to 1e-10 and 'Max target hits' to 1000 (Fig. 8B).
5. Optionally, assign your job a custom identifier.
6. Start your search by pressing the 'Submit' button.

PSI-BLAST search results
7. Typical PSI-BLAST searches take about 3 min. However, searches involving long input sequences (>600 residues) over large databases such as nr could take much longer to complete.  (Fig. 9A).
Forwarding an alignment to HHpred 8. Inspect the obtained hits and unselect spurious ones.

By default, all hits with an E-value better than the value set for the 'E-value cutoff for inclusion' are selected. To exclude seemingly non-homologous hits, inspect all pairwise alignments in the 'Alignments' section, especially those at the non-significant end, and unselect them. It is also often useful to exclude short alignments resulting from partial sequences in the databases.
9. Click on 'Forward' in the floating toolbar to forward an alignment of hits to HHpred. These can be the ones selected in step 8 ('Selected'), or all hits satisfying an E-value cutoff ('E-value better than'). In the 'Forward' modal, select 'HHpred' in the selection list at the bottom left corner and click on the 'Forward' button to send the alignment to HHpred (Fig. 9A).
This will provide an entry to step 2 of Basic Protocols 1 and 2 (Fig. 9B).

CALCULATION OF HOMOLOGY MODELS USING HHpred AND MODELLER
The availability of 3D structures is extremely useful for the functional characterization of proteins. However, for many proteins, no experimental structures are available. Since proteins with recognizable sequence similarity generally also have quite similar 3D structures, a structure for a protein of interest can be modeled computationally from its sequence, based on homology to proteins of known structure. This approach is referred to as comparative modeling or homology modeling. In the following, we show how to use HHpred to select homologous templates of known structure for a query protein and how to extract and subsequently forward their alignment to MODELLER (Sali & Blundell, 1993), a popular program for homology modeling. Please refer to Basic Protocol 1 for detailed instructions on using HHpred.

Necessary Resources
Same as for Basic Protocol 1 Gabler et al.

of 30
Current Protocols in Bioinformatics We use the amino acid sequence of FtsZ from P. syntrophicum as an example (UniProt ID: A0A5B9D775; NCBI ID: WP_147661771); please refer to Figure 1A. 3. Select either PDB_mmCIF70 or PDB_mmCIF30 as the target database. If other target databases are included, the option for modeling is not offered.
For our example, we will run the search over the PDB_mmCIF30 database (Fig. 1A).
4. Customize input parameters in the 'Parameters' tab. The default values for the various parameters are set to yield the best results for most standard cases, and we recommend using them, at least in the initial steps of the analysis.
For our example, we will use default values for all input parameters, except 'Max target hits' = 500 (Fig. 1B).
5. Optionally, assign your job a custom identifier.
6. Start your search by pressing the 'Submit' button.
Selecting templates for modeling 7. On the 'Results' page of HHpred, analyze the obtained templates and select one or more as wished (Fig. 10A). Next, click on 'Model using selection' in the floating toolbar offered at the top of the tab to generate an alignment of the query and the selected templates. If a user clicks on 'Model using selection' without having selected any template, the first hit is used for modeling.
Refer to Basic Protocol 1 for an explanation of the HHpred 'Results' page. The quality of the calculated homology model depends primarily on the selected templates. We recommend inspecting the probability values, E-values, identities, secondary structure scores, and alignments to identify the best templates. High-scoring templates with lengths similar to that of the query and few gapped columns in the pairwise query-template alignment generally represent the best templates.
On September 3, 2020, our search yielded several high-scoring templates with the same length as our query. We selected the first five matches as templates for modeling (Fig. 10A).
8. After clicking on 'Model using selection', the query-template alignment is displayed in PIR format in a new job view after a few minutes (Fig. 10B). Click on the 'Forward to MODELLER' button to send this alignment as input to MODELLER.

Running MODELLER
9. On the submission page of MODELLER, enter your MODELLER key (Fig. 11A).
To obtain a key for MODELLER, please register at the following site: http:// salilab. org/ modeller/ registration.shtml. MODELLER is available free of charge to academic users.

Figure 10
Selection of templates for modeling. An HHpred search with P. syntrophicum FtsZ over PDB_mmCIF30 (performed on September 3, 2020) returned several high-scoring matches. The first five hits were selected as templates for modeling (A), a query-template alignment was generated by clicking on 'Model using selection' (B), and forwarded to MODELLER.
11. Typical MODELLER jobs take less than 5 min to run through. In the resulting view, the modeled structure is displayed in the NGL Viewer application (Rose et al., 2018) for molecular visualization, and an option is also provided to download the atomic coordinates (Fig. 11B). Figure 11B shows the structural model generated for FtsZ from P. syntrophicum.
Gabler et al.

of 30
Current Protocols in Bioinformatics Figure 11 Submission and output pages of MODELLER. The shown input (A) was generated as described in Basic Protocol 2, step 8. The 3D structural model built for P. syntrophicum FtsZ is shown in B.

CLUSTER ANALYSIS USING CLANS
Although HHpred is extremely powerful in detecting remote homologs of a protein of interest, it may occasionally not find any meaningful connections because the available target databases contain only a significantly filtered subset of known sequences. Since building profile HMMs is computationally expensive, we currently do not include profile HMMs of large sequence databases for HHpred. In the following, we show how the speed of PSI-BLAST and the power of all-against-all pairwise comparisons can be used to detect remote homologs. For this, we will exploit the observation that the Gabler et al.

of 30
Current Protocols in Bioinformatics non-significant part of PSI-BLAST searches frequently contains many biologically meaningful connections (indeed, for most proteins, most homologs have non-significant scores). These non-significant pairwise connections nevertheless collectively allow sequences to cluster within a larger sequence space, revealing family relationships. Here, we combine PSI-BLAST with our cluster analysis tool, CLANS (Frickey & Lupas, 2004), to illustrate this approach. CLANS is an implementation of the Fruchterman-Reingold graph-drawing algorithm. It represents protein sequences as points in a virtual 2D or 3D space and allows them to attract or repel each other in proportion to the statistical significance of their all-against-all pairwise comparison. CLANS then searches for the global energy minimum of this landscape of forces, yielding a map in which related sequences group together in connected clusters and unrelated ones drift to the periphery. Here, we will collect input sequences by exploiting the Toolkit implementation of PSI-BLAST, which allows changing the target database between iterations. We will start by building a high-quality PSI-BLAST profile on a small, focused database (nr_arc70) using a strict E-value cutoff (1e-10), then search for all potential homologs up to an E-value of 100 in a more comprehensive database (uniport_prot). We will then forward the obtained hits to the CLANS tool within the Toolkit to carry out an all-against-all sequence comparison. Finally, we will load the resulting CLANS file into the CLANS desktop application to visualize the relationships between the hits and identify groups of related sequences.

Necessary Resources
Same as for Basic Protocol 1, but additionally, the Java Runtime Environment or Java Development Kit (https:// www.oracle.com/ java/ technologies/ javase-jre8-downloads.html or https:// openjdk.java.net/ install) needs to be installed on the hardware, and at least 4 GB RAM is recommended PSI-BLAST-Iteration 1 1. Navigate your Web browser to the submission page of PSI-BLAST at https:// toolkit. tuebingen.mpg.de/ tools/ psiblast.
2. Paste the amino acid sequence of your protein of interest (in FASTA format or as plain text; Fig. 12A). Alternatively, the input sequence can be uploaded using the 'Upload File' option.
In Figure 12A, we use the amino acid sequence of FtsZ from P. syntrophicum in FASTA format as an example (UniProt ID: A0A5B9D775; NCBI ID: WP_147661771).
3. Select a target database. For more information on this step, please refer to Support Protocol.
For our example sequence, which originates from an archaeon, we will create a focused alignment to be used as the input for the second iteration of PSI-BLAST by searching for its homologs in nr_arc70.

Customize input parameters in the 'Parameters' tab. Detailed information on each parameter is available in the help pages.
For our example, we will use a stringent E-value cutoff ('E-value cutoff for reporting' = 1e-10) to reduce the chances of obtaining spurious hits. We will set 'Max target hits' to 5000 and use default values for all other parameters.
5. Optionally, assign your job a custom identifier and click on the 'Submit' button to start the first iteration of PSI-BLAST.
Typical PSI-BLAST searches take about 3 min. However, searches involving long input sequences (>600 residues) over large databases such as nr could take much longer to complete. Upon completion of the job, the 'Results' tab of PSI-BLAST will be shown. Figure 12 'Input' and 'Results' tabs of PSI-BLAST-iteration 1. In the first iteration of PSI-BLAST with P. syntrophicum FtsZ over nr_arc70 (A), 459 hits were identified; 'E-value cutoff for reporting' was set to 1e-10 and 'Max target hits' to 5000. An MSA of these hits was forwarded as input to PSI-BLAST (B).

PSI-BLAST-Iteration 2
6. To initiate the second iteration using an MSA of the obtained hits, forward these to PSI-BLAST by clicking on 'Forward' in the floating toolbar on the 'Results' page of PSI-BLAST (Fig. 12B). 'PSI-BLAST' is selected by default. Press the 'Forward' button to forward the MSA. 7. On the submission page of PSI-BLAST, select a target protein sequence database (Fig. 13A) and customize input parameters for the second iteration of PSI-BLAST.
For our example, we will search the more comprehensive uniprot_sprot database (Fig.  13A). Since we wish to detect biologically interesting matches among lower-scoring sequences, we will set the 'E-value cutoff for reporting' to 100. We will use default values for all other parameters, except 'Max target hits' (= 5000).
8. Optionally, assign your job a custom identifier and click on the 'Submit' button to start the second iteration of PSI-BLAST.
Gabler et al.

of 30
Current Protocols in Bioinformatics Forwarding sequences to CLANS 9. On the 'Results' page of PSI-BLAST, click on 'Select All' to select all obtained matches (Fig. 13B). By default, only matches with E-values lower than the 'E-value cutoff for inclusion' are selected.
10. Click on 'Forward' to send sequences of the obtained matches to CLANS. In the 'Forward' modal, select 'CLANS' in the drop-down list at the bottom left corner, select 'Full-length sequences', and click on the 'Forward' button ( Fig. 13B).
11. On the submission page of CLANS, click on the 'Submit' button to start the job (Fig. 14A).

Visualizing sequence relationships in CLANS
12. Depending on the number and length of the input sequences, the CLANS server needs up to several hours to perform all-against-all pairwise PSI-BLAST comparisons. The output page of CLANS offers hyperlinks to the computed raw file in ZIP format, which contains the all-against-all pairwise similarity scores as measured Gabler et al.

of 30
Current Protocols in Bioinformatics by PSI-BLAST P-values, the CLANS application (clans.jar), and a user guide (Fig. 14B).
13. Download the zipped raw file and extract it. 15. In the cluster map loaded within the CLANS application, protein sequences are shown as dots (nodes). Initially, they are placed randomly in a 3D space. Lines (edges) connecting the dots can be shown by selecting 'show connections' in the bottom panel. They reflect the significance of the sequence similarity between the sequences; the darker a line, the higher the sequence similarity.
To generate a publication-quality image, we recommend using a 2D space instead of the default 3D space (Misc > Cluster in 2D).
16. Click on 'Start run' in the bottom panel to start a clustering run. 18. Analyze the obtained clusters, annotate them, and produce a publication-quality image of the cluster map.
Gabler et al.

of 30
Current Protocols in Bioinformatics Figure 15 Cluster map of FtsZ and Tubulin sequences. The raw CLANS file obtained in Basic Protocol 3, step 13 was loaded, clustered, and annotated in the CLANS desktop application. In the map, dots represent sequences, and line coloring reflects sequence similarity; the darker a line, the higher the similarity. Tubulins are colored red, FtsZ magenta, and CetZ green.

Current Protocols in Bioinformatics
In the map obtained on September 3, 2020, FtsZ (colored magenta) and Tubulin (red) sequences formed connected, central clusters (Fig. 15), additionally including a cluster of FtsZ-like euryarchaeal proteins (CetZ; green). All other sequences are located at the periphery of the map and are not connected to these clusters. In a situation in which possible evolutionary connections have not yet been analyzed (as they have been in the FtsZ/Tubulin superfamily), this cluster map could have formed the basis for computational and experimental investigations on the assumption of homology.

Background Information
Two proteins are said to be homologous if they descended from a common ancestor. Generally, homologous proteins have a similar structure and, depending on the degree of divergence, similar functions, cellular localization, or ligands. Since homology of proteins offers a rich source of functional and structural information, the inference of homology has become an essential tool in molecular biology research and underpins the use of model organisms to study biological processes. The divergent evolution of proteins from hypothetical ancestral forms is generally inferred from the similarity of modern representatives. These comparisons are usually made with sequence data because sequence space is essentially infinite and convergence by chance is, therefore, improbable. In contrast, the number of folded conformations available to the polypeptide chain is limited. Hence, unrelated proteins tend to converge on similar structural solutions, especially at the subdomain level. Also, sequence data are easier to obtain than structural data and, thus, more plentiful by orders of magnitude. Over the years, many different sequence comparison methods have been developed. They achieve different levels of sensitivity, depending on the amount of information they incorporate. Methods that compare individual protein sequences, such as BLAST, are the least sensitive, as they use only the information from the pairwise comparison of two sequences, scored by a global substitution matrix. An additional level of sensitivity is achieved by methods that compare sequence profiles to sequences, such as the iterated version of BLAST, PSI-BLAST. Profiles record the frequencies of the 20 amino acids for each column of an MSA and therefore include family-specific information for the query sequence. Profileto-profile comparison methods, such as COMPASS (Sadreyev, Tang, Kim, & Grishin, 2009), provide an additional improvement by using family-specific information for both sequences being compared. Incorporation of position-specific insertion and deletion frequencies into profiles yields profile hidden Markov models (HMMs). Methods based on HMM-to-HMM comparison, such as HHpred, are currently our most sensitive tools in the detection of sequence similarity.

Understanding Results
The protocols presented here are meant to allow non-expert users to identify distant homologs to their protein of interest and evaluate potential structural similarities. Since the inference of homology becomes progressively more difficult with decreasing pairwise sequence identity, we focus here on such difficult, divergent protein pairs. The region between 20% and 35% pairwise sequence identity has been named the 'twilight zone', and the region below it the 'midnight zone' (Rost, 1999); for almost all proteins, most homologs are in these zones. Hence, progress in sequence search tools has been mainly measured by their ability to substantiate homology far into the midnight zone. HHpred, the main tool discussed here, is generally acknowledged as the currently best-performing tool for sequence comparisons. Nevertheless, there are instances where it gives scores indicative of homology to proteins not actually related to the query. It is therefore important to evaluate the plausibility of its results based on a few simple guidelines.
• Check probability and E-value: The probability value reported by HHpred for a match to be a true positive is the most important criterion to infer if a match is homologous to the query or is just a high-scoring chance hit. When it is greater than 95%, evolutionary relatedness is highly likely. Typically, one should give a match serious consideration if it has a probability value >50%, or it has a probability value >30% and is among the top three hits. The E-value is an alternative measure of statistical significance. It is the number of chance hits with a score better than the one for the given match that is expected to be found in the target database. The lower the E-value, the more significant the match is. Unlike the truepositive probability, the HHpred E-value does not take secondary structure similarity into account. Thus, it is a less sensitive measure than Gabler et al.

of 30
Current Protocols in Bioinformatics the probability. Consequently, even when the E-value is ∼1, matches can be significant by the probability criterion.
• Check secondary structure similarity: If the secondary structure of the query and match is substantially different, the match is probably a false positive.
• Check relationships among top hits: If several of the top matches are homologous to each other, for instance, when they are members of the same SCOPe superfamily or ECOD homology level, then their likelihood of being homologous to the query is very high.
• Check if homology is biologically suggestive: Does the database hit have a function you would also expect for your query? Does it come from an organism that is likely to contain a homolog of your query protein?
• Check for possible conserved motifs: Most homologous pairs of proteins will have at least one (semi-)conserved motif in common. You can identify such putative (semi-)conserved motifs by inspecting HHpred alignments for clusters of three or more well-matching columns (marked with a '|' sign in the row between the query and template consensus sequences) and also by matching consensus sequences. Some false positive matches may have high scores due to possessing an amino acid composition similar to that of the query. In such cases, the alignments tend to be long and lack conserved motifs. You could also scan the alignments for motifs known to be involved in enzymatic function or binding of ligands, such as the GTP-binding motif discussed in this report.
• Check query and template alignments: A corrupted query or template alignment is the main source of high-scoring false positives. The two most common sources of corruption in an alignment are (1) nonhomologous sequences, especially repetitive or low-complexity sequences in the alignment, and (2) non-homologous fragments at the ends of the aligned database sequences. Inspect the query and template MSAs for the presence of spurious sequences. In fact, the HHpred server displays an alert message when coiled-coil, transmembrane, or low-complexity segments are detected in the query.
• Check if you can reproduce the results with other parameters: For instance, if you expect the query to be globally homologous to the putative homolog, you could re-run the search using the global alignment mode instead of the local one. You could turn off secondary structure scoring if you suspect that the match between the query and template was scored highly because of a chance similarity of their PSIPRED-predicted or DSSPdetermined secondary structures. You can also run the query over other databases to check if similar matches are returned.