Using ExpressAnalyst for Comprehensive Gene Expression Analysis in Model and Non-Model Organisms
Published in the Bioinformatics section
Abstract
ExpressAnalyst is a web-based platform that enables intuitive, end-to-end transcriptomics and proteomics data analysis. Users can start from FASTQ files, gene/protein abundance tables, or gene/protein lists. ExpressAnalyst will perform read quantification, gene expression table processing and normalization, differential expression analysis, or meta-analysis with complex study designs. The results are presented via various interactive visualizations such as volcano plots, heatmaps, networks, and ridgeline charts, with built-in functional enrichment analysis to allow flexible data exploration and understanding. ExpressAnalyst currently contains built-in support for 29 common organisms. For non-model organisms without good reference genomes, it can perform comprehensive transcriptome profiling directly from RNA-seq reads. These common tasks are covered in 11 Basic Protocols. © 2023 The Authors. Current Protocols published by Wiley Periodicals LLC.
Basic Protocol 1: RNA-seq count table uploading, processing, and normalization
Basic Protocol 2: Differential expression analysis with linear models
Basic Protocol 3: Functional analysis with volcano plot, enrichment network, and ridgeline visualization
Basic Protocol 4: Hierarchical clustering analysis of transcriptomics data using interactive heatmaps
Basic Protocol 5: Cross-species gene expression analysis based on ortholog mapping results
Basic Protocol 6: Proteomics and microarray data processing and normalization
Basic Protocol 7: Preparing multiple gene expression tables for meta-analysis
Basic Protocol 8: Statistical and functional meta-analysis of gene expression data
Basic Protocol 9: Functional analysis of transcriptomics signatures
Basic Protocol 10: Dose-response and time-series data analysis
Basic Protocol 11: RNA-seq reads processing and quantification with and without reference transcriptomes
INTRODUCTION
With rapid progress in sequencing and mass spectrometry technologies, studies involving omics data collection have become ubiquitous in the life sciences. Making sense of these large, complex omics datasets requires advanced, specialized analysis pipelines, yet many researchers lack the bioinformatics or programming skills to handle such data (Alyass et al., 2015). There is an urgent demand for user-friendly software to relieve the omics data analysis bottleneck. Here we provide detailed protocols for using ExpressAnalyst, a web-based platform that provides end-to-end support for common tasks in transcriptomics data analysis (Liu et al., 2023). While many of the modules were originally designed for RNA-seq or microarray data (Zhou et al., 2019), we have added proteomics-specific annotation libraries and normalization methods so that the differential expression and functional analysis methods can also be applied to abundance tables from proteomics.
The core statistical and functional analysis modules were originally part of the NetworkAnalyst tool, and our previous protocol (Xia et al., 2015) covers some of this functionality. The general statistical and functional analysis modules were split from the network analysis module to form the basis of ExpressAnalyst. ExpressAnalyst was expanded to include bulk RNA-seq processing, annotation and functional libraries for ecological species (common reference transcriptomes and Seq2Fun ortholog IDs), as described in our recent publication (Liu et al., 2023). All modules were further modified to support complex metadata, including continuous variables and the ability to consider multiple factors during differential expression analysis, and to support proteomics intensity/abundance tables. Finally, ExpressAnalyst integrates the FastBMD workflow to enable dose-response and time-series analysis (Ewald et al., 2021). The web interface has also been configured to display a live R command history throughout the analysis. Users with basic R scripting skills can install the ExpressAnalystR package (see Internet Resources) for batch processing and transparent, reproducible analysis. The command history can also be included as supplementary material in any publication using ExpressAnalyst.
A general transcriptomics analysis has four main steps: raw data processing, filtering and normalization, statistical analysis, and functional analysis (Fig. 1) (Conesa et al., 2016). Each step produces different results: raw data processing generates a table of expression values; the filtering and normalization step produces a clean, normalized table; statistical analysis generates a list of significant features; and functional analysis produces a list of impacted pathways and biological processes. ExpressAnalyst has different modules that serve as “entry points” to the general pipeline: if researchers download and save their results, they can start the analysis at any of the steps indicated in Figure 1. The various protocols that address each of the four steps are outlined in Figure 1. While RNA-seq read processing is chronologically first, we present it last (Basic Protocol 11) as it has the most complicated hardware and software requirements and is not performed frequently. RNA-seq read quantification is usually performed only once and often by dedicated bioinformaticians at a core facility, and the resulting count table is provided to researchers as the most common starting point for exploratory analysis. Basic Protocols 1 to 4 cover filtering and normalization, statistical analysis, and functional analysis of a standard RNA-seq count table. Basic Protocol 5 covers the same main steps but with a cross-species dataset that includes multiple non-model species. It highlights features within ExpressAnalyst that were designed specifically for species without high-quality reference genomes or transcriptomes. Basic Protocol 6 covers filtering and normalization methods specific for microarray or proteomics tables. Basic Protocols 7 and 8 introduce meta-analysis of a set of expression tables including filtering, normalization, and statistical and functional analysis. 
Basic Protocol 9 briefly covers how lists of significant features generated by previous statistical analyses can be uploaded for comparison and functional analysis. Basic Protocol 10 introduces a specialized statistical and functional analysis for dose-response or time-series expression data. Finally, Basic Protocol 11 describes a unified workflow for processing RNA-seq FASTQ files from both model and non-model species. Together, these protocols introduce how ExpressAnalyst empowers researchers to comprehensively analyze their own transcriptomics or proteomics datasets, without programming skills or advanced bioinformatics experience.
- Basic Protocol 1: How to upload, process, and normalize an RNA-seq count table in preparation for statistical and functional analysis.
- Basic Protocol 2: How to perform differential expression analysis for simple and complex experimental designs.
- Basic Protocol 3: How to perform functional analysis and interpret the results with volcano plots, enrichment networks, and ridgeline charts.
- Basic Protocol 4: How to use hierarchical clustering and heatmaps to perform an unsupervised, exploratory analysis.
- Basic Protocol 5: How to perform statistical and functional analysis of a cross-species RNA-seq count table generated by ortholog mapping with Seq2Fun.
- Basic Protocol 6: How to filter and normalize microarray and proteomics intensity tables.
- Basic Protocol 7: How to upload, process, and normalize a set of gene expression tables for meta-analysis.
- Basic Protocol 8: How to perform statistical and functional meta-analysis of gene expression data.
- Basic Protocol 9: How to analyze single or multiple gene expression signatures.
- Basic Protocol 10: How to perform dose-response and time-series analysis.
- Basic Protocol 11: How to process FASTQ files to obtain a gene count table with or without using a reference transcriptome.
Basic Protocol 1: RNA-seq COUNT TABLE UPLOADING, PROCESSING, AND NORMALIZATION
The objective of this protocol is to prepare the data for downstream differential expression and functional analysis using ExpressAnalyst. This includes formatting the input files, mapping transcript identifiers to the internal annotation database, performing a basic quality check on the data, and applying filtering and normalization to remove non-informative genes and to correct for systematic technical differences. This protocol assumes that RNA-seq reads have already been aligned to a transcriptome and summarized in a count table, which is the case for most researchers. If this is not the case and you must start from FASTQ files, please see Basic Protocol 11. This protocol is also specifically written for RNA-seq count data. ExpressAnalyst also accepts abundance tables produced from microarray or proteomics experiments. Many of the overall concepts are the same; however, count data requires specific normalization techniques. For a discussion of microarray intensity and proteomics abundance data processing, please see Basic Protocol 6.
Basic Protocols 1 to 4 use the same dataset, an RNA-seq count file measured in mouse liver (Diamante et al., 2021). It has been previously shown that bisphenol-A (BPA) exposure during pregnancy leads to cardiometabolic disease in offspring. The objective of the original study was to elucidate the mode of action underlying this outcome. The authors exposed pregnant mice to BPA and collected RNA-seq data in the liver from offspring of both sexes, along with bodyweight, insulin secretion, and targeted lipids in the liver and plasma samples. Differential gene expression analysis was conducted between the exposed and control groups to understand the observed phenotypic differences and metabolic outcomes.
Necessary Resources
Hardware
- A computer with internet access
Software
- An up-to-date web browser such as Google Chrome, Mozilla Firefox, or Safari, with JavaScript enabled (see Internet Resources)
Files
- None
1. Go to the ExpressAnalyst home page (https://www.expressanalyst.ca) and click the “Tutorials” link at the top menu bar to visit the tutorial page. Scroll to the bottom of the page and find the “Dataset for the ExpressAnalyst Current Protocol” data section. Download the two text files labeled “mouse_counts.csv” and “mouse_metadata.csv.” Open them in a spreadsheet program or a text editor to view the data format (Fig. 2).
The most frequent help requests that we receive are related to data and metadata formatting. The input files are displayed in Figure 2, including a gene count table for 16 samples (Fig. 2A), and a metadata table describing these samples (Fig. 2B). The first two columns in the metadata table show the main study design: 4 BPA exposed and 4 controls for both male and female mice. Liver and plasma lipids were also measured, which are the continuous values displayed in the remainder of the metadata columns.
2. Go back to the ExpressAnalyst home page and click “Start Here” to access the Module Overview page. Locate the “Statistical & Functional Analysis” section and click the “Start Here” button underneath the single gene expression table input type. On the Data Upload page, set the organism to “M. musculus (mouse),” leave the “Analysis Type” as “Differential Expression,” set the data type to “Counts (bulk RNA-seq),” and the ID type to “Official Gene Symbol.”
Users must specify the correct organism and ID type so that ExpressAnalyst can map feature IDs to its internal annotation databases. Upon data upload, all IDs are converted to Entrez IDs. When multiple features are mapped to the same Entrez ID (for example, different transcript isoforms for the same gene), their expression values are summed if the data are counts and averaged if the data are intensities (microarray, proteomics). If your data are pre-normalized counts, you should upload them as intensities.
It is possible to skip the annotation step by leaving the organism and ID type as “Not specified.” This may be desirable if your species or ID type is not supported, or if you'd like to retain the transcript-level resolution. Please note that in this case, functional analysis will be disabled as functional analysis requires gene level annotation.
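The ID-collapsing rule described above (sum counts, average intensities) can be illustrated with a short Python sketch. ExpressAnalyst performs this step internally in R; the function and data structures below are purely illustrative:

```python
from collections import defaultdict

def collapse_duplicate_ids(rows, is_count_data):
    """Collapse rows that map to the same gene ID: sum counts, average intensities.

    `rows` is a list of (gene_id, [values per sample]) pairs; names are
    illustrative, not ExpressAnalyst's internal API.
    """
    grouped = defaultdict(list)
    for gene_id, values in rows:
        grouped[gene_id].append(values)
    collapsed = {}
    for gene_id, value_lists in grouped.items():
        if is_count_data:
            collapsed[gene_id] = [sum(col) for col in zip(*value_lists)]
        else:  # intensities (microarray/proteomics): average
            collapsed[gene_id] = [sum(col) / len(col) for col in zip(*value_lists)]
    return collapsed

# Two transcript isoforms of the same gene across three samples
rows = [("GeneA", [10, 20, 30]), ("GeneA", [5, 5, 5]), ("GeneB", [1, 2, 3])]
print(collapse_duplicate_ids(rows, is_count_data=True)["GeneA"])   # [15, 25, 35]
print(collapse_duplicate_ids(rows, is_count_data=False)["GeneA"])  # [7.5, 12.5, 17.5]
```

Summing preserves the count nature of RNA-seq data, while averaging keeps intensity measurements on their original scale.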
3. Choose the “mouse_counts.csv” file for the data file, and the “mouse_metadata.csv” for the metadata file. Leave the “Metadata included” box unchecked and click “Submit.” Once the upload has finished, various summary messages will be displayed in the top right corner. Click “Proceed” to view this information in more detail on the next page.
ExpressAnalyst accepts files in either comma-separated (.csv) or tab-delimited (.txt) format. Users have the option of either embedding metadata in the count table and uploading a single file, or formatting metadata in a separate table and uploading two files. The former strategy can be used for simple experimental designs with one or two metadata variables, while the latter is suitable for datasets with complex metadata.
4. In the Data Quality Check page, examine the text summary of the uploaded datasets in the gray box at the top of the “Omics data overview” tab. It shows the sample size, the percentage of features that are matched to the annotation database, as well as the number and type of experimental factors.
The annotation libraries in ExpressAnalyst are updated about once per year, based on the latest ID versions available from NCBI (Entrez, RefSeq), Ensembl, and Uniprot (Brown et al., 2015; Consortium, 2019; Zerbino et al., 2018). If your data were annotated many years ago, you may have a lower percentage of features that map to the ExpressAnalyst database. Also, Official Gene Symbols generally have a lower mapping rate than the other ID types since there can be many synonyms for the same gene, not all of which may be present in our database.
5. Scroll down to view various diagnostic graphics, the first of which is the “Box plot.” Since the expression values range from zero to >10,000, it is clear that these are unnormalized count values. Click each of the additional tabs to view the “Count sum” (displays the total counts from all genes for each sample, also called the sequencing depth), “PCA plot” (scatterplots of the top two principal components), and “Density plot” (distribution of count values for each sample) of the uploaded data. The density plot appears in the shape of an “L,” which is caused by the large range and right-skewed distribution of raw count values.
The figures shown under the dataset summary are useful for visually identifying outlier samples, assessing whether the data are normalized or not, determining appropriate filtering thresholds, and providing a benchmark to compare the effects of normalization. Deciding whether a sample is an outlier that should be removed is not a straightforward process. In general, we wish to remove samples that are substantially different from other samples based on technical reasons. A sample might be different due to biological reasons, in which case it should not be removed as this will bias the downstream statistical analysis and potentially lead to incorrect interpretation of the results. Unfortunately, it is not usually possible to determine whether an outlier is due to technical or biological reasons from the data alone. One guiding principle is that biological variability tends to have a smaller range than technical variability, hence if an outlier is extreme, we can usually assume it is a technical outlier and safely remove it without compromising our statistical inference. Examples of extreme outliers are when the first principal component (PC1) explains >70% of the variability and has a single or small number of samples on the extreme end of PC1, or when the count sum of a sample is several orders of magnitude smaller than the other samples (i.e., sequencing effects).
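The "orders of magnitude" heuristic for sequencing-depth outliers can be made concrete with a simple sketch. This is a conceptual illustration only, not ExpressAnalyst's actual rule; the threshold of two log10 units is an assumption chosen for the example:

```python
import math

def flag_depth_outliers(count_sums, max_log10_gap=2.0):
    """Flag samples whose total counts are orders of magnitude below the rest.

    A simple heuristic (not ExpressAnalyst's exact rule): compare each sample's
    log10 sequencing depth to the upper median depth across samples.
    """
    logs = {s: math.log10(t) for s, t in count_sums.items()}
    med = sorted(logs.values())[len(logs) // 2]  # upper median of log10 depths
    return [s for s, v in logs.items() if med - v > max_log10_gap]

depths = {"S1": 2.1e7, "S2": 1.8e7, "S3": 2.4e7, "S4": 5.0e4}  # S4 ~400x shallower
print(flag_depth_outliers(depths))  # ['S4']
```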
6. Go back to the top of the page and click the “Metadata overview” tab. Scroll down to the metadata table (Fig. 3) and verify that each variable has been correctly recognized as either “Discrete” (Treatment, Sex) or “Continuous” (4 liver and 7 plasma lipid variables). The classification of each metadata variable can be updated using the dropdown menus below individual variable names. Depending on your screen size, some metadata variables may not be visible. To see the additional columns, simply scroll to the right within the table area.
Scrolling to the far right reveals a pencil icon and a garbage can icon for each row, allowing users to edit or delete any metadata values or sample names. If you remove a sample from the metadata table, the corresponding sample in the omics data table will also be removed automatically.
7. Click the “Edit metadata column” button above the metadata table (Fig. 3). Navigate to the “Order (factor-level)” tab and make sure the “Treatment” variable is selected in the dropdown. By default, discrete metadata classes are sorted alphabetically in all downstream plots. However, in some cases, a different order might make more sense. Here, we wish to always plot the control samples on the left and the BPA-exposed samples on the right. Click the “Control” value, use the up-arrow button on the left to move it above “BPA,” and click “Update.”
The other tabs allow users to include/exclude metadata variables from being displayed as options during the downstream statistical analysis, as well as to specify the primary metadata. The primary metadata is used to annotate most visualizations throughout the remainder of the analysis, so users should select the metadata that they are most interested in.
8. Click “Proceed.” A dialog will appear, warning that a few missing values were detected in the metadata and explaining how they will be handled in the downstream steps. Click “OK.”
The downstream differential expression analysis requires complete metadata for any variables included in the model. For example, if both the “Sex” and “liver_TG” variables are included in differential expression analysis (DEA), any sample with at least one missing value for either “Sex” or “liver_TG” will be excluded prior to computing the DEA statistics for that comparison.
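The missing-value handling described above amounts to keeping only complete cases for the variables included in the model. A minimal Python sketch (function and data names are hypothetical):

```python
def complete_cases(metadata, variables):
    """Return sample names with non-missing values for every listed variable.

    Mirrors the behavior described above (a sketch, not the platform's code):
    samples missing any included variable are excluded from that comparison.
    """
    kept = []
    for sample, record in metadata.items():
        if all(record.get(v) is not None for v in variables):
            kept.append(sample)
    return kept

metadata = {
    "M1": {"Sex": "Male", "liver_TG": 42.0},
    "M2": {"Sex": "Female", "liver_TG": None},   # missing lipid measurement
    "M3": {"Sex": "Male", "liver_TG": 38.5},
}
print(complete_cases(metadata, ["Sex", "liver_TG"]))  # ['M1', 'M3']
print(complete_cases(metadata, ["Sex"]))              # ['M1', 'M2', 'M3']
```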
9. Leave the default filtering settings (“Filter unannotated features” checked, “Low abundance” threshold set to 4, and “Variance filter” set to 15), change the normalization method to “Relative log expression normalization,” and click “Submit.”
Filtering out transcripts that are of low confidence or uninformative to the research context can increase the statistical power of the downstream DEA (Bourgon et al., 2010). We can use summary statistics that are agnostic to the metadata (unsupervised), such as average abundance and variability across all samples, to flag transcripts for exclusion. Transcripts with a low average abundance near the detection limit are likely unreliable, while transcripts with very low variability across all samples are unlikely to correlate with any metadata labels. Note that one should avoid using metadata labels relevant to the downstream analysis to decide which transcripts to exclude, as this will introduce bias into the DEA and other supervised methods (Bourgon et al., 2010).
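The two unsupervised filters can be sketched as follows. The thresholds echo the defaults from step 9, but this is a conceptual Python illustration rather than ExpressAnalyst's actual R implementation, and the exact summary statistics it computes may differ:

```python
from statistics import mean, variance

def unsupervised_filter(counts, min_mean=4, variance_percentile=15):
    """Drop low-abundance and low-variance transcripts without using metadata.

    A conceptual sketch: thresholds echo the defaults in step 9, but
    ExpressAnalyst's actual summary statistics may differ.
    """
    # 1. Abundance filter: drop transcripts near the detection limit
    abundant = {g: v for g, v in counts.items() if mean(v) >= min_mean}
    # 2. Variance filter: drop the least variable X% of remaining transcripts
    ranked = sorted((variance(v), g) for g, v in abundant.items())
    n_drop = int(len(ranked) * variance_percentile / 100)
    keep = {g for _, g in ranked[n_drop:]}
    return {g: v for g, v in abundant.items() if g in keep}

counts = {
    "GeneA": [0, 1, 0, 2],        # near the detection limit
    "GeneB": [50, 52, 51, 50],    # abundant but nearly constant
    "GeneC": [20, 400, 35, 300],  # abundant and highly variable
    "GeneD": [100, 10, 90, 15],
}
# On this tiny toy table we raise the variance percentile so it removes a gene
filtered = unsupervised_filter(counts, variance_percentile=40)
print(sorted(filtered))  # ['GeneC', 'GeneD']: GeneA and GeneB are filtered out
```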
The purpose of normalization is to make expression profiles more comparable across samples, and to transform them to be more suitable for statistical analysis and visualization. All normalization options for RNA-seq counts (other than ‘None’) are from the “voom” methods in the limma R package which transform the data to “Log2-counts per million” or logCPM (Law et al., 2014). This transforms data to the log scale and normalizes for sequencing depth, which often varies across samples. Another potential issue with RNA-seq data relates to its compositional nature (Lovell et al., 2015), and the large range of abundances across different transcripts. As we saw in the boxplots and the density plots on the data overview page, most transcripts have counts in the low 10s to 100s range, but a small percentage have many more than this (>15,000). This means that a small number of transcripts can account for >50% of the total counts. If these highly abundant transcripts vary substantially across experimental conditions, they can influence the relative values (such as counts or logCPM) of many other transcripts, even if these transcripts do not change on an absolute scale (Lovell et al., 2015). This is a challenging issue to correct for and impacts different datasets to different extents. The last three normalization methods (“Upper Quantile Normalization,” “Trimmed Mean of M-values,” and “Relative log expression normalization”) implement different strategies to address it (Law et al., 2014).
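The logCPM transformation at the core of these options can be written out directly. The sketch below uses voom's offsets (0.5 added to each count, 1 to each library size) as described by Law et al. (2014); it illustrates the transformation only, without the additional scaling factors that TMM or RLE would apply:

```python
import math

def log_cpm(counts_matrix):
    """Transform raw counts (one list per sample) to log2 counts-per-million.

    Uses voom's offsets (0.5 added to each count, 1 to the library size) to
    avoid taking the log of zero; see Law et al. (2014) for the definition.
    """
    lib_sizes = [sum(sample) for sample in counts_matrix]
    return [
        [math.log2((c + 0.5) / (lib + 1.0) * 1e6) for c in sample]
        for sample, lib in zip(counts_matrix, lib_sizes)
    ]

# Two samples with a 10-fold difference in sequencing depth
raw = [[100, 900, 0], [1000, 9000, 0]]
norm = log_cpm(raw)
print([round(v, 2) for v in norm[0]])  # [16.62, 19.78, 8.96]
print([round(v, 2) for v in norm[1]])  # [16.61, 19.78, 5.64]
```

After the transformation, the nonzero genes become directly comparable across the two depths. The zero counts still differ because the 0.5 offset is depth-dependent, one reason low-abundance transcripts are filtered before normalization.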
10. Scroll down to the figures in the lower half of the page and consider the “Box plot” and “Density plot” tabs.
Both plots look very different after normalization. The normalized expression values are now below 15, no longer show a right-skewed distribution at the sample level, and have very similar distributions across samples. This indicates that the data have been normalized for sequencing depth and transformed to a log scale.
11. Click on the “PCA plot” tab to examine the data patterns based on principal component analysis.
Principal component analysis (PCA) is a widely used dimensionality reduction method that can summarize main variability trends in high-dimensional omics data into a few dimensions for intuitive visualization. In the PCA plot based on the first two principal components, we see that the data fall into four clear clusters (Fig. 4D). The samples are colored according to the primary metadata (Treatment), which reveals that the “Control” and “BPA” samples are separated along PC2. Inspecting the sample names, we can see that the samples are separated according to Sex along PC1, with male on the right and female on the left. This is a sign that there is a strong biological signal with respect to our main metadata of interest. Sometimes the PCA shows that the main trends in the data are related to technical variables such as batch or sample preparation protocols. This does not mean that there are no meaningful patterns in the data related to our metadata of interest, but that they explain less variability in the data than the technical parameters. In these cases, the technical parameters should be accounted for during the statistical analysis. This topic will be covered in Basic Protocol 2.
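The computation behind a PCA plot can be sketched in a few lines. ExpressAnalyst performs this in R; here is a conceptual Python version using singular value decomposition, with a toy matrix in which two samples are strongly shifted (mimicking a dominant factor such as sex):

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project samples (rows) onto the top principal components via SVD.

    A minimal sketch of the computation behind a PCA plot; ExpressAnalyst's
    own implementation (R's prcomp) differs only in convention.
    """
    Xc = X - X.mean(axis=0)                # center each gene (column)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T      # sample coordinates on PC1/PC2
    var_explained = S**2 / np.sum(S**2)    # fraction of total variance per PC
    return scores, var_explained[:n_components]

# Toy table: 4 samples x 5 genes, with samples 3-4 strongly shifted
# (mimicking, e.g., a sex effect dominating PC1)
X = np.array([
    [0.0, 1.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 1.0],
    [5.0, 6.0, 5.0, 6.0, 5.0],
    [6.0, 5.0, 6.0, 5.0, 6.0],
])
scores, var_exp = pca_scores(X)
print(var_exp[0] > 0.9)  # True: PC1 captures the dominant sample-group split
```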
12. Click on the “Mean-variance plot” tab to explore the relationship between the mean and variance of transcript expression values.
There is typically a relationship between the mean and variance of transcript expression values (Liu et al., 2015). While trends may vary across datasets, a typical relationship for unfiltered RNA-seq data is shown in Figure 5A: moving from left to right along the x-axis, the transcript standard deviation increases for a short section, peaks, and then decreases with increasing mean expression values. The initial increasing section is where transcripts are at or near the detection limit, hence the standard deviation is lower than we would expect based on the mean expression, due to the high numbers of zero values. The goal is to eliminate the initial upswing (see the dashed red box in Fig. 5A) by setting an appropriate abundance filter, to produce a consistently decreasing mean-variance trend as in Figure 5B. If the upswing area is not removed, the abundance filter should be increased.
13. Click the “Show R Commands” link in the top right corner to view the R commands history.
The R functions used in ExpressAnalyst are publicly available on the GitHub page (https://github.com/xia-lab/ExpressAnalystR) and can be installed as an R package for local analysis (Fig. 6B). Throughout the analysis, the executed functions are tracked in the R command history. These features are implemented in ExpressAnalyst for transparency and reproducibility, so that users can see exactly which analyses have been performed.
Basic Protocol 2: DIFFERENTIAL EXPRESSION ANALYSIS WITH LINEAR MODELS
The general objective of differential expression analysis (DEA) is to identify genes or transcripts associated with specific experimental factors of interest, while accounting for other major sources of variability within the data (Law et al., 2016). The observed expression patterns can be explained by a combination of technical, biological, environmental, and experimental sources. Technical sources can include different sample preparation or sequencing depths across samples. Biological sources include factors such as sex, age, and circadian rhythm, while examples of environmental sources may encompass the geographic locations of sample collection or lifestyle parameters such as smoking or diet. Finally, experimental sources include any independent variable imposed by the researcher, such as chemical treatments or gene knockouts. In this protocol, we introduce the concepts behind using generalized linear models for performing DEA of gene expression data, explain differences between the main DEA algorithms, and describe how to configure DEA for common experimental designs.
Necessary Resources
Hardware
- A computer with internet access
Software
- An up-to-date web browser such as Google Chrome, Mozilla Firefox, or Safari, with JavaScript enabled (see Internet Resources)
Files
- None
1. This protocol continues where Basic Protocol 1 left off. If you have not yet completed Basic Protocol 1, please do so first. Click “Proceed” to move from the “Data Filtering & Normalization” page to the “Differential Expression Analysis” page to begin this protocol.
2. There are two different parameter interfaces for DEA. The “Simple Metadata” tab is for datasets with only one or two discrete metadata variables, while the “Complex Metadata” tab can handle any number of metadata variables, including both discrete and continuous types. We will work primarily with the “Complex Metadata” interface. Click the “Complex Metadata” tab.
The objective of DEA is to identify genes that are related to a primary metadata of interest. For example, we may want to know which genes have different expression values in the “BPA” compared to the “Control” samples, or we may want to know which genes are associated with a continuous metadata like “Age.” We may want to additionally consider other metadata when calculating these statistics, to account for potentially confounding biological or technical factors. ExpressAnalyst uses generalized linear models to conduct DEA (Fig. 7) (Smyth, 2005). Figure 7 is a simplified representation to highlight how linear models are used. Each of the DEA algorithms makes additional corrections while fitting the model according to their specific statistical assumptions and returns a slightly different set of statistics. Users should always consult the original publications and user manuals for each of the DEA algorithms for more details.
The structure of the linear model and statistics extraction are very flexible and can support a wide range of experimental designs. For example, we could include more model terms if we needed to account for additional parameters like “Batch” or “Age,” or we could specify contrasts when extracting the statistics, such as to evaluate interactions between the modeled variables.
Both discrete and continuous variables can be included in the linear model. Continuous variables like “Age” are represented in the model by a single term. If the primary metadata is a continuous variable, then the extracted results include the model coefficient for the “Age” term. If there is a positive or negative association between “Age” and the expression values, then the coefficient will be positive or negative, respectively. The stronger the association, the greater the magnitude of the coefficient. Discrete variables are slightly more complicated, as we must use ‘dummy coding’ to include them in the model. Consider a categorical variable “Dose” that has three classes: “Control,” “Low,” and “High.” While constructing the model, one will be selected as the reference class (e.g., “Control”) and two terms will be added to the model: “Low:Control” and “High:Control.” If the primary metadata is discrete, the user can explicitly set the reference class and choose the specific term for which to extract statistical results. For example, if you set “Reference group” to “Control” and “Contrast” to “High,” the extracted statistics would be for the “High:Control” term and would describe a comparison between the ‘High’ dose samples and the ‘Control’ samples. In scenarios where the primary metadata is discrete (like this one), the extracted model coefficient is the same as the log2 fold-change (log2FC).
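Dummy coding and the coefficient-equals-log2FC relationship can be made concrete with a short sketch (Python, purely illustrative; the term-naming convention follows the text above):

```python
def dummy_code(values, reference):
    """Dummy-code a discrete variable against a chosen reference class.

    Returns one 0/1 column per non-reference class, named after the
    "Low:Control" / "High:Control" terms described above (illustrative only).
    """
    classes = [c for c in sorted(set(values)) if c != reference]
    return {f"{c}:{reference}": [1 if v == c else 0 for v in values]
            for c in classes}

dose = ["Control", "Control", "Low", "Low", "High", "High"]
design = dummy_code(dose, reference="Control")
print(sorted(design))          # ['High:Control', 'Low:Control']
print(design["High:Control"])  # [0, 0, 0, 0, 1, 1]

# For a two-class variable, the fitted coefficient on the dummy column is the
# difference of group means of log2 expression, i.e., the log2 fold-change:
expr = [5.0, 5.2, 7.0, 7.4]            # log2 expression: 2 Control, 2 BPA
x = dummy_code(["Control", "Control", "BPA", "BPA"], "Control")["BPA:Control"]
treated = [e for e, xi in zip(expr, x) if xi == 1]
control = [e for e, xi in zip(expr, x) if xi == 0]
log2fc = sum(treated) / len(treated) - sum(control) / len(control)
print(round(log2fc, 2))                # 2.1
```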
In the following protocol steps, we will demonstrate how linear models can be configured for different experimental designs and research questions.
3. Change the “Reference group” to “Control” and the “Contrast” to “BPA,” leaving the remaining as the default settings. The selected parameters should be as in Figure 8. Click “Submit” and then click “Proceed.”
4. The simple comparison of all “BPA” samples vs all “Control” samples results in 2,957 transcripts with statistically significant differences in expression values (Fig. 9). Click the image icon under the “Graphical Summary” column for the top gene (Gm20594) to view the expression values across “Treatment” groups. Click the “NCBI” link for Gm20594 to open the gene profile in the National Center for Biotechnology Information database.
The “Complex metadata” tab uses the limma R package (Smyth, 2004), hence the columns in the Results Table are the same statistics that would be returned if you performed DEA in R with limma. The BPA-Control column shows the log2FC, which is the coefficient for the “BPA:Control” term from the linear model. The “AveExpr” column is the mean normalized expression across all samples. The “t,” “P.Value,” and “adj.P.Val” columns are the t-statistic (ratio of the BPA-Control value to the standard error), p-value, and adjusted p-value for that transcript. The “B” column is the B-statistic, also known as the log-odds. The “B” statistic is a more advanced topic and is less commonly used than the p-value to determine differential expression.
5. Click “Previous” to go back to the DEA page. Instead of looking for transcripts that are differentially expressed between treatment groups, we can consider “Sex” instead. Go to the “Complex metadata” tab, change the “Primary metadata” to “Sex,” change the “Contrast” to “Male,” click “Submit” and then “Proceed.”
The default option for “Contrast” is “All contrasts (ANOVA-style).” In cases where there are three or more classes in a discrete primary metadata, and therefore more than one dummy-coded term in the model, we can calculate the p-value using an ANOVA-style test to assess for significant differences in variability across all treatment groups. Both of our discrete metadata (“Sex” and “Treatment”) have only two classes. In this case, the “ANOVA” option will give the same results as when we specify the exact comparison.
6. Now there are 2,456 significant transcripts, including those that are widely known for their sexually dimorphic expression. For example, the Xist transcript responsible for X-chromosome inactivation is 3rd on the list. Once you are done exploring the results, click “Previous” to return to the DEA page.
7. Go to the “Complex metadata” tab. In the last few steps, we identified transcripts associated with “Treatment” and “Sex” separately. However, based on the patterns in the PCA plot that we generated in Basic Protocol 1, we expect that some transcripts are impacted by both factors, and we want to consider this in our statistical analysis. Set “Primary metadata” to “Treatment,” “Reference group” to “Control,” “Contrast” to “BPA,” and select “Sex” from the dropdown next to “Covariates (control for).” Click “Submit” and then “Proceed.” Now there are 3,801 significant features, representing a substantial increase.
To better understand what's going on here, consider two genes: Rpl39 and S100a1 (Fig. 10). When we visualize the raw expression values between the “Control” and “BPA” groups, it is clear that both are differentially expressed. There is much less variability within the Rpl39 treatment groups than within the S100a1 treatment groups; however, the variability in S100a1 is not random. Inspecting the data more closely, we see that the female samples have lower expression values than the male samples in both groups. The strength and direction of the relationship with treatment is nearly identical for both sexes.
Including “Sex” in the model allows it to account for these predictable differences. Without adding sex as a covariate, the p-value for S100a1 is 2.93 × 10-6 and this transcript is the 303rd most significant in the dataset. After adding sex, the p-value drops to 2.08 × 10-11 and S100a1 is now the 4th most significant transcript. In contrast, the p-value for Rpl39 increases slightly, from 3.70 × 10-12 to 9.90 × 10-12. This is expected, as including additional terms in the linear model decreases the degrees of freedom and therefore the statistical power, hence we can expect the p-values to slightly increase when there is no relationship between the transcript and the covariate (i.e., “Sex”).
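The effect of covariate adjustment can be reproduced on simulated data. The sketch below (not the actual BPA dataset; the effect sizes and sample layout are arbitrary choices for illustration) simulates one transcript whose expression depends on both treatment and sex, then fits ordinary least-squares models with and without sex in the design matrix:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def ols_pvalue(y, X, coef_idx):
    """Two-sided p-value for one coefficient in an ordinary least-squares fit."""
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    df = len(y) - X.shape[1]
    sigma2 = resid @ resid / df
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[coef_idx, coef_idx])
    return 2 * stats.t.sf(abs(beta[coef_idx] / se), df)

# Simulated transcript: 10 control + 10 treated mice, balanced sexes;
# sex shifts baseline expression, as for S100a1 in the text.
treatment = np.repeat([0, 1], 10)
sex = np.tile([0, 1], 10)
y = 1.0 * treatment + 2.0 * sex + rng.normal(0, 0.5, 20)

intercept = np.ones(20)
X_without = np.column_stack([intercept, treatment])
X_with = np.column_stack([intercept, treatment, sex])

p_without = ols_pvalue(y, X_without, 1)
p_with = ols_pvalue(y, X_with, 1)
```

Because sex is balanced across treatment groups here, the estimated treatment coefficient is the same in both fits; only the residual variance, and hence the standard error and p-value, changes when sex is included.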
8. Click “Previous” to return to the DEA page and navigate to the “Complex Metadata” tab. Finally, we will show that continuous metadata can be analyzed as well. The metadata for this study contains measurements of four lipids in the liver and seven in the plasma, all of which are continuous (Diamante et al., 2021). While the study was designed to assess these measurements for differences between treatment groups, we can also look for transcripts associated with these values. Select ‘Liver TG’ as the primary metadata.
Notice how the options to select the reference and contrast groups are disabled when we select a continuous variable for the primary metadata, as these are only applicable to discrete variables.
9. If we expect that both lipid and transcript expression levels are perturbed by BPA exposure, we may observe a significant association between some transcripts and “Liver TG.” In this case, we may naively assume that the changes in gene expression cause the changes in lipid levels, when in reality both are caused by BPA exposure, making it a confounding factor. We can avoid this by including “Treatment” as a covariate in the model. Select both “Treatment” and “Sex” as covariates, click “Submit,” and click “Proceed.”
In addition to covariates, it is possible to include a variable in the model as a “Blocking Factor.” Covariates are modeled as fixed effects while blocking factors are modeled as random effects. These terms will be familiar to you if you have experience with mixed effects models. See Basic Protocol 5 for further discussion of fixed vs random effects.
10. There are no significant differentially expressed genes (DEGs) for this statistical comparison. This is not unexpected because this study was not designed to detect transcripts associated with the lipid concentrations. We include it to show how this type of analysis can be conducted for other datasets. Click “Previous” to return to the DEA page.
Continuous metadata variables are becoming more common as the frequency of observational omics studies increases. For example, studies that collect transcriptomics data from human tissues (e.g., plasma) typically include many covariates describing clinical, demographic, and lifestyle parameters. Many of these are continuous and correcting for several covariates is usually necessary.
11. Click the “Simple Metadata” tab (Fig. 11). This tab allows users to choose from three different differential expression algorithms: limma, edgeR, and DESeq2 (Love et al., 2014; Robinson et al., 2010; Smyth, 2005). All rely on the linear modeling framework described above to perform DEA.
The linear model described in the previous steps can be configured in additional ways to construct more sophisticated comparisons between metadata variables. However, it is challenging to support arbitrarily advanced models for an unlimited number of metadata variables of unrestricted types. The “Simple Metadata” tab provides a simplified interface for several common advanced designs involving one or two discrete factors.
The three algorithms make different assumptions about how the raw data are distributed and perform different corrections and filters according to these assumptions. The details most important for ExpressAnalyst users are the differences in data input and in type I and type II error rates. Limma uses the normalized data as input, while edgeR and DESeq2 expect the raw counts, which is why the latter two algorithms are not available for analyzing microarray or proteomics data. This means that if you use edgeR or DESeq2, the filtered counts will be used for DEA while the normalized data will only be used for downstream data visualizations. We point interested readers to publications by the authors of the algorithms for more details on the specific statistical assumptions.
12. Leave “Limma” as the “Statistical method,” set “Primary Factor” to “Treatment” and “Secondary Factor” to “Sex.” Select the “Nested Comparisons” radio button. This allows us to specify any two comparisons between groups, and test whether the relationships defined by these comparisons are significantly different from each other. Here, we will explore whether some genes show different responses to BPA exposure between male and female mice. In the first dropdown, select “Control_Female vs. BPA_Female.” For the second dropdown, select “Control_Male vs. BPA_Male,” and check the box beside “Interaction only.” Click “Submit” and click “Proceed.”
There are only two statistically significant transcripts, but they do have quite distinctive expression patterns across the male and female samples (Fig. 12). The original authors discuss how in this study, BPA exposure had a sex-specific effect on both the omics data and on the higher-level metabolic phenotypes that they measured (Diamante et al., 2021). If follow-up work were to try to elucidate the mechanism, examining the top-ranked genes prioritized by this statistical analysis may be a good starting place.
Basic Protocol 3: FUNCTIONAL ANALYSIS WITH VOLCANO PLOT, ENRICHMENT NETWORK, AND RIDGELINE VISUALIZATION
DEA typically produces a long list (100s to 1,000s) of significant features. While specific transcripts may be meaningful to investigators, it is usually difficult to make sense of the overall biological themes within the list. Functional analysis aims to answer this question by testing whether certain functions or biological processes (encoded as different gene sets) are enriched in the gene expression data. ExpressAnalyst offers two main statistical approaches: overrepresentation analysis (ORA) (Sherman et al., 2022) and gene set enrichment analysis (GSEA) (Subramanian et al., 2005). In ORA, genes in each gene set are compared to the list of differentially expressed genes (DEGs). A gene set is considered overrepresented if the overlap is higher than we would expect from random chance. The p-value is typically computed using a hypergeometric distribution. GSEA, in contrast, takes the entire list of genes (not just DEGs) ranked by some criterion (such as t-statistics or fold-changes), compares each gene set to this ranked list, and determines whether the genes in the gene set are enriched or concentrated at either the top or the bottom of the list. P-values are computed empirically using a permutation-based approach. In this protocol, we introduce three different visual analytics tools (volcano plot, enrichment network, and ridgeline chart) that can be used to perform functional analysis and visualize the results.
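As a worked example of the ORA calculation, the snippet below evaluates one hypothetical gene set with scipy's hypergeometric distribution; all of the counts are invented for illustration.

```python
from scipy.stats import hypergeom

# Illustrative ORA for a single gene set (numbers are hypothetical):
# 20,000 measured genes, 500 of them DEGs, a 100-gene pathway,
# and 10 DEGs observed in that pathway.
total_genes, degs, set_size, overlap = 20000, 500, 100, 10

# Overlap expected by random chance
expected = degs * set_size / total_genes

# P(overlap >= 10) under the hypergeometric null
p_value = hypergeom.sf(overlap - 1, total_genes, set_size, degs)
```

An overlap of 10 DEGs where only 2.5 would be expected by chance yields a small p-value; in practice this is computed for every gene set in the library, and the resulting p-values are then adjusted for multiple testing.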
Necessary Resources
Hardware
- A computer with internet access
Software
- An up-to-date web browser such as Google Chrome, Mozilla Firefox, or Safari, with JavaScript enabled (see Internet Resources)
Files
- None
1. This protocol carries on where Basic Protocol 2 left off. We ended Basic Protocol 2 on a specialized statistical test that only resulted in a few DEGs. You should be on the “Significant Results” page to begin this protocol. Click “Previous” to go back to the DEA page, click the “Complex Metadata” tab, and use the same parameters as step 7 (“Treatment” - “BPA vs. Control” as primary, controlling for “Sex” as a covariate). Click “Submit” and “Proceed.” There should be 3801 significant features. Click “Proceed” again to navigate to the Visual Analytics overview page.
The results displayed in the DEA results table are the ones that will be analyzed in downstream steps. Functional analysis typically requires 100s of genes to reliably detect enriched biological themes and does not work well when only a few DEGs are detected.
2. The “Analysis Overview” allows users to explore their data using six different interactive visual analytics tools (Fig. 13). In this protocol, we will explore the features of the three tools located in the top row. The heatmap will be covered in Basic Protocol 4 and the UpSet diagram in Basic Protocol 8. Click the “Volcano Plot” button.
Please be patient as it may take up to a minute to load the interactive plot. The time is variable depending on the server load, data size and internet bandwidth.
3. A volcano plot that looks like Figure 14 should now appear on the screen. The log2FC values are on the x-axis, with positive values to the right and negative values to the left. The y-axis is based on -log10(p-value), so that more significant p-values will have higher y-axis values. For example, a p-value of 4.0 × 10-7 has a y-axis value of 6.4.
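The y-axis transformation is simply the negative base-10 logarithm of the p-value, which can be verified directly:

```python
import math

# A few example p-values and their volcano plot y-axis coordinates
p_values = [0.05, 1e-3, 4.0e-7]
y_coords = [-math.log10(p) for p in p_values]
# 0.05 -> ~1.3, 1e-3 -> 3.0, 4.0e-7 -> ~6.4
```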
4. By default, ORA has been conducted by comparing the list of DEGs with gene sets corresponding to KEGG pathways. There are ten pathways with an adjusted p-value <0.05, although it is difficult to discern a clear overall biological theme. Both the result table and the volcano plot are interactive. As shown in Figure 14, click a row in the result table to highlight the gene members of the corresponding gene set on the volcano plot, and click a point on the volcano plot to show a box plot summary of the gene expression levels.
The “Enrichment Analysis” results table displays the top pathway results from ORA. Since there is limited space, only a few statistics are displayed, but we can see all of them by clicking the “Save” icon to download a table of the results (Fig. 15).
5. Sometimes, biological patterns are easier to interpret when we analyze the up- and downregulated transcripts separately, as these lists often represent different biological processes. Change the “Query” dropdown in the “Enrichment Analysis” panel on the top left to “Sig. Up” and click “Submit.” In general, we see pathways related to translation and energy generation, including “Oxidative phosphorylation,” “Ribosome,” and “Non-alcoholic fatty liver disease.” Change the “Database” dropdown to “Reactome” and click “Submit.” The significant Reactome results follow the same general themes as the KEGG pathways but are more specific, with more gene sets related to the respiratory electron transport chain and translation.
ExpressAnalyst contains nine different gene set libraries for functional analysis. KEGG stands for Kyoto Encyclopedia of Genes and Genomes (Kanehisa & Goto, 2000). It initially described metabolic pathways in model organisms but has been expanded later to cover many more biological systems for ∼900 eukaryotic species and >7000 bacteria/archaea (Kanehisa et al., 2017). Reactome is a manually curated and peer-reviewed database containing known interactions and reactions between signaling and metabolic molecules (Fabregat et al., 2018). GO stands for Gene Ontology and these GO libraries contain sets of genes related to different biological processes (BP), cellular components (CC), and molecular functions (MF), collectively called GO terms (Ashburner et al., 2000). Many GO terms overlap and contain redundant information. The PANTHER GO-slim contains fewer terms that are minimally overlapping (Mi et al., 2019). The MOTIF library, obtained from the MSigDB, contains genes that share upstream regulatory motifs that theoretically can be regulated by the same transcription factors (Liberzon et al., 2011).
6. Change the “Query” dropdown to “Sig. Down” and the “Database” dropdown to “KEGG” and click “Submit.” Here, the most significant pathways are related to transcript processing, for example “Spliceosome” and “RNA transport.” Change the “Database” to “Reactome” and click “Submit.” Just as with the upregulated transcripts, the Reactome results are related to the same themes as the KEGG pathways but are more specific and detailed.
7. In the overall volcano plot, there is a group of upregulated DEGs with very large log2FC values (Fig. 16A). It is possible they are functionally related to each other. To investigate this hypothesis, click and drag your mouse to form a square around the transcripts highlighted in Figure 16B. This zooms the main plot in to include only the highlighted transcripts (Fig. 16A). Next, change the “Query” to “Current View” and click “Submit” (the database should still be Reactome from the previous step). This performs ORA with only the transcripts visible in the zoomed-in volcano plot area. We see no significant results with this database. Continue exploring other gene set libraries; “PANTHER:CC” returns a significant result for “Integral component of a membrane” with ∼30 hits (the exact set of selected genes may vary depending on the precise selected area). Click the gene set row: we can now see that almost half of the visible points are highlighted (Fig. 16B).
8. Click the “Analysis Overview” link in the top navigation bar to return to the Analysis Overview page. Click “Enrichment Network.” Please be patient as this may take up to a minute to load.
9. By default, ORA is performed with KEGG pathways, and we see that the results in the enrichment analysis panel are the same as for the volcano plot. Double-click a node in the network, such as “Synaptic vesicle cycle,” to reveal each DEG in that pathway. Double-click it again to hide the genes.
By default, the network shows all pathways with a raw p-value <0.05 or the top 10 pathways ranked by their raw p-values, whichever yields more pathways. The size of the node is proportional to the number of DEGs in the pathway. The color is proportional to the p-value using a white-yellow-red color gradient in order of descending p-value; therefore, red nodes are the most significantly enriched. Two nodes are connected if >30% of their DEGs overlap. This structure allows us to see the pathway relationships. We can see there is a group of nine pathways that share many of the same DEGs.
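The edge rule can be sketched as a simple set operation. The exact denominator ExpressAnalyst uses for the overlap percentage is not spelled out here, so the sketch below divides by the smaller gene set as one plausible convention, with made-up DEG lists:

```python
def overlap_fraction(a, b):
    """Fraction of the smaller set's DEGs shared with the other set.

    The precise overlap definition used by ExpressAnalyst is not
    documented here; dividing by the smaller set is one common choice.
    """
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b))

# Hypothetical DEG memberships for three pathways
oxphos = ["Ndufa1", "Ndufb2", "Cox7a1", "Atp5e", "Uqcrb"]
parkinson = ["Ndufa1", "Cox7a1", "Atp5e", "Snca", "Park7", "Uqcrb", "Ndufb2"]
ribosome = ["Rpl39", "Rps12", "Rpl23"]

# Connect two pathway nodes if >30% of their DEGs overlap
edge1 = overlap_fraction(oxphos, parkinson) > 0.3  # shared ETC genes -> edge
edge2 = overlap_fraction(oxphos, ribosome) > 0.3   # no shared genes -> no edge
```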
10. Go to the top toolbar and select “Bipartite Network” from the dropdown next to “View” (Fig. 17). This exposes all DEGs within each pathway. Select “Fruchterman-Reingold” from the “Layout” dropdown. Most of the network structure is visible; however, there is one section of the network where many pathway nodes are covering each other. Click and drag some of these nodes to spread them out until you can see more of the individual pathways.
Genes are represented by the smaller nodes. Genes with positive log2FC are colored red and those with a negative log2FC are colored blue. The intensity/darkness of the color corresponds to the magnitude of the log2FC. From this, we can identify some groups of pathways that contain mostly upregulated vs mostly downregulated genes (Fig. 17).
11. Images with light backgrounds are more suitable for manuscripts and publications. Select “White” from the dropdown next to “Background.” Select “PNG Image” from the dropdown next to “Download” and click “Save” in the dialog that pops up to generate and download a high-resolution version of the network suitable for publication (Fig. 18).
12. Click the “Analysis Overview” link in the top navigation bar to return to the Analysis Overview page. Click “Ridgeline Chart.” Please be patient as this may take up to a minute to load.
13. By default, ORA has been performed with KEGG pathways and the results are displayed in the “Enrichment Analysis” panel. The ridgeline diagram orders the pathways by their p-values, with the most significant at the top. The color is also related to the p-value, with darker colors corresponding to more significant p-values. The density plots show the distribution of DEGs in that pathway based on their logFC values, allowing users to see which pathways are overall up or downregulated. Each DEG is represented by a point along the baseline of each density plot. Hover your mouse over a pathway distribution to see the full pathway name, the specific p-value, and a list of all DEGs. Click a DEG point to see its expression profile.
The “Settings” in the top-left corner allow users to manipulate the appearance of the ridgeline diagram. All pathways with an enrichment p-value less than the “Raw Pval cutoff” are displayed on the plot, up to the “Top pathway No.” The ridge height and color can be adjusted according to preference, and the shape of the gene markers can be changed. The “Sort by” option determines whether pathway ridges are ordered by p-value or median log2FC.
14. In the “Enrichment Analysis” panel, change the “Type” to “GSEA,” change the “Rank” to “T-statistic” and click “Submit.” In the “Settings” panel, change “Sort by” to “Fold-change” and click “Update.”
Since GSEA calculates enrichment scores using the full set of genes, the pathway ridges are based on all genes in the pathway, not just DEGs. This gives a better sense of whether the pathway as a whole is generally upregulated or downregulated. Here, we see that some significant pathways have clearly increased or decreased expression, while a few pathways have generally unchanged expression (logFC evenly distributed around 0) except for a few downregulated genes (Fig. 19).
As discussed in the introduction to this protocol, GSEA analyzes lists for concentrated occurrences of pathway genes, particularly at the extreme ends of the list, by keeping a running enrichment score (ES) as it proceeds down the list. The algorithm reports the p-value associated with the maximum magnitude reached by the ES for every pathway. A positive ES indicates that pathway occurrence was enriched at the beginning of the list; negative scores indicate enrichment at the end.
Users can choose either the fold-change or the t-statistic from the differential expression analysis to rank genes prior to GSEA. For both methods, genes are ranked from high to low. A positive ES therefore means that the pathway was primarily upregulated, while a negative ES indicates downregulation.
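The running-score idea can be sketched with a simplified, equal-weight version of the statistic (the published GSEA method weights hits by the ranking metric, which this sketch omits):

```python
def running_es(ranked_genes, gene_set):
    """Unweighted running enrichment score: increment at each pathway gene
    ("hit"), decrement at each non-pathway gene ("miss"), and report the
    extreme value reached."""
    hits = set(gene_set)
    n_hit = len(hits & set(ranked_genes))
    n_miss = len(ranked_genes) - n_hit
    score, path = 0.0, []
    for g in ranked_genes:
        score += 1.0 / n_hit if g in hits else -1.0 / n_miss
        path.append(score)
    return max(path, key=abs)  # ES = extreme of the running score

ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
es_top = running_es(ranked, ["g1", "g2", "g3"])      # concentrated at top
es_bottom = running_es(ranked, ["g8", "g9", "g10"])  # concentrated at bottom
```

A gene set concentrated at the top of the ranked list reaches its extreme running score early and gets a positive ES; one concentrated at the bottom dips to a negative extreme.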
15. Click some of the gene markers in the second half of the plot that are noticeably more downregulated than the rest to investigate the expression patterns across individual samples (Fig. 19).
The plots show considerable variability within treatment groups that cannot be explained by sex. The lowest expression values for Gapdh and Cox7a1 are very low, indicating that these values may approach the detection limit for these genes. The few highly downregulated genes may tempt us to interpret these pathways as downregulated; however, upon viewing the actual data, this interpretation does not seem to be robust. This underscores the importance of visualizing individual data points wherever possible.
16. Click the “Analysis Overview” link on the top navigation bar to go back to the overview page.
Basic Protocol 4: HIERARCHICAL CLUSTERING ANALYSIS OF TRANSCRIPTOMICS DATA USING INTERACTIVE HEATMAPS
Heatmap visualization combined with unsupervised clustering is a powerful strategy for discovering and exploring gene expression patterns in transcriptomics datasets. Heatmaps display every datapoint as a colored cell, which can reveal expression patterns across different experimental factors. We can visually assess whether a DEG truly has a relatively uniform difference across all samples in each group, or if the difference was driven by a more heterogeneous response. Hierarchical clustering analysis provides an even more informative view of the data by grouping samples and genes based on their similarity to identify inherent expression patterns that are not necessarily related to their main metadata labels (D'haeseleer, 2005). The heatmap tool in ExpressAnalyst allows users to interactively select clusters of genes and perform functional analysis to help interpret any observed patterns of interest.
Necessary Resources
Hardware
- A computer with internet access
Software
- An up-to-date web browser such as Google Chrome, Mozilla Firefox, or Safari, with JavaScript enabled (see Internet Resources)
Files
- None
1. This protocol requires Basic Protocols 1 and 2 for data processing and initial statistical analysis. If you have not yet completed Basic Protocols 1 and 2, please do so first. Make sure you are on the “Analysis Overview” page to begin this protocol. Find the heatmap icon and click “ORA.” Please be patient as the heatmap may take up to a minute to load.
The heatmap has different interfaces for ORA and GSEA functional analysis. The ORA heatmap emphasizes pattern discovery within a heatmap of expression values, followed by targeted functional analysis. The GSEA heatmap first performs functional analysis, and then allows users to view heatmaps for individual pathways. Since Basic Protocol 4 emphasizes exploratory analysis, we will only cover the ORA heatmap here. The GSEA heatmap will be introduced in Basic Protocol 5.
2. A dialogue should pop up, warning us that due to the high number of DEGs, only the top 3000 are visualized to ensure high performance of the interactive tools. Click “OK.”
The ORA heatmap tool has four main components: 1) a toolbar for manipulating the heatmap along the top, 2) an overview of all genes on the left panel, 3) a focus view showing genes of interest (defaults to DEGs) in the center, and 4) an enrichment analysis panel on the right (Fig. 20). The “Overview” heatmap is interactive. Clicking and dragging on a section of the “Overview” will put those genes in the “Focus View.”
3. By default, the genes (rows) are sorted by p-value, and the samples (columns) are in the same order that they were uploaded. The top toolbar gives options for organizing the samples and genes either by metadata (samples), p-value (genes), or different hierarchical clustering methods. Locate the “Cluster Samples” option and change its value to “Treatment.” This re-arranges the samples in both the “Overview” and in the “Focus View.”
Both the “Overview” and “Focus View” heatmaps have a metadata section at the top. Discrete metadata variables are given randomly assigned colors (“Sex,” “Treatment”) and continuous metadata are colored from blue (low) to red (high) according to their values. The main heatmap cells representing gene expression values are also colored from blue to red according to the value. Note that users can use the “Color” option on the top toolbar to try different color gradients.
Both the continuous metadata and gene expression data are scaled to z-scores so that each row has a mean of 0 and a standard deviation of 1. Without scaling, the heatmap would mainly show differences between genes with higher and lower expression. It would be very difficult to distinguish patterns across samples for the same gene, which is generally what we are interested in.
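Row-wise z-score scaling can be expressed in a few lines of numpy. A toy matrix is used here; whether the tool divides by the population or sample standard deviation is an implementation detail not specified above.

```python
import numpy as np

# Toy expression matrix: rows = genes, columns = samples.
# The first gene is highly expressed, the second lowly expressed.
expr = np.array([[100.0, 120.0, 80.0, 110.0],
                 [5.0, 6.0, 4.0, 5.5]])

# Row-wise z-score: subtract each gene's mean and divide by its SD,
# so color reflects within-gene patterns rather than absolute abundance.
z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
```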
4. Change “Cluster Samples” to “plasma_TG.” Now the samples have been re-arranged again. Look at the “plasma_TG” row in the metadata section of the Focus View and notice how the “plasma_TG” cells are now arranged from dark blue to dark red (ascending order).
5. We can also use hierarchical clustering to arrange the samples. Change “Cluster samples” to “Average linkage.”
Figure 21 shows the main concepts behind the hierarchical clustering algorithm. “Ward's method,” “Average linkage,” “Single linkage,” and “Complete linkage” are all different methods for measuring distances between two clusters (Fig. 21B). There is no best approach; we encourage users to cluster their data with multiple methods. Truly distinct clusters should emerge consistently.
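The advice to try multiple linkage methods can be followed programmatically with scipy. The sketch below clusters two clearly separated groups of simulated sample profiles with four linkage methods; because the groups are truly distinct, every method recovers the same two clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)

# Two well-separated toy groups of sample profiles (5 samples each,
# 10 genes per sample); values are simulated for illustration.
group_a = rng.normal(0.0, 0.3, size=(5, 10))
group_b = rng.normal(4.0, 0.3, size=(5, 10))
profiles = np.vstack([group_a, group_b])

# Truly distinct clusters should emerge under every linkage method.
labels = {}
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(profiles, method=method)
    labels[method] = fcluster(Z, t=2, criterion="maxclust")
```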
Even though we do not explicitly define clusters, this process visually reveals groups of related datapoints as blocks within the heatmap. It is also useful for understanding patterns of variability within the dataset. By synchronizing the metadata with the samples after they have been re-ordered, we can see whether some sample clusters correspond to meaningful groups described by any metadata variables. For example, after clustering samples by “Average linkage,” we can see that samples are perfectly separated by “Treatment” and then, within treatment groups, by “Sex.” This is consistent with the patterns we saw in the PCA plot in Basic Protocol 1 (Fig. 4), where PC1 (28% of variability) was related to “Sex” while PC2 (11% of variability) was related to “Treatment.”
6. Now, we will cluster genes. Change “Cluster features” to “Average linkage.”
We can cluster the samples (columns) and genes (rows) independently because the distance between two columns does not depend on the order of the rows, and vice versa. Now we can see very distinct patterns in the data.
7. Scroll through the “Overview” and notice the different sections with distinct patterns (Fig. 22A and 22B). The order of the rows is no longer related to the DEA results, hence interesting clusters of genes can be present anywhere in the “Overview” heatmap.
The major clusters produced by the sample “Average Linkage” clustering method overlap perfectly with “Treatment” (left side = purple = BPA; right side = green = Control), hence portions of the heatmap that have all blue on the left and red on the right are genes that are downregulated after BPA exposure. Conversely, portions that have red on the left and blue on the right were upregulated after BPA exposure.
8. Scroll to the bottom of the heatmap and find the section that matches Figure 22B (red on left; blue on right). Click and drag your mouse over the most distinct part of this section to view it in the Focus View (Fig. 23). Users can find all the gene names displayed in the current Focus View by clicking the “Show Feature Names” tab at the bottom right of the page.
9. Go to the “Enrichment Analysis” pane, leave the defaults (Query: “Features in Focus View,” Database: “KEGG”), and click “Submit” (Fig. 23).
Selecting genes in this way is imprecise: we are visually identifying the boundaries of clusters, and it may not be possible to select all the genes with a given pattern due to the limited screen resolution. However, biological themes or functions are defined by group behaviors and can generally tolerate imprecision at the level of individual genes.
In this step, we have performed the exact same type of ORA functional analysis as in Basic Protocol 3, except that instead of performing the analysis on the list of DEGs, we are performing it on the list of genes in the cluster(s) as shown in the Focus View. For more details on ORA, please see Basic Protocol 3.
10. Hover your mouse over the left border of the “Enrichment Analysis” pane until it turns into an arrow pointing left. Click and drag to the left to expand the width of the “Enrichment Analysis” pane. Next, move your mouse inside of the ORA results table and find the right border of the pathway name column. Click and drag to the right until you can see the entire pathway names.
11. Double-click the top row in the results table (“Oxidative phosphorylation”) to view only those genes in the “Focus View” (Fig. 23). When you are done viewing, click the “Reset” button inside of the “Enrichment Analysis” pane to again view all the selected genes in the “Focus View.”
12. Click the “Show Feature Names” header in the right-hand panel to view the list of genes displayed in the focus view.
This feature is included to allow you to reproduce your functional analysis results. This list of genes can be saved along with the table of significant pathways so that it is clear how the results were generated. You can also paste a list of genes under the “Define Custom Signatures” header to select exactly the genes displayed in the Focus view. The combination of these two features should allow you to reproduce all heatmap results.
13. Scroll through the “Overview” to find sections with four distinct columns. It appears that these genes are related to both “Treatment” and “Sex.” Re-ordering the samples according to “Sex” may make this easier to see. In the Focus View, double-click the metadata row corresponding to “Sex” to reorder the samples. Alternatively, use the top toolbar and set “Cluster samples” to “Sex.”
The row or gene order stays the same as before, but the samples are re-ordered so that they correspond to sex (left side = female = red; right side = male = blue).
14. Notice how the sections that showed uniform up or downregulation across both sexes (Fig. 22A and 22B) now look like four alternating bands (Fig. 22C and 22D). We will skip over these sections. Instead, look for the sections that look like Figure 22E and 22F. The pattern in Figure 22E shows genes that are downregulated after exposure to BPA in both female and male mice. Figure 22F shows genes that are upregulated after exposure in both sexes.
These sections look different than the ones in Fig. 22, panels A to D, because the baseline control expression values were different in male and female mice. The genes are regulated in the same direction and by about the same magnitude, they just started at different levels.
15. Select the section of genes matching the pattern in Figure 22E and perform functional analysis with several different databases.
The significant results are related to protein digestion and absorption (KEGG), extracellular matrix (GO:CC), and extracellular space (PANTHER:CC). These pathways and gene sets are overrepresented in genes that are downregulated after BPA exposure and have generally higher expression in female mice.
16. Select the section of genes matching the pattern in Figure 22F and perform functional analysis with several different databases.
The significant results are related to protein processing (KEGG), proteolysis (PANTHER:BP), and cytoplasm and mitochondria (PANTHER:CC). These pathways and gene sets are overrepresented in genes that are upregulated after BPA exposure and have generally higher expression in male mice.
17. Go to the navigation bar at the very top of the page (above the Heatmap Toolbar) and click the “Download” link to view the results from Basic Protocols 1 to 4 (Fig. 24).
Basic Protocol 5: CROSS-SPECIES GENE EXPRESSION ANALYSIS BASED ON ORTHOLOG MAPPING RESULTS
RNA-seq analysis for non-model species can be challenging due to the lack of high-quality and well-annotated reference genomes. The Seq2Fun algorithm solves this problem by aligning short reads to a large database of protein sequences from over 600 species that have been organized into ortholog groups (EcoOmicsDB). Functional information is shared across all species to give each ortholog group a gene symbol, gene/protein description, and KEGG pathway and GO term annotations. RNA-seq reads from any species can be mapped to the same database and therefore to the same set of Seq2Fun IDs. This framework greatly simplifies comparisons of cross-species transcriptomic results compared to using reference genomes, where cross-species orthologs must first be identified using BLAST. Read quantification with Seq2Fun and EcoOmicsDB is covered in Basic Protocol 11. This protocol covers statistical and functional analysis of a cross-species Seq2Fun ortholog count table. The overall workflow uses many concepts from Basic Protocols 1 to 3. Here, we focus on aspects particular to cross-species analysis with Seq2Fun results. Please refer to Basic Protocols 1 to 3 for more details on general RNA-seq data processing, statistical analysis, and functional interpretation.
The data analyzed in this protocol are from a study that aimed to identify transcriptomic signatures of tissue regeneration that are conserved across salamander species, which are able to regrow lost limbs (Dwaraka et al., 2019). A single limb was amputated from three different species of salamanders (A. andersoni = AND, A. maculatum = MAC, A. mexicanum = MEX). A small tissue sample was collected at the time of amputation (time0) and after 24 hr (time24). There were three replicates in each species-by-time group, resulting in 18 total samples. One sample that appeared to be a technical outlier was removed. In this analysis, we identify genes that are consistently differentially expressed between time24 and time0 across all three species.
Necessary Resources
Hardware
- A computer with internet access
Software
- An up-to-date web browser such as Google Chrome, Mozilla Firefox, or Safari, with JavaScript enabled (see Internet Resources)
Files
- None
1. Go to the ExpressAnalyst home page and click “Start Here” to view the module overview. Find the “Statistical & Functional Analysis” section and click the “Start Here” button underneath the single gene expression table input type.
2. Click the “Try Examples” button in the bottom left of the page. Click the “Non-model organisms” link to open the data that we will analyze in a new tab.
ExpressAnalyst accepts metadata in either a separate table (as in Basic Protocol 1) or embedded in the omics table. The format for embedding metadata is shown in Figure 25. Rows with metadata labels should be directly under the top row that contains the sample IDs. Each row should start with “#CLASS:varName” in the first cell, where “varName” is replaced with the actual metadata variable name.
3. Select the radio button next to “Non-model organisms” and click “Submit.” When the data finishes loading, click “Proceed.”
4. As in Basic Protocol 1, we can see from the boxplot and density plot that the data are un-normalized RNA-seq counts. Click “Proceed” to go to the “Data Filtering & Normalization” page.
The general workflows for data processing, statistical analysis, and functional analysis of an RNA-seq count table were covered in detail in Basic Protocols 1 to 3. This protocol will only focus on details specific to Seq2Fun results and cross-species analysis. Please refer to Basic Protocols 1 to 3 for more details on each page.
5. Leave the defaults (low abundance, 4; variance filter, 15; normalization, “Trimmed Mean of M-values”) and click “Submit.”
From the boxplots and density plot, we can see that the data are now normalized on a log scale. The samples show clear separation according to both time and species in the PCA plot (Fig. 26). The time groups appear to partly overlap until species is taken into account: within each species, samples are consistently separated along PC2, with time24 samples having higher PC2 values than time0 samples. The relative separation of species along PC1 is consistent with their taxonomic relationship: MEX and AND samples are closer together, and their species were estimated to diverge ∼4.3 million years ago, compared to ∼21.5 million years for MAC (Hedges et al., 2015). This plot shows one advantage of using Seq2Fun for cross-species comparisons, as the data can be directly integrated across species and visualized together.
6. For DEA, we are interested in identifying a transcriptomic signature of tissue regeneration that is conserved across species. This dataset has only two categorical metadata variables, making it a good fit for the “Simple Metadata” tab. Stay on the “Simple Metadata” tab, leave “Limma” selected, leave “Time” as the “Primary Factor,” and select “Species” as the “Secondary Factor.” Check the box next to “This is a blocking factor.” Leave “Specific Comparison” selected but change the order in the dropdown to “time24” first and “time0” second (Fig. 27). Click “Submit” and then “Proceed.”
When we initially select “Species” as the secondary factor, the options in the dropdown become more complicated; for example, the default options for “Specific comparison” change from “time0” and “time24” to “time0_MEX” and “time0_AND.” Once we check “This is a blocking factor,” we tell ExpressAnalyst that “Species” should be considered a random effect instead of a fixed effect, and the options simplify back to “time0” and “time24.” A random effect adds a hierarchical structure to our linear model. Instead of fitting a model with two terms (time and species), we fit a model with one term (time) separately to the data points from each species. Since the random effect specifies this structure, “Species” cannot be used to construct specific comparisons and thus disappears from the contrast options.
For the specific comparison, the fold-changes are computed as First Group/Second Group. Thus, we change the order so that in the results, DEGs with a positive log2FC correspond to increased expression in time24 compared to time0. This has a more intuitive interpretation.
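To make the blocking logic and fold-change direction concrete, here is a minimal numerical sketch for a single gene, using invented expression values (limma's actual implementation estimates a shared intra-block correlation rather than simply averaging per-species estimates):

```python
import numpy as np

# Toy linear-scale expression values for one gene (hypothetical numbers):
# three species, each with time0 and time24 replicates.
expr = {
    "AND": {"time0": [10.0, 12.0, 11.0], "time24": [40.0, 44.0, 42.0]},
    "MAC": {"time0": [20.0, 22.0, 21.0], "time24": [78.0, 82.0, 80.0]},
    "MEX": {"time0": [15.0, 14.0, 16.0], "time24": [60.0, 59.0, 61.0]},
}

# Estimate the "time" effect separately within each species (the blocking
# idea), then pool the per-species estimates. Order matters: time24 is the
# first group, so a positive log2FC means higher expression at time24.
per_species_log2fc = [
    np.log2(np.mean(groups["time24"]) / np.mean(groups["time0"]))
    for groups in expr.values()
]
pooled_log2fc = np.mean(per_species_log2fc)
print(round(pooled_log2fc, 2))  # 1.95 (expression roughly quadruples by 24 hr)
```

Reversing the group order would simply flip the sign of every log2FC, which is why the ordering chosen above gives the more intuitive interpretation.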
7. There are 2061 DEGs (adjusted p-value <0.05). View the expression patterns for each of the top two genes by clicking the image icon in the “Graphical Summary” column (Fig. 28). The plots clearly show that the top genes have a relationship with time that is conserved across species.
8. Click the “EODB” link for the most significant gene (PLEK2) to open the Seq2Fun ortholog profile in EcoOmicsDB.
EcoOmicsDB has a profile for each Seq2Fun ID (Fig. 29A) to provide species-specific details and to show the taxonomic groups represented in the ortholog group (Liu et al., 2023). Each ortholog group is a collection of protein sequences that were grouped based on their similarity. The table (Fig. 29C) shows the species-specific protein/gene information, including the Entrez ID, gene symbol, gene description, KEGG pathways, and GO terms.
The overall, ortholog-level symbol, description, and functional annotations displayed in the profile header were derived by considering all the species-specific information and choosing the most frequent occurrences. The word cloud at the top of the page was generated from the list of symbols and descriptions from the species-specific information: the more frequent words and phrases appear larger in the image.
Finally, the species sub-group hits plot (Fig. 29B) shows the percentage of each taxonomic group that has an entry in the queried ortholog group. PLEK2 is robustly defined for vertebrate species. Other orthologs are more narrowly or broadly defined, for example with only fish species contributing sequences. Knowing this information can help interpret results, especially when doing cross-species comparisons. A robust profile like PLEK2 that is represented by many species and has close to a 1:1 gene-to-species ratio gives us confidence that we have quantified a “real” transcriptomic feature.
9. Ensure that you are on the “Select Significant Features” page in ExpressAnalyst. Click “Proceed” to go to the “Analysis Overview” and then click “Volcano Plot” to perform functional analysis.
10. Go to the “Enrichment Analysis” panel and change the query to “Sig. Up,” the database to “GO:BP” and click “Submit.” This will perform ORA on the list of DEGs with positive log2FCs. After examining the results, change the query to “Sig. Down” and click “Submit” again to perform ORA on the list of DEGs with negative log2FCs (Fig. 30).
One of the main strengths of using the Seq2Fun framework vs. using de novo/newly assembled transcriptomes is comprehensive functional annotation (Liu et al., 2021). Because information is aggregated across over 600 species, the KEGG and GO term annotations are of high quality and high density. This analysis shows that genes related to energy generation and muscle development are downregulated, while genes related to more fundamental wound healing and tissue regeneration are upregulated. These themes were also reported in the original publication, which used a combination of de novo and official transcriptomes to conduct the analysis.
11. Click the “Analysis Overview” at the top navigation bar, and then click “Dimension Reduction.” Please be patient as it may take up to a minute to load.
12. The “Settings” panel in the top left of the screen allows users to adjust the appearance of the 3D scatterplot. Change the background to white by clicking the box next to “Background” (#1 in Fig. 31A). Click the box next to “Floor” and change it to a lighter gray. Select the checkbox next to the box (#2 in Fig. 31A). Select the checkbox next to “Shadow” (#3 in Fig. 31A).
13. By default, the points are colored according to their “Time” annotation. We want to distinguish points based on “Species” annotation. First, go to the “Overview” panel and change the “Select meta-data” dropdown from “Time” to “Species” (#4 in Fig. 31A) and click “Update.” Go to the vertical toolbar inside of the plot view and click the icon that looks like an oval around some points (#5 in Fig. 31A). In the dialog, make sure that the “Selected meta-data” is “Species” and click “Submit.” Go back to the “Selected meta-data” dropdown in the “Overview” panel and change it back to “Time” and click “Update.”
14. Go to the vertical toolbar and click the icon that looks like two arrows pointing to each other (#6 in Fig. 31A) to change the view from the scores plot to the loading plot (Fig. 31B). If the ellipsoids are still present in the loading plot, click the icon that looks like an oval around some points and click “Remove” in the pop-up dialog to remove them. Click one of the points on the extreme range of the sphere to display the feature plot (Fig. 31C).
Each set of principal component scores is calculated as a linear combination of all gene expression values. For example, PC1 = a*gene1 + b*gene2 + … + x*geneN. The coefficients (i.e., a, b, …, x) are called loadings. Examining the PC loadings can help us interpret the PCA scores plot. The genes with greater magnitude loadings are the genes that contribute more to the patterns that we see in the sample scores. For example, the samples appear to be separated according to time point along PC2 (Comp.2, Fig. 31A). Therefore, we expect that the genes with high-magnitude loadings will have different expression values between the two time points. The loadings plot (Fig. 31B) shows the PC1, PC2, and PC3 loadings for each gene. We can interpret the position of each gene as “pulling” the samples in that direction. The genes on the outer limits of the sphere of points are the ones with the greatest magnitude loading scores.
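The relationship between scores and loadings can be sketched with NumPy on toy data (an illustration only; ExpressAnalyst's own implementation may differ in details such as scaling):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy matrix: 18 samples x 50 genes (arbitrary placeholder values).
X = rng.normal(size=(18, 50))
Xc = X - X.mean(axis=0)          # center each gene

# SVD-based PCA: the columns of Vt.T are the loading vectors.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T               # sample scores
loadings = Vt.T                  # gene loadings (genes x components)

# Each score is a linear combination of gene values weighted by loadings:
# PC1 score of sample i = sum over genes j of Xc[i, j] * loadings[j, 0]
assert np.allclose(scores[:, 0], Xc @ loadings[:, 0])

# Genes with the largest |loading| contribute most to that component,
# i.e., they "pull" samples furthest along it.
top_gene_pc2 = np.argmax(np.abs(loadings[:, 1]))
print(top_gene_pc2)
```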
15. Navigate back in your browser to return to the “Analysis Overview,” find the heatmap and click “GSEA.” Leave “Fold-change” and “Multi-level (recommended)” selected in the GSEA Parameter Setting dialog and click “Proceed.” Please be patient as it may take up to a minute to load.
The bottom left “Enriched Pathways” panel allows users to re-run the GSEA analysis with different functional libraries, and then displays the results sorted based on the p-values. Expression values from the selected pathways in the “Enriched Pathways” panel are displayed in a heatmap in the center “Expression patterns” area. The GSEA enrichment plot for this pathway is displayed in the top left “Enrichment Plot” panel. The top horizontal toolbar can be used to style the heatmap appearance.
16. By default, GSEA has been conducted with KEGG pathways (Fig. 32). The most significant pathway is selected. Go to the top Heatmap Toolbar and set “Cluster features” to “Ward's method” to cluster genes based on the similarity of their expression profiles. Select different pathways to see their expression profiles.
The significant pathways identified are consistent with the ORA functional results from the volcano plot analysis. GSEA results with positive enrichment scores reflect pathways that are enriched in genes with positive log2FCs and are mainly related to the immune system. Conversely, results with negative enrichment scores are pathways that are enriched in genes with negative log2FCs and are mainly related to energy generation and muscle cell types.
17. Click the checkbox next to “TNF signaling pathway” to view this pathway.
The “TNF signaling pathway” has a positive enrichment score, meaning that it is enriched in genes with positive log2FCs. From the heatmap, we see that this is indeed the case with predominantly red cells on the right side (time24) and predominantly blue cells on the left side (time0).
The “Enrichment Plot” shows how the GSEA enrichment score was calculated. Genes were sorted from high log2FC to low, from left to right. The vertical black lines along the x-axis show the position of each pathway gene. Lines closer to the left indicate genes with higher log2FC values, while those closer to the right indicate genes with lower log2FC values. The GSEA algorithm moves from left to right along the list of ranked genes. Each time it encounters a pathway gene (a “hit”), the running score (green line) increases; each time it encounters a gene that is not in the pathway (a “miss”), the running score decreases. A single “hit” changes the running score more than a single “miss.” The returned enrichment score is the furthest deviation of the running score from zero, indicated by the dashed red line.
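A minimal sketch of the running-score calculation, assuming the classic unweighted form (the actual GSEA implementation additionally weights each hit by the magnitude of its ranking statistic):

```python
def enrichment_score(ranked_genes, pathway):
    """Unweighted running-sum enrichment score, as sketched above.

    ranked_genes: genes sorted from highest to lowest log2FC.
    pathway: set of pathway member genes.
    """
    pathway = set(pathway)
    n = len(ranked_genes)
    n_hit = len(pathway & set(ranked_genes))
    hit_step = 1.0 / n_hit            # a "hit" moves the score more...
    miss_step = 1.0 / (n - n_hit)     # ...than a "miss" moves it down
    running, extreme = 0.0, 0.0
    for gene in ranked_genes:
        running += hit_step if gene in pathway else -miss_step
        if abs(running) > abs(extreme):
            extreme = running         # furthest deviation from zero
    return extreme

# Toy example: pathway genes concentrated at the top of the ranking
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
print(enrichment_score(ranked, {"g1", "g2", "g3"}))  # 1.0 (strongly positive)
```

A pathway whose genes cluster at the bottom of the ranking would instead produce a negative score, matching the interpretation of the enrichment scores described above.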
18. Go to the navigation bar at the very top of the page and click the “Downloads” link to view and download the results.
This protocol demonstrates how powerful ExpressAnalyst is for analyzing RNA-seq data from non-model species. By leveraging the functional annotations of hundreds of species, EcoOmicsDB and Seq2Fun enable immediate investigation of the biological processes perturbed in species with no reference transcriptome (Liu et al., 2023).
Basic Protocol 6: PROTEOMICS AND MICROARRAY DATA PROCESSING AND NORMALIZATION
Basic Protocols 1 to 4 covered the steps for a standard RNA-seq data analysis workflow. Proteomics and microarray data both consist of intensities rather than counts and should be processed differently. However, once the intensity data are properly normalized, the statistical and functional analysis pipelines introduced for RNA-seq data can be used to analyze proteomics and microarray data.
Even though proteomics and microarray data are both intensity values, they differ in some key respects. Proteomics data generated using liquid chromatography-mass spectrometry (LC-MS) tend to have more missing values than microarray data, and the workflows usually include an imputation step to avoid filtering out a large number of features (Lazar et al., 2016). The dynamic range of LC-MS is also greater than that of microarray data. For these reasons, specialized normalization methods have been developed for each data type. In this protocol, we will briefly go through data upload, annotation, filtering, and normalization for a proteomics and a microarray dataset. After these steps, the methods in Basic Protocols 2 to 4 can be used to perform statistical and functional analysis.
Necessary Resources
Hardware
- A computer with internet access
Software
- An up-to-date web browser such as Google Chrome, Mozilla Firefox, or Safari, with JavaScript enabled (see Internet Resources)
Files
- None
1. Go to the ExpressAnalyst home page (https://www.expressanalyst.ca) and click the “Tutorials” link at the top menu bar to visit the tutorial page. Scroll to the bottom of the page and find the “Dataset for the ExpressAnalyst Current Protocol” data section. Download the two text files “microarray_example.csv” and “proteomics_example.csv” from the download link under “Basic Protocol 6 (microarray and proteomics).” These data were previously published in two different studies (Su et al., 2002; Wigger et al., 2021).
2. Go to the ExpressAnalyst home page and click “Start Here” to view the module overview. Find the “Statistical & Functional Analysis” section and click the “Start Here” button underneath the single table input type.
3. Start with the microarray example data and select “human” for organism, keep the analysis type as “Differential Expression,” select “Intensities (microarray)” for the data type, and select “Affymetrix Human Genome U95 (chip hgu95a)” for the ID type. Select the “microarray_example.csv” file, check the “Metadata included” box, click “Submit” and then “Proceed.”
The microarray data type has many more options compared to RNA-seq or proteomics data because each chip version has its own annotation system. ExpressAnalyst supports commonly used microarray formats; however, there are many custom arrays for which we do not have annotation files. If your microarray is not found in the list of ID type options, you must convert the microarray-specific IDs into a common format, such as Entrez, RefSeq, Ensembl, or official gene symbol, before uploading.
4. Examine the “Omics data overview” and the diagnostic plots at the bottom of the page. When you are done, click “Proceed.”
We can see that ∼93.3% of the probe IDs are matched. The Entrez ID system is updated frequently by NCBI as higher-quality genome assemblies and annotations become available. Over time, some IDs are discontinued; hence, we expect a lower percentage of older microarray probe sets to map to the current Entrez ID system each year. The boxplot and density plot indicate that the data are not normalized, as the expression values are highly right-skewed and go up to 10,000.
5. Leave the filtering settings at the defaults. Please refer to Basic Protocol 1, step 9, for an explanation of the filtering step; the principles are the same for microarray and RNA-seq data. There are five different normalization options. Select “None” and click “Submit” to see what the data look like without any normalization.
On a microarray, each targeted transcript can have multiple probes. Information must be summarized across probes using custom software developed specifically for that microarray format before any statistical analysis. These microarray-specific methods may apply some normalization or transformation methods, and it is important that you understand what was done to your dataset. In general, we do not want to perform the same normalization twice. Here, we can see that the data are not on a log scale.
6. Select “Quantile Normalization” and click “Submit.”
Quantile normalization assumes that all samples have the same distribution regardless of sample class (Bolstad et al., 2003). The idea is motivated by the quantile-quantile plot: two data vectors have the same distribution if their quantile-quantile plot is a straight diagonal line. First, the algorithm re-orders all expression values from low to high within each sample, while remembering the original identity and location of each value. Then, each individual value is replaced by the median of its row. Finally, the normalized values are returned to their original locations. This enforces an identical distribution across all samples; the remaining differences between samples lie in the rank order of their genes.
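These steps can be sketched in a few lines (following the median variant described above; some implementations use the row mean instead):

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize the columns (samples) of X.

    Each value is replaced by the median of all values sharing its rank
    across samples, then returned to its original position.
    """
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank within each sample
    sorted_X = np.sort(X, axis=0)                      # step 1: re-order values
    ref = np.median(sorted_X, axis=1)                  # step 2: row medians
    return ref[ranks]                                  # step 3: restore positions

# Three samples with different scales end up with identical distributions
X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
Xn = quantile_normalize(X)
print(np.sort(Xn, axis=0))  # every column is now the same reference distribution
```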
In Figure 33A, you can see that the distribution is the same in the boxplot, however the data are not on a log scale. Quantile normalization typically needs to be combined with a log-transformation prior to statistical analysis. We include the option here on its own because sometimes data are already log-transformed prior to uploading to ExpressAnalyst.
7. Select “Log2 Transformation” and click “Submit.”
This method is a simple log2 transformation of all expression values. The boxplot in Figure 33B shows that the data are now on a different scale compared to Figure 33A, and there are some differences in distributions across samples (the boxplots are not identical).
8. Select “Variance Stabilizing Transformation (VSN)” and click “Submit.”
As we discussed in Basic Protocol 1, transcriptomics data typically show a mean-variance trend: after filtering out features close to the detection limit, variances are higher at lower expression values. While most differential expression algorithms directly model and account for this trend, sometimes we still wish to adjust the data to remove it, for example if we are going to use the data for hierarchical clustering.
VSN is a well-established statistical method that can stabilize the variance of microarray data across the full range of the data and make the data more symmetric (Huber et al., 2002). The mean-variance plot in Figure 33C shows that the clear mean-variance trend visible in Figure 33B has been removed, as the fitted line is now nearly flat.
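Conceptually, the transformation at the heart of VSN is a generalized log (arsinh-type) function that behaves like an ordinary log at high intensities but remains smooth and finite near zero. A minimal sketch with a fixed calibration constant (VSN estimates its calibration parameters from the data, which is omitted here):

```python
import numpy as np

def glog2(x, c=1.0):
    """Generalized log (arsinh) transform on a log2 scale.

    Behaves like log2(2x) for large x but stays finite and smooth for
    values near (or even below) zero; c is a calibration constant,
    fixed here purely for illustration.
    """
    return np.arcsinh(x / c) / np.log(2)

x = np.array([0.0, 1.0, 1e4])
print(np.round(glog2(x), 2))  # [ 0.    1.27 14.29]; note log2(2e4) ~ 14.29
```

An ordinary log2 would be undefined at 0, which is why a generalized log is preferred for intensity data that include values near the detection limit.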
9. Select “VSN followed by Quantile Normalization.”
This method is exactly as described: VSN has adjusted for the mean-variance trend, and then quantile normalization was performed on the results. You can see that the distributions are identical, on a log scale, and the mean-variance trend has been removed (Fig. 33D). We could proceed to downstream statistical and functional analysis, but instead we will return to the upload page and process a proteomics dataset.
10. Go back to the ExpressAnalyst home page and click “Start Here” to view the module overview. Find the “Statistical & Functional Analysis” section and click the “Start Here” button that is underneath the single table input type.
11. Select “Human” for the organism, “Intensities (proteomics)” for the data type, leave the analysis type as “Differential Expression,” and set “Official Gene Symbol” for the ID type. Select “proteomics_example.csv” for the omics data file and check “Metadata included.” Click “Submit” and “Proceed.”
12. View the “Omics data overview.”
There are only 2958 annotated features in this dataset. While proteomics coverage has increased in recent years, it is still common to have many fewer features in proteomics compared to transcriptomics datasets. This will be important to remember during the data filtering step.
13. Navigate to the “Missing value estimation” tab. Leave “Remove all features with >50% missing values” checked. Select the last radio button “Estimate missing values using KNN (feature-wise).” Click “Process” and then “Proceed.”
Most downstream statistical analyses do not tolerate missing values. There are four different general strategies for dealing with missing values (Fig. 34). The most conservative approach is to filter out any feature that has a missing value for any sample (Fig. 34A). This tends to lose a lot of information. The other three approaches aim to impute missing values, which means using the other available data points to generate reasonable estimates for the missing values.
Missing values can be either “missing at random,” where the missing values are due to random human or technical error, or “missing not at random,” where there is a consistent, systematic explanation (Lazar et al., 2016). For “missing at random,” we assume that the true values of the missing values will be distributed normally around the central value of that feature, thus it makes sense to replace missing values with the mean or median of that feature (Fig. 34B). Most proteomics missing values are “missing not at random,” where the protein abundance was not measured because it was below the limit of detection. In this case, it makes sense to replace missing values with a low value, for example a simple approach is to use 1/5 of the lowest value of each feature (Fig. 34C).
The methods in Figure 34B and C replace all missing values for the same feature with the same value. This is conceptually straightforward; however, it can cause problems with downstream statistical analysis when there are many missing values for the same experimental group. For example, the p-value for DEA depends on the standard error of each of the groups being compared, where lower standard errors result in more significant p-values. If there are multiple missing values in the same experimental group, the standard error will be artificially low because each missing value is replaced with the exact same value, inflating the significance of the p-values. We can avoid this by using stochastic imputation methods, which use the entire dataset to estimate distributions of reasonable values. The imputed values are drawn randomly from the estimated distribution, avoiding these problems with statistical inference. One downside of these methods is that, because they have a random element, you will get slightly different imputed values each time you analyze your data, resulting in slightly different visualizations and lists of differential features/pathways. For these reasons, it may be preferable to use the methods in Figure 34B and C when there are very few missing values, and the method in Figure 34D when there are more.
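The deterministic strategies in Figure 34B and C can be sketched as follows (a simplified illustration; ExpressAnalyst's implementation may differ in details):

```python
import numpy as np

def impute(X, strategy):
    """Deterministic missing-value imputation, per the discussion above.

    X: features x samples matrix with np.nan marking missing values.
    strategy: "mean" or "median" (for values missing at random), or
    "min5" (1/5 of the feature's minimum, for values assumed missing
    because they fell below the limit of detection).
    """
    X = X.copy()
    for i, row in enumerate(X):
        missing = np.isnan(row)
        if not missing.any():
            continue
        observed = row[~missing]
        if strategy == "mean":
            fill = observed.mean()
        elif strategy == "median":
            fill = np.median(observed)
        else:  # "min5"
            fill = observed.min() / 5.0
        X[i, missing] = fill
    return X

X = np.array([[8.0, np.nan, 10.0, 9.0],
              [np.nan, 2.0, np.nan, 4.0]])
print(impute(X, "min5")[1])  # missing values replaced with 2.0/5 = 0.4
```

Note that both missing entries in the second feature receive the same value (0.4), which is exactly the behavior that can deflate group standard errors when many values are missing from one experimental group.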
14. Since we already have relatively few features (<3000), set both the abundance and variability filters to 0. Select “None” for the normalization method and click “Submit” to see what our unnormalized data look like.
Based on the boxplot and mean-variance plot (Fig. 35A), it appears that the data are already on a log scale and have already undergone some form of VSN normalization. However, the sample-level distributions show some variability; hence, the data could benefit from either “Normalization by median” or “Quantile normalization.”
15. Select “Normalization by median” and click “Submit.”
The VSN and quantile normalization methods are the same as explained for the microarray data above. In “Normalization by Median,” each intensity value is divided by the median value of its sample. This accounts for systematic differences across samples, for example higher or lower total MS signal related to different sample masses. After performing median normalization, we see that the medians of all boxplots are aligned (Fig. 35B). The two regression methods assume that the variability depends on the mean intensity in a linear (linear regression) or non-linear (local regression) fashion, respectively (Välikangas et al., 2018). The results of the regression are used to adjust the data to remove this dependence.
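A minimal sketch of median normalization as described above, on toy values:

```python
import numpy as np

# Toy intensity matrix: features (rows) x samples (columns).
X = np.array([[10.0, 20.0, 5.0],
              [20.0, 40.0, 10.0],
              [30.0, 60.0, 15.0]])

# Divide each sample (column) by its own median so that systematic
# differences in total signal are removed; all sample medians become 1.
Xn = X / np.median(X, axis=0)
print(np.median(Xn, axis=0))  # [1. 1. 1.]
```

In this toy case the three samples differ only by a constant scale factor, so after normalization their columns are identical; real data retain their biological differences while the systematic scale difference is removed.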
16. Click the “Downloads” link on the upper navigation tracker. Stay on the “Files & Scripts” tab and download the “data_normalized.csv” file.
This pipeline used a stochastic missing value imputation method (KNN), which means that each time it is executed, we will get slightly different results. This may be undesirable at times, for example, if you have already prepared figures for a manuscript and just want to go back and check something else in the data. To avoid this, we can download the normalized data and upload this file for all future analyses. If we turn off filtering and select “None” for normalization, we will be able to exactly reproduce the analysis.
Basic Protocol 7: PREPARING MULTIPLE GENE EXPRESSION TABLES FOR META-ANALYSIS
As the cost of collecting transcriptomics data decreases and public data repositories grow, it has become much easier to find multiple datasets testing the same hypothesis. The statistical power of an individual transcriptomics DEA is quite low for most DEGs. In addition, analysis results are sensitive to even moderate outliers, which can greatly increase the number of false positives and false negatives. Meta-analysis can help overcome these issues by integrating results across multiple independent datasets to prioritize genes with consistent evidence for differential expression. In this protocol, we introduce how to use ExpressAnalyst for the initial processing of multiple expression tables in preparation for statistical and functional meta-analysis.
This protocol uses three datasets on helminth infections. Helminths are parasitic worms that impact ∼2 billion people worldwide (Oyesola & Souza, 2022). Here, we analyze three published microarray gene expression datasets of the liver tissue from both infected and control mice (Zhou et al., 2016).
Necessary Resources
Hardware
- A computer with internet access
Software
- An up-to-date web browser such as Google Chrome, Mozilla Firefox, or Safari, with JavaScript enabled (see Internet Resources)
Files
- None
1. Go to the ExpressAnalyst home page and click “Start Here” to access the module overview. Find the “Statistical & Functional Analysis” section and click the “Start Here” button located underneath the multiple gene expression table input type.
2. Click “Try Examples.” Clicking “Yes” would upload and process all three tables automatically; however, we want to demonstrate these steps, so instead download and save each of the files. Open each of the three files and examine its content. All datasets must be from the same species and must be described by the same metadata labels; otherwise, it will not be possible to compare results.
In general, datasets collected for meta-analysis should be as homogeneous as possible. Although it is possible to analyze gene expression tables from very different technologies (e.g., microarray intensities and RNA-seq counts), the differences introduced by the technologies and their associated data processing pipelines could be too large to overcome during meta-analysis.
3. Click the “Data Upload” icon at the top of the page and choose the three files that you just downloaded. Click “Upload” and then “Done!”
4. We first need to perform annotation, missing value estimation, filtering, and normalization for each of the three uploaded datasets. In the “Data annotation” section, set the data value type to “Normalized values,” set the data type to “Microarray data (intensities),” set the organism to mouse, and the ID type to “Entrez ID” (Fig. 36). Click “Submit.”
Users must obtain these details during the process of data collection for meta-analysis. The information is usually available in the data repositories. When in doubt, please contact the original authors or data depositors for clarifications.
5. This dataset has no missing values, so it does not matter what the parameters are for the “Missing values” section. Leave the defaults and click “Submit.”
6. These datasets are already filtered and normalized, so in the “Filtering & normalization” section, set both the variance and abundance filters to 0 and leave “None” for the data transformation. Click “Submit.” Now the data status above the data processing table should have changed from “Incomplete” to “Finished.”
Here, the data are already normalized, so these parameters are easy to choose. In general, we suggest first analyzing each dataset individually as described in Basic Protocols 1 to 3. This will allow you to visualize the data before and after filtering and normalization, and to understand the basic trends individually before trying to compare across multiple datasets. The visualizations will help identify potential technical problems and to choose appropriate processing parameters during the meta-analysis.
7. Find the dropdown menu above the data processing table and change the “Currently selected data” to the second dataset. Notice how this dataset has an “Incomplete” status. Repeat steps 4 to 6 for the second dataset, then select the third dataset and repeat steps 4 to 6 again. Click “Proceed.”
8. A dialog will pop up telling us that the data passed the integrity check. Click “Next” to proceed to the “Data Quality Check” page.
The data quality check has the same components as explained for the single omics analysis (Basic Protocol 1); however, it has been modified to support multiple omics tables. There is a brief summary of the uploaded data, and data from all datasets are included in the plots and metadata table.
9. There is a clear batch effect with a strong pattern of separation according to technical platforms along PC1, so we will use ComBat for batch effect correction (Johnson et al., 2007). Check the box next to “Adjust study batch effect (Combat)” and then click “Update” (Fig. 37).
In Figure 38A, we can see that the Illumina and Affymetrix arrays are separated along PC1, which accounts for 97.5% of the variability in the integrated data. After performing ComBat, we see that the samples now separate by infection status, with controls on the left and infected samples on the right across all three studies (Fig. 38B).
ComBat first filters out any gene that is missing from >80% of the combined samples. Like other batch effect correction methods, it assumes that the true mean and standard deviation of each gene are the same across datasets. A key difference between ComBat and other methods is that it pools information across genes using an empirical Bayes approach to estimate the batch effect corrections, making it more robust for datasets with small sample sizes.
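For intuition, the core location/scale idea behind batch correction can be sketched in a few lines. The helper below is hypothetical and deliberately simplified: it shifts and rescales each batch, per gene, to the pooled mean and standard deviation. ComBat itself (from the sva R package) goes further by shrinking the per-batch estimates toward a common prior with empirical Bayes, which is what makes it robust for small batches.

```python
import numpy as np

def simple_batch_correct(expr, batches):
    """Naive location/scale batch correction (not ComBat itself).

    expr: genes x samples matrix; batches: per-sample batch labels.
    For each gene, each batch is shifted and rescaled to the pooled
    mean and standard deviation across all samples.
    """
    corrected = expr.astype(float).copy()
    grand_mean = expr.mean(axis=1, keepdims=True)
    grand_sd = expr.std(axis=1, keepdims=True)
    for b in np.unique(batches):
        cols = np.asarray(batches) == b
        m = expr[:, cols].mean(axis=1, keepdims=True)
        s = expr[:, cols].std(axis=1, keepdims=True)
        s[s == 0] = 1.0  # avoid dividing by zero for flat genes
        corrected[:, cols] = (expr[:, cols] - m) / s * grand_sd + grand_mean
    return corrected
```

After correction, the per-batch means of each gene coincide, so a platform offset like the Illumina/Affymetrix separation along PC1 would disappear.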
10. We have now finished the data processing for meta-analysis. Click “Proceed” to move to the statistical and functional analysis, which will be covered in Basic Protocol 8.
The public ExpressAnalyst web server currently allows users to upload and process a maximum of 10 gene expression datasets, with a maximum of 50 MB per dataset. Large-scale meta-analysis of multiple transcriptomics datasets is a complex process and can be computationally very expensive. Such tasks are better suited to dedicated bioinformaticians using our ExpressAnalystR package than to the web interface.
Basic Protocol 8: STATISTICAL AND FUNCTIONAL META-ANALYSIS OF GENE EXPRESSION DATA
The objective of meta-analysis is to identify genes that are consistently dysregulated across multiple datasets (Xia et al., 2013). There are two general approaches: 1) directly merge the datasets and then perform a single DEA on the merged data, or 2) perform DEA on each dataset separately and then combine the summary statistics (i.e., p-values or effect sizes). ExpressAnalyst supports both approaches, although in general we do not recommend direct merging unless the datasets are very similar, e.g., all measured using the same platform and protocol. After statistical analysis, functional analysis can be performed on the list of DEGs identified during meta-analysis using the same set of visual analytics tools introduced in Basic Protocols 3 to 5, plus an additional tool specifically for comparing results across multiple datasets. Here, we continue analyzing the processed data from Basic Protocol 7 to perform statistical and functional meta-analysis.
Necessary Resources
Hardware
- A computer with internet access
Software
- An up-to-date web browser such as Google Chrome, Mozilla Firefox, or Safari, with JavaScript enabled (see Internet Resources)
Files
- None
1. This protocol requires Basic Protocol 7 for meta-analysis data processing and normalization. If you have not just completed Basic Protocol 7, please complete it first. You should be on the “Identifying significant features with complex metadata” page to begin this protocol. Make sure that “Control” is chosen as the reference group, change “Contrast” to “Infected,” and leave the rest as default. Click “Submit.”
This is the same interface as the “Complex Metadata” tab on the differential expression page for the single gene expression table workflow. The statistical concepts behind this interface were discussed in detail in Basic Protocol 2; please read this protocol for more details.
2. A graphical summary of the results is generated for each dataset. Scroll down to the “Omics Data #1” panel and click one of the points to view the corresponding feature plot (Fig. 39B). Click the table icon in the top right of the “Omics Data #1” panel to view a table of the summary statistics (Fig. 39C). Once you have finished examining the results, click “Proceed.”
After DEA, the left panel summarizing each of the datasets is updated with the number of DEGs (Fig. 39A). Here, we see that there are many DEGs for the two Illumina datasets (2961 and 2876) and only a few for the Affymetrix dataset (33). The results from each of the datasets are displayed in a panel (Fig. 39B). The x-axis is the -log10(p-value) (larger values are more significant) calculated using a model with only the primary metadata (no covariates). The y-axis is the -log10(p-value) calculated with the complete specified model. Normally, this plot shows how including covariates in the model changes the p-values: points in the top left (y-value > x-value) are genes whose p-values become more significant after including covariates, whereas points in the bottom right (y-value < x-value) are genes whose p-values become less significant after the adjustment. In this case, we only have one metadata variable (infection status), so we could not include any covariates in the model; hence the x- and y-values are exactly the same.
3. The next step is to integrate the DEA results across datasets. Look over all the integration strategies. Go to the first method (“Combining P-values”), leave the method as “Fisher's method,” and click “Submit” and then “Proceed.”
ExpressAnalyst currently offers four categories of integration methods: 1) combining p-values; 2) combining effect sizes (log2FC divided by standard deviation); 3) vote counting, which simply counts the number of datasets in which a gene is differentially expressed and keeps the genes that pass a specified count threshold; and 4) directly merging the normalized data, followed by a single DEA. The first two are recommended, as they are the most robust integration methods: they use summary statistics that are less susceptible to study-specific effects. The “Combining P-values” method should result in 4088 DEGs after combining statistics across all three datasets.
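Fisher's method itself is only a few lines: for a gene with independent p-values p_1, …, p_k across k studies, the statistic X = -2 Σ ln(p_i) follows a chi-squared distribution with 2k degrees of freedom under the null. The sketch below (a hypothetical helper, not ExpressAnalyst's code) uses the closed-form chi-squared survival function available for even degrees of freedom, so only the standard library is needed:

```python
import math

def fisher_combine(pvals):
    """Combine independent p-values with Fisher's method.

    X = -2 * sum(ln p_i) ~ chi-squared with 2k degrees of freedom.
    For even df, P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!,
    so the survival function can be computed exactly without SciPy.
    """
    k = len(pvals)
    stat = -2.0 * sum(math.log(p) for p in pvals)
    half = stat / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
```

For example, a gene with p-values 0.04, 0.03, and 0.20 across three studies combines to roughly p ≈ 0.011, more significant than any single study, which is why borderline genes can become consensus DEGs after integration.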
4. Click “Previous” to go back to the meta-analysis page. Go to the second strategy (“Combining Effect Sizes”). There are two different methods for performing this integration, using a “Fixed Effects Model,” or using a “Random Effects Model.” Click the “Cochran's Q-Test” button.
Cochran's Q test is the traditional test for heterogeneity in meta-analyses. The Q value is calculated as the weighted sum of squared differences between individual study effects and the pooled effect across studies. When the estimated Q values approximately follow a chi-squared distribution, the fixed effects model assumption is appropriate. In this case, the data deviate from the 1:1 relationship (red dashed line), so we should model our data with a random effects model.
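The Q statistic described above is straightforward to compute. The sketch below (a hypothetical helper, not ExpressAnalyst's implementation) uses inverse-variance weights and the fixed-effects pooled estimate; under homogeneity, Q follows a chi-squared distribution with k - 1 degrees of freedom:

```python
def cochran_q(effects, variances):
    """Cochran's Q: weighted sum of squared deviations of per-study
    effect sizes from the inverse-variance pooled effect.

    effects: per-study effect sizes (e.g., standardized log2FCs)
    variances: per-study variances of those effects
    """
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
```

When Q is much larger than its expected value of k - 1, the studies are heterogeneous and a random effects model, which adds a between-study variance component, is the safer choice.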
5. Select “Random Effects Model” from the dropdown menu in the “Combining Effect Sizes” section. Leave the p-value threshold as 0.05. Click “Submit” and “Proceed.”
6. There are 1703 significant results, a good number for visualization and functional analysis. Click “Proceed.”
There are three columns showing summary statistics of individual DEA results followed by the combined effect size and p-values. By default, summary statistics show log2FCs, but we can switch to view p-values if preferred (Fig. 40).
7. The analysis overview has the same tools that were introduced in Basic Protocols 3 and 4. Click “Enrichment Network.” In the pop-up, leave “Meta-analysis results” selected and click “Proceed.”
8. In the “Enrichment Analysis” pane, change the “Type” to “GSEA” and click “Submit.” There are many significant results, and the labels are all overlapping and difficult to read. Find the horizontal toolbar and change the “Node” dropdown to “Label.” Go to the “Display” tab in the “Node Label Customization” dialog, change the dropdown to “Unlabel all nodes” and click “Submit” (Fig. 41).
9. Go to the results table. Select the top 10 most significant pathways. Change “Label” to unspecified and then back to “Label” to access the pop-up again. Change the dropdown to “Label selected nodes” and click “Submit.” You will see that all the selected nodes are now labeled. However, these nodes are colored with the same highlight color. To restore their default color styles based on expression profiles, go to the results table, check, and then immediately uncheck the “Select all” box to remove all the checked boxes (Fig. 41).
10. The network is now much more interpretable and easier to read, with only the key nodes labeled. Some of the labels are still overlapped by other nodes. Click and drag individual nodes until every label is visible (Fig. 42).
11. Navigate back to the “Analysis Overview” and click the “Upset Diagram” button. Keep the defaults: all three datasets included but leave the “Meta-analysis features” unchecked. Click “OK.”
12. The diagram allows us to explore DEG overlap across datasets (Fig. 43). Hover your mouse over any bar to see how many of those genes are present in the other intersections.
There are only 16 consensus DEGs across all three datasets (the last column with all three circles filled in). Many of the DEGs are shared between the two Illumina datasets (2961 and 2876 DEGs in E-GEOD-59276 and E-GEOD-25713, respectively; 2050 are shared).
13. Click the last column of dots corresponding to the DEGs that were present in all datasets. This will display the gene names in the bottom left “Feature Members” panel. Now go to the “Enrichment Analysis” panel and click “Submit.” This will perform ORA with the KEGG database using the genes in the “Feature Members” panel.
14. Go to the navigation bar at the very top of the page and click the “Downloads” link to view and download the results.
Basic Protocol 9: FUNCTIONAL ANALYSIS OF TRANSCRIPTOMICS SIGNATURES
Differential expression analysis (DEA) produces lists of differentially expressed genes or proteins, also known as transcriptomic or proteomic signatures. These lists can be saved, and then uploaded to ExpressAnalyst directly through the list upload in the “Statistical and Functional Analysis” module. DEA results are also commonly available as supplementary files in the published literature, providing another source of transcriptomic signatures. Since the data are small relative to an omics data table, we can easily upload multiple lists simultaneously to visually compare them. We can also perform functional analysis on the list of uploaded genes. In this protocol we introduce the different list formats that are accepted by ExpressAnalyst, and then demonstrate how to compare multiple uploaded lists with a heatmap.
Necessary Resources
Hardware
- A computer with internet access
Software
- An up-to-date web browser such as Google Chrome, Mozilla Firefox, or Safari, with JavaScript enabled (see Internet Resources)
Files
- None
1. Go to the ExpressAnalyst home page and click “Start Here” to view the module overview. Find the “Statistical & Functional Analysis” section and click the “Start Here” button underneath the list input type.
2. Go to “Try Examples,” keep “Gene list 1” selected, and click “Submit.” This will populate the data input with a list of IDs, along with the appropriate species and ID type.
This is an example of a list that is uploaded with log2FC values (Fig. 44A). The log2FC values are optional, but if included, are used to annotate some of the results, for example the gene nodes in the enrichment network bipartite view. Here you can see that there is a header (starting with “#”) specifying the ID and log2FC columns. This header is optional; ExpressAnalyst will still recognize the data without the header.
3. Click “Try Examples” again, select the last example data called “Multiple Lists,” and click “Submit.”
This is an example with multiple lists. If you scroll through the text area with the IDs, you can see that two of the lines are “//.” This is used to separate different lists, so in this case, there are three lists separated by two “//” breaks. Note that in this example, there is no header (row starting with “#”) and no log2FC values.
4. Click the “Upload” button beneath the text input area, and then click “Proceed.”
This is the same overview as for the other workflows, with some visual analytics tools disabled if they require expression values for each sample, or p-values. For lists, we are still able to use the enrichment network, ridgeline plot, ORA heatmap (a different format compared to the expression table upload), and the upset diagram.
5. Find the heatmap icon and click the “ORA” button. This will show a new type of heatmap, designed to compare multiple lists (Fig. 45).
The overall design of the heatmap interface is the same as for an expression table upload. Please see Basic Protocol 4 for a detailed introduction to the interface components. The main difference is that instead of visualizing expression values, the heatmap is used for visualizing list overlap. If a feature does not appear in a specific list, the respective cell is gray. If it is in the list, it has a color on a scale between yellow and red: a feature in only one list is yellow, and a feature in all lists is red. For frequencies in between, the cell is given an orange shade, where the exact shade is determined by the fraction of the uploaded lists that contain the feature.
6. In the “Overview” panel on the left, use your mouse to drag-and-select the rows containing red and orange cells. Those selected rows will show in the “Focus View.” Then, go to the “Enrichment Analysis” panel on the right and click “Submit.”
These results are the pathways that are enriched in the transcripts that are present in multiple uploaded lists as shown in the Focus View (Fig. 45). We can think of this as a shortcut to the functional meta-analysis performed in Basic Protocol 8.
7. Navigate back to the “Analysis Overview,” find the network icon and click the “Enrichment Network” button. A dialog with a dropdown will appear. We have the option of choosing one of the three uploaded lists; leave “datalist 1” selected and click “Proceed.”
The interface is almost identical to that of the “Enrichment Network” analysis that we performed after processing an expression table instead of a list. The only significant difference is that GSEA is not available for functional analysis of lists. The purpose of including this step in Basic Protocol 9 is to highlight the functional analyses that we can perform on lists. We direct users to Basic Protocol 3 for the concepts behind functional analysis and detailed steps on using the enrichment network interface.
8. Click the “Downloads” button to view all the results from this protocol.
Basic Protocol 10: DOSE-RESPONSE AND TIME-SERIES DATA ANALYSIS
As the cost of acquiring transcriptomics datasets decreases over time, we see more datasets that measure groups of samples along a continuous dimension, such as chemical doses or time points. A statistical method called “dose-response analysis” was developed in the field of toxicology to identify the concentration at which a biological assay responds to chemical exposure (Thomas et al., 2013). Dose-response experimental designs typically include a control group (dose = 0) and at least four different dose groups, typically with the same number of replicates in each group. To perform transcriptomics dose-response analysis, the data are processed and normalized according to standard protocols, and then differential expression analysis is used to identify genes that have a relationship with dose. All genes that pass the DEA filters are used for dose-response curve fitting, in which a suite of linear and non-linear curves is fitted to the expression of each gene, and the best fit model for each gene is kept. Next, the curve is analyzed to determine the precise concentration at which the fitted curve departs from the expression values in the control group (called the gene benchmark dose, or BMD). The collection of gene BMDs can then be analyzed at the pathway or whole-transcriptome level to determine the concentration at which specific pathways respond, or the concentration at which we observe a robust transcriptomic response. ExpressAnalyst uses the FastBMD implementation of dose-response analysis (Ewald et al., 2021).
While this method was developed for analyzing dose-series data in a toxicology context (National Toxicology Program, 2018), the same statistical approach can be used to analyze data measured along any continuous gradient, for example time-series or even temperature. In this protocol, we introduce BMD analysis with a chemical exposure dataset. Rats were exposed to five concentrations of bromobenzene for 2 weeks, at which point transcriptomic data were measured with microarrays in liver tissue (Thomas et al., 2013).
Necessary Resources
Hardware
- A computer with internet access
Software
- An up-to-date web browser such as Google Chrome, Mozilla Firefox, or Safari, with JavaScript enabled (see Internet Resources)
Files
- None
1. Go to the ExpressAnalyst home page and click “Start Here” to view the module overview. Find the “Statistical & Functional Analysis” section and click the “Start Here” button that is underneath the single table input type.
2. Go to “Try Examples” and select the “Bromobenzene” example data. Click “Submit” and then “Proceed.”
We are using a built-in example dataset, so we do not need to specify any parameters during upload. When uploading your own data, make sure to change the analysis type from the default “Differential Expression Analysis” to “Time series/Dose response.” The metadata must also be formatted in a specific way. Make sure to input the actual concentration values, e.g., in the example data the metadata looks like Figure 46. If the actual numerical values are not used, ExpressAnalyst will give an error. In addition, you should have at least two replicates for each dose group because the dose/time group standard error is used in various steps of the analysis.
3. The “Omics data overview” shows that there is a control group (dose = 0) and five dose groups (doses = 25, 100, 200, 300, and 400). When you are done viewing the data, click “Proceed.”
These doses are on a linear scale. It is more common to see concentration series on a log or semi-log scale; options in later steps let you change the scale for visualization, which is typically necessary for log and semi-log data. A linear design like this one is more representative of other commonly measured continuous gradients, such as time, which are unlikely to be on a log scale.
4. This dataset was already normalized. Leave the filtering defaults, select “None” for normalization, and click “Submit.” When you are done viewing the PCA and box plots, click “Proceed.”
The PCA plot shows that a substantial amount of variability within the dataset is related to dose, with samples going from low dose to high dose along PC1 (Fig. 47). This is a good sign for dose-response analysis.
5. In dose-response analysis, DEA is a pre-filtering step that removes genes that do not change across the measured conditions before the computationally intensive curve fitting. Because the experimental design is standardized, the interface is greatly simplified. Make sure that “DOSE” is selected as the dose/time factor and that “0” is selected as the control condition. Leave the “Statistical method” as “Limma.” Click “Submit” and then “Proceed.”
6. There should be 2018 significant features. Change the “Adjusted p-value” from 0.05 to 0.01 and click “Submit” to update to 1198 significant features. Click the “Graphical Summary” to view feature plots of the top few rows (Fig. 48). Change the “Selected Comparison” dropdown to DOSE_300 vs. DOSE_0. You should see that the log2FC column changed, but that the p-values stayed the same. Click “Proceed.”
In Basic Protocol 2, we explained how DEA is conducted by specifying metadata variables to include in a linear model, fitting the model to the gene expression data, and then extracting moderated t-statistics and p-values related to specific terms from the fit object. If we do not specify a single term, limma returns moderated F-statistics and p-values, similar to an ANOVA test, to assess whether there are significant differences between any of the metadata groups.
Curve fitting is computationally intensive. It can take ∼5 min to fit curves to 2000 genes. To manage server load, the ExpressAnalyst public server currently allows a maximum of 2000 genes for curve fitting. If there are many significant features and curve fitting is too slow, you can adjust the p-value and/or fold-change thresholds to reduce the number of results to between 500 and 1000.
7. Leave the defaults (Fig. 49A) and click “Submit.” This is quite computationally intensive, hence this step could take several minutes to complete.
There are ten different models listed in the “Fit Models” section. It is recommended to select all except for the higher-order polynomials (Poly3 and Poly4), which should only be used if you expect to see a non-monotonic response. The lack-of-fit p-value is used to filter out curves prior to calculating the gene BMD. It has the opposite interpretation to the p-values used in DEA: here, we filter out curves whose p-values fall below the cut-off.
The final list of gene BMDs is determined in several steps. First, the mean and standard deviation of the control samples is calculated. Then, the concentration at which the fitted curve first surpasses the mean ± 1 standard deviation is identified. We can change the number of standard deviations considered by adjusting the benchmark response (BMR) factor. The 95% confidence intervals for the gene BMD are calculated, and if the ratio between the upper and lower limit is >40, we filter out that result. Finally, any gene with a BMD higher than the highest dose is also filtered out.
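The threshold-crossing step can be illustrated with a dense grid search over the fitted curve. This is a simplified stand-in for the actual curve analysis; the function and parameter names below are hypothetical, and a linear curve is used only to make the example concrete:

```python
import numpy as np

def gene_bmd(curve, doses, control_mean, control_sd, bmr_factor=1.0, n_grid=10000):
    """Approximate the benchmark dose (BMD): the lowest dose at which the
    fitted curve first leaves the band control_mean +/- bmr_factor * control_sd.

    curve: the fitted dose-response function (callable on an array of doses)
    doses: the tested doses, defining the search range
    bmr_factor: the benchmark response (BMR) factor, in control SDs
    """
    grid = np.linspace(min(doses), max(doses), n_grid)
    outside = np.abs(curve(grid) - control_mean) > bmr_factor * control_sd
    if not outside.any():
        return None  # curve never departs from the control band
    return grid[np.argmax(outside)]  # first grid point outside the band
```

For a linear fit f(d) = 0.01 d with a control mean of 0 and SD of 0.5, the curve crosses the one-SD band at dose 50, so the approximated BMD lands just above 50 at this grid resolution. The CI-ratio and maximum-dose filters described above would then be applied to the resulting BMDs.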
The density plot (Fig. 49B) shows the distribution of all gene-level BMDs. It is annotated with several measures of the “transcriptomic point-of-departure” (tPOD), which are aggregated measures of the concentration at which we observe a robust transcriptomic response. The bar plot (Fig. 49C) shows the distribution of models used to calculate the gene BMDs, including both those that passed all the filters (teal) and those that were filtered out (dark gray).
8. Navigate to the results table and click the button in the “View” column for several of the genes.
The individual gene plots show each of the expression values (black points), the fitted curve (blue line), the gene BMD (solid red line) and the lower and upper 95% CI of the gene BMD (dashed red lines). The gene plotted in Figure 50 has such a strong response that the CI is extremely narrow and the dashed and solid lines are nearly on top of each other.
9. Click “Proceed” to go to the “Analysis Overview.” Click the “Accumulation Plot” button.
10. By default, pathway analysis was performed with the KEGG database.
Pathway BMD analysis starts by performing overrepresentation analysis (ORA) on the list of genes with gene BMDs. For more details on ORA, please see Basic Protocol 3. A pathway-level BMD, or the concentration/time where that pathway is activated, is estimated as the median BMD of all gene BMDs in that pathway.
The pathway analysis results are shown in the left “Functional Enrichment Analysis” panel. The accumulation plot of the top five most significant pathways is displayed in the center area. Clicking a pathway row displays all gene-level results for that pathway in the “Current Gene Selection” panel on the top right. Clicking a gene, either within the “Current Gene Selection” panel or within the accumulation plot, displays the fitted curve in the bottom right.
11. There is one pathway with a very high number of hits (“Metabolic pathways”) that makes it difficult to see the differences between the other pathways. Click the “Metabolic pathways” row in the results table to remove it from the plot. Now the other pathways become easier to see (Fig. 51).
12. Click the colored box next to the “Database” label in the top left (Fig. 51) and choose a new color. Next, click the next most significant pathway that is not highlighted to add an accumulation plot for that pathway in the selected color. Repeat these two steps to add a few more pathways to the diagram.
Accumulation plots (also called cumulative distribution plots) are powerful tools for visually comparing many distributions to each other. Here, we use them to compare the distributions of BMD values across different pathways. The x-axis reflects the continuous metadata that we used for curve fitting, for example concentration values if this was a dose-response analysis or time points if this was a time-series analysis. The y-axis is the rank of the BMD value within that pathway.
Accumulation plots are very informative. First, the maximum height of each accumulation plot corresponds to the number of hits within that pathway: pathways with more gene hits have taller curves because they ‘accumulate’ more points. Second, areas of the accumulation plot with a steep incline correspond to concentration/time ranges with a high density of gene BMD results. For example, the orange curve (“Glutathione metabolism”) rises relatively quickly between 10 and 100 mg/kg and then slowly between 100 and 300 mg/kg. Third, the relative location along the x-axis shows which pathways are generally activated at lower vs. higher concentrations. The orange pathway is clearly activated at lower concentrations than the other plotted pathways, for example the royal blue “Steroid hormone biosynthesis” pathway.
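Both pathway-level quantities described in this protocol are simple to compute: the accumulation plot is each pathway's sorted gene BMDs plotted against their ranks, and the pathway BMD is the median of those values. A minimal sketch with hypothetical helper names:

```python
import statistics

def accumulation_points(bmds):
    """Coordinates of a pathway's accumulation (cumulative) plot:
    x = sorted gene BMDs, y = rank of each BMD within the pathway."""
    return [(x, rank) for rank, x in enumerate(sorted(bmds), start=1)]

def pathway_bmd(bmds):
    """Pathway-level BMD: the median of the member gene BMDs."""
    return statistics.median(bmds)
```

This makes the three visual properties above concrete: curve height is len(bmds), steep regions are runs of closely spaced sorted BMDs, and left-shifted curves have smaller medians (earlier pathway activation).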
13. Click the underlined pathway name in the “Current Function” panel (below the enrichment result table) to generate a heatmap summarizing all fitted curves within that pathway. Hover your mouse over a point in the accumulation plot and click to view the fitted curve for that gene.
14. Click the “Downloads” link on the top navigation tracker to view and download all the results generated during the analysis session.
Basic Protocol 11: RNA-seq READS PROCESSING AND QUANTIFICATION WITH AND WITHOUT REFERENCE TRANSCRIPTOMES
The first step in an RNA-seq analysis starts with quantification of the raw reads from FASTQ files. This involves performing quality control, trimming low-quality reads, and aligning cleaned reads to a reference genome or transcriptome (Conesa et al., 2016). This is the most computationally intensive part of a basic RNA-seq analysis pipeline, and ExpressAnalyst provides two options: users may upload compressed FASTQ files (.fastq.gz) to the ExpressAnalyst server for remote processing, or they may install the ExpressAnalyst stand-alone (ExpressAnalystSA) Docker for local processing (Liu et al., 2023). When using the remote option, data upload is limited to four concurrent users, users are limited to a maximum 4-hr upload session, and each user may only store and process 30 GB of FASTQ files at one time. In contrast, the local option allows users to avoid the time-consuming data upload step and does not impose any limitation on the dataset size, although it is slightly more complicated to set up for the first time.
In this protocol, we guide users through installing the ExpressAnalystSA Docker and performing RNA-seq reads quantification with kallisto and Seq2Fun (Bray et al., 2016; Liu et al., 2021). Downstream statistical and functional analysis are covered in Basic Protocols 2 to 5. Note that if users do not envision performing raw data processing and are primarily interested in analyses beginning with a count table, they can skip this protocol and proceed directly to Basic Protocol 2.
Necessary Resources
Hardware
- A computer with internet access, equipped with Intel or AMD CPU, and at least 16 GB RAM and 250 GB free storage
Software
- An up-to-date web browser such as Google Chrome, Mozilla Firefox, or Safari, with JavaScript enabled (see Internet Resources)
- Docker Desktop (see Internet Resources)
Files
- None
1. Go to the ExpressAnalyst homepage and click on the “Tutorials” tab. Click the download link for “Basic Protocol 11 (FASTQ files).” This will navigate to the Xia Lab file server, which hosts large datasets and databases for download. Click “Download.”
This dataset contains 18 sub-sampled FASTQ files, each of which is ∼54 MB. The data are from double-crested cormorant, which does not have a reference transcriptome. There are three experimental conditions (control, medium exposure to chlorpyrifos, and high exposure to chlorpyrifos), with three replicates per condition (Desforges et al., 2021). The full FASTQ files are between 3 and 4 GB per sample, and the full dataset had five replicates per condition. This highly sub-sampled dataset is provided to decrease download and processing time for educational purposes. However, the resulting count table should not be analyzed as the sub-sampled data do not provide reliable gene expression quantification.
2. Expand the zipped file. Inside there should be 18 FASTQ files and a “metadata.txt” file. Right-click the “metadata.txt” file and open it with a spreadsheet program like Microsoft Excel.
The metadata file must be formatted in a very specific way so that ExpressAnalyst can properly recognize the files. RNA-seq datasets usually have paired-end reads, meaning that two FASTQ files are generated per sample: one contains reads sequenced in the 5′ to 3′ direction along the forward strand, and the other contains reads sequenced in the 5′ to 3′ direction along the reverse strand. In the metadata file, each complete sample gets one row. Two files from the same sample should have exactly the same first part of the file name (prefix), differing only in their suffix, where one indicates forward reads and the other indicates reverse reads (Fig. 52). The sample and group columns are used to label the output results.
A common mistake is using inappropriate sample names, filenames, or group values. This table is read directly into R, so all table values should be machine readable and contain no spaces (use “_” instead) or special characters such as “(”, “/”, or “%.”
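A quick way to catch such problems before uploading is to screen the metadata values against an allowed character set. The helper below is a hypothetical pre-flight check based on the rule described above, not part of ExpressAnalyst:

```python
import re

# Allow letters, digits, underscores, dots, and hyphens only; spaces and
# characters like "(", "/", and "%" would break the R import.
SAFE = re.compile(r"^[A-Za-z0-9_.-]+$")

def check_metadata_values(values):
    """Return the sample/group/filename values that are not machine readable."""
    return [v for v in values if not SAFE.match(v)]
```

For example, check_metadata_values(["ctrl_1", "high dose (2)"]) flags "high dose (2)", which should be renamed to something like "high_dose_2" before upload.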
3. Start the Docker software on your computer. If you are using Docker Desktop, click the Docker icon and wait while the software initializes. You can tell that Docker Desktop is running if the Docker Desktop window shows an overview of your Docker containers, or if you click the small Docker icon (next to where the Wi-Fi connection status is shown) and the dropdown menu says, “Docker Desktop is running.” We do not include screenshots here because Docker Desktop looks different for different operating systems.
This protocol requires the Docker software to be installed on your computer. Implementing Docker on your system is generally outside the scope of this tutorial as we do not have control over Docker configuration or future Docker updates; however, there are up-to-date troubleshooting tips on our ExpressAnalyst Docker Hub page (see Internet Resources). Some operating systems, especially Windows, may require additional steps to get Docker running properly.
4. To download the most recent version of the ExpressAnalystSA Docker image to your computer, open your command line, copy-paste this text and hit enter:
- docker pull dockerxialab/expressanalyst_docker:latest
It is good practice to run this command each time that you process a dataset, to ensure that you are using the latest version. If you want to use a specific version, for example to exactly reproduce a previous analysis, you can go to https://hub.docker.com/r/dockerxialab/expressanalyst_docker/tags to see all previously published versions. Each version has a “docker pull” command that can be copy-pasted into your command line. Also, note that Windows operating systems typically have two command line programs: “PowerShell” and “Command Prompt.” In our experience, “PowerShell” is the better choice for running Docker containers.
5. Determine your home directory. This can vary depending on your operating system.
For macOS, the home directory is usually “/Users/” followed by your name, for example “/Users/jessica.” For Linux, we recommend using the root, which is just “/.” Windows is more complicated because it can have multiple drives to choose from and the slashes are in the other direction. One example of a Windows home directory is “C:\jessica.”
One way to determine the home directory is to go to the command line and enter “cd ~” for Linux or Mac or “cd $home” for Windows in PowerShell, and then “pwd.” This will navigate to your home directory and then print out the path.
6. Stay in the command line and enter the command:
docker run -ti --rm -p 8080:8080 -v HOME_DIRECTORY:/data dockerxialab/expressanalyst_docker:latest
but replace the words HOME_DIRECTORY with the home directory that you determined in the previous step. For example, with a home directory of “/Users/jessica,” the complete command would be:
docker run -ti --rm -p 8080:8080 -v /Users/jessica:/data dockerxialab/expressanalyst_docker:latest
Various messages will be printed into your command line while the Docker container is initializing. When you see “ready in ## #### (ms)” (Fig. 53), the Docker container is ready to use.
The last part, dockerxialab/expressanalyst_docker:latest, is the name of the ExpressAnalyst Docker image. Entering docker run IMAGE_NAME is how you would run any Docker image that you have pulled from Docker Hub. The parameters in between specify how we want to run this particular container.
The -ti part tells Docker that we want to run the container interactively, allowing us to give ExpressAnalyst information while we are running it (upload input files, click buttons, etc.). The --rm part tells Docker to clean up temporary files on your computer after you stop running the ExpressAnalyst container. The -p 8080:8080 allows us to access the ExpressAnalyst container through a web browser, by entering localhost:8080 in the URL bar. Finally, the -v HOME_DIRECTORY:/data tells Docker to mount your home directory inside the container at a directory called /data, so that the container can read and write your files while it is running.
7. Open a web browser (we recommend Chrome) and type localhost:8080/ExpressAnalystSA/ in the URL bar. You should see the ExpressAnalyst homepage. Click “Start Here.”
8. We first need to download the reference transcriptome and the ortholog database from the “Databases” page (https://www.expressanalyst.ca/ExpressAnalyst/docs/Databases.xhtml) (Fig. 54A). On the “With a Reference Transcriptome” tab, find the Gallus gallus (chicken) reference transcriptome and click the download icon (Fig. 54B). When the download link finishes loading, click the “Download” button. The file size is 644 MB.
For kallisto-based RNA-seq processing, users must download the reference transcriptomes that have been specifically indexed for kallisto. ExpressAnalyst makes pre-indexed reference transcriptomes available for download for 22 common species. If your species is not available, you can download the reference transcriptome from either NCBI or Ensembl and perform the indexing yourself using the kallisto software. This is beyond the scope of this protocol.
As mentioned before, there is no reference transcriptome for double-crested cormorant, so we will align RNA-seq reads to the chicken transcriptome as this is the most closely related, high-quality reference transcriptome available.
9. Navigate to the “Without a Reference Transcriptome” tab, find the “Birds” database, and click the download icon (Fig. 54C). When the download link finishes loading, click the “Download” button. The file size is 170 MB.
Here, we outline how to use kallisto while later steps outline how to use Seq2Fun. If you are only interested in Seq2Fun, you do not have to execute the kallisto steps as the two analyses are independent. However, we recommend still reading the steps as some concepts are explained in more detail in the kallisto workflow.
There are 28 databases corresponding to 28 different taxonomic groups. While sequences from ∼700 species were used to define the ortholog groups, we have filtered the database according to different NCBI taxonomy categories of species. When aligning reads to an ortholog database, most hits will be to sequences from more closely related species. Since the size of the database directly impacts the computational efficiency, we can decrease the run time without greatly impacting the quality of the results by choosing a narrower taxonomic group.
If performing a cross-species analysis, you should choose a database that is sufficiently general to cover all species (i.e., choose the vertebrates database if analyzing and comparing data from birds and fish). More details on the species coverage of each database and ortholog group are exposed via the EcoOmicsDB web interface, which was described in Basic Protocol 5.
10. We will create a data directory for the analysis with a reference transcriptome. Create a folder called “Process_Kallisto.” Inside the folder, create three more folders: “DATABASE,” “FASTQ,” and “RESULTS.” Move the downloaded FASTQ files to the “FASTQ” folder and move the downloaded reference transcriptome to the “DATABASE” folder (Fig. 55).
The ExpressAnalyst Docker is configured to recognize this specific directory structure and set of folder names. The names are case sensitive: if you do not use all capital letters, ExpressAnalyst will not recognize the structure. The FASTQ folder should only include FASTQ files, and the filenames must exactly match the filenames in the metadata file. The DATABASE folder should only include the reference transcriptome file (ending in either “.idx” or “.idx.gz”).
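From the command line, the folder structure described in step 10 can be created in a single command. The folder names below are the exact, case-sensitive names that ExpressAnalyst expects; the commented mv lines are placeholders for your own download locations:

```shell
# Create the data directory with the three required sub-folders.
mkdir -p Process_Kallisto/DATABASE Process_Kallisto/FASTQ Process_Kallisto/RESULTS

# Then move your downloads into place, for example (paths will differ):
# mv ~/Downloads/*.fastq.gz  Process_Kallisto/FASTQ/
# mv ~/Downloads/*.idx.gz    Process_Kallisto/DATABASE/
```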
11. For Mac or Linux users, double-click the transcriptome file to decompress it (“.idx.gz” to “.idx”). When it is finished, delete the compressed transcriptome file to save space on your computer. If you are using a Windows computer and know how to use your command line to decompress “.gz” files, you can do this; otherwise, leave the file as is.
ExpressAnalystSA has a built-in method to decompress “.gz” files, but it may be slower than the more advanced decompression tools available on your operating system.
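If you prefer the command line on any operating system, gunzip performs the decompression in one step. The snippet below builds a tiny stand-in file so it can be run anywhere; with the real index, only the final command is needed (substitute your actual filename):

```shell
# Create a small stand-in ".idx.gz" file for demonstration purposes.
printf 'demo' > reference.idx
gzip reference.idx                 # produces reference.idx.gz

# gunzip restores reference.idx and deletes the compressed copy
# automatically, so no manual cleanup step is needed.
gunzip reference.idx.gz
```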
12. Go back to the browser tab where the ExpressAnalystSA Docker is running. Keep the “Data Type” as “Paired-end” and keep the “Analysis Type” as “With reference transcriptome (Kallisto).” Determine the relative path to your data directory (Fig. 56B) by taking the full path to your data directory and removing the home directory path that you used in the “docker run” command. Enter the relative path into the “Data directory” text input. Choose the “metadata.txt” file for the “Metadata file,” click “Submit,” and then click “Proceed.”
This step is tricky for users who are not familiar with Docker. Make sure that the metadata.txt file is stored outside of the job folder. Also make sure that there are no spaces in any of the folder names in the data directory path. If you are using Windows, make sure that the data directory is on the same drive as the home directory that you used in the “docker run” command, otherwise Docker will not be able to access the files.
If you are starting with a compressed transcriptome file, ExpressAnalyst will first decompress the “.gz” file which can take 5 to 10 min. If you run a new analysis starting with a “.idx” instead of a “.idx.gz,” the initialization step will be a few seconds.
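The relative path is simply the full path to the data directory with the mounted home directory prefix stripped off. A small shell sketch, using hypothetical paths (a “cormorant” sub-folder is assumed purely for illustration):

```shell
HOME_DIR=/Users/jessica                              # path used with "docker run -v"
DATA_DIR=/Users/jessica/cormorant/Process_Kallisto   # full path to the data directory

# Strip the home-directory prefix to get the path to enter in ExpressAnalyst:
echo "${DATA_DIR#$HOME_DIR/}"                        # prints "cormorant/Process_Kallisto"
```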
13. The Data Integrity Check page contains a summary of all the FASTQ files. Make sure that all files are there and properly labeled, then click “Proceed.”
14. Keep the “Minimum reads quality score” as 25. Find the number of cores on your computer. If you have >4, you can increase the number of cores to make the analysis run faster. It is also acceptable to leave this as 3. Click “Confirm,” click “Submit Job,” and then click “Confirm” in the dialog.
The “Minimum reads quality score” is a threshold used by the “fastp” software to filter out low quality reads prior to quantification with kallisto. Normal values are anywhere between 20 and 30.
15. The processing job has started. Wait until the “Current Status” says “COMPLETE” instead of “RUNNING,” then click “Proceed.”
The “Job Status View” provides a live summary of the progress (Fig. 57). The reads in each file are first filtered using fastp, and then mapped to the reference transcriptome using kallisto. Using a MacBook Pro with 3 cores, this analysis took ∼5 min to complete.
16. Click through the four different tabs to view the summary table and figures. When you are finished viewing the results, click “Download Results” in the bottom right.
The results table in the first tab summarizes the reads mapping. Two columns of interest are the “Clean reads rate (%)” (Fig. 58A, #1) and the “Mapping reads rate (%)” (Fig. 58A, #2). The clean reads rate indicates the percentage of reads that were retained after filtering out the low-quality reads. We should expect a high percentage for each sample (>90%). The mapping reads rate indicates the percentage of reads that were matched to the reference transcriptome. Normally, we would expect >50% of reads to be mapped for a newer reference transcriptome (i.e., the first or second official version) and >70% for a well-established reference (i.e., mouse, human, or another model organism). Here, the rate is lower (∼30%), which is expected since we are mapping reads to a reference transcriptome from a different species.
The PCA plot (Fig. 58C) of the raw counts shows clear separation between the high exposure group and the medium exposure group and control samples.
17. Download the “All_samples_kallisto_txi_counts.txt” file (Fig. 59A) and open it to view the format (Fig. 59B).
This count table can be directly uploaded to ExpressAnalyst under the “single gene expression input,” with the organism as chicken, the data type as RNA-seq counts, the ID type as Ensembl transcripts, and the “metadata included” option checked. Note that the values in the “group” column are from the “metadata.txt” file uploaded at the beginning. You can delete the metadata row from the count table and create a separate metadata table if you want to include more variables. Please refer to Basic Protocols 1 and 5 for more details on single gene expression table upload and processing.
18. Navigate back to ExpressAnalyst home and click “Start here” to initiate a new analysis.
ExpressAnalyst initiates a new session whenever the “Start here” button is clicked. We always recommend doing this instead of navigating back to the upload page and uploading new data. Starting a new session refreshes the settings and clears any objects stored in memory, leading to better performance.
19. Create a new folder called “Process_Seq2Fun.” Inside the folder, create three more folders: “DATABASE,” “FASTQ,” and “RESULTS.” Drag the FASTQ files from the previous analysis into the new FASTQ directory. Drag the downloaded birds ortholog database into the DATABASE directory.
We recommend moving the FASTQ files instead of copy-pasting them to avoid having two copies of these large files on your computer.
20. Decompress the database file. Move the “birds_annotation_v2.0.txt” and “birds_v2.0.fmi” files directly inside the DATABASE folder in the data directory. Delete the empty “birds” folder.
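Assuming the database download is a “.tar.gz” archive (the exact format of the download may differ; adjust accordingly), step 20 can also be done from the command line. The first three lines below only build a stand-in archive so the snippet can be run anywhere; with the real download, start at the tar -xzf line:

```shell
# Demo setup: create a stand-in archive (skip these three lines
# when working with the real download).
mkdir -p birds Process_Seq2Fun/DATABASE
touch birds/birds_annotation_v2.0.txt birds/birds_v2.0.fmi
tar -czf birds.tar.gz birds && rm -r birds

tar -xzf birds.tar.gz                  # decompress the archive
mv birds/birds_annotation_v2.0.txt birds/birds_v2.0.fmi Process_Seq2Fun/DATABASE/
rmdir birds                            # delete the now-empty folder
```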
21. Select the files and parameters, and initiate the job. Change the “Analysis Type” to “Without a reference transcriptome (Seq2Fun).” Use the same metadata table that you used for the kallisto analysis. Click “Confirm” and then “Proceed.”
22. View the “Data Integrity Check” table. It should look the same as for the kallisto workflow since we are using the same samples. Click “Proceed.”
23. Leave the first three parameters as the defaults. If you changed the CPU cores in the kallisto analysis, change the CPU cores input here as well. Click “Confirm,” “Submit Job,” and then “Confirm.”
These parameters control how close the match must be between a translated RNA-seq read and a protein sequence to be considered a hit. The defaults should work for most cases; however, if there is very little coverage of your species’ taxonomic group in EcoOmicsDB, you can decrease these parameters to compensate for the greater evolutionary divergence between your species and most of the species in the database.
24. Seq2Fun is now running. Wait until the job view summary says “COMPLETED” and that 9 out of 9 samples have been processed. Click “Proceed.”
This analysis took ∼5 min on a MacBook Pro with 3 cores. The text output will be slightly different since Seq2Fun is being run instead of kallisto.
25. View results and interpret Seq2Fun-specific QA/QC parameters. Click “Download Results.”
The % of reads mapped (Fig. 58B, #3) is higher for Seq2Fun compared to kallisto. This makes sense because the Seq2Fun birds database contains sequences from 31 bird species instead of just from chicken. Also, by translating sequences from nucleotides to peptides and adding more tolerance for a small number of mismatches, Seq2Fun is designed to handle evolutionary divergence between the query and target sequences.
The most useful statistic to determine whether there was an adequate match between the sequences from your species and the EcoOmicsDB database is the “Mapping core ortholog rate (%)” (Fig. 58B, #4). Core orthologs are sequences that are present in >90% of species in the database. If there is a good match, we expect this number to be high (70% to 95%). We do not expect a rate of 100% because, while core orthologs are likely present in your species’ genome, many have tissue-specific expression patterns and may not be expressed in your samples. A low core ortholog mapping rate (i.e., 10% to 50%) is cause for concern, and you should perform the mapping again with a more general taxonomic group that includes more species.
The PCA plot (Fig. 58D) also shows a separation between the high exposure samples and the medium exposure and control samples. There is a notable difference between the location of the samples in the kallisto and Seq2Fun PCA plots. This is likely due to the low mapping rate to the reference transcriptome (since we were using the reference from a different species) and because the data were highly sub-sampled. In published evaluations of Seq2Fun, reads were mapped to the reference transcriptome for the species that the data came from (kallisto) and to a custom ortholog database with sequences for that species removed (Seq2Fun). The resulting PCA plots looked nearly identical in all cases (Liu et al., 2021; Liu et al., 2023).
26. Download the “S2fid_abundance_table_all_samples_submit_2_expressanalyst.txt” file and open it to view. We see that the format is the same as the file generated by kallisto, except that the rows are now labeled with Seq2Fun ortholog IDs instead of with Ensembl chicken transcripts. Open the “S2fid_ortholog_annotation_all_samples.txt” file (Fig. 60).
The abundance table can be directly uploaded to ExpressAnalyst online for downstream statistical and functional analysis using the same methods outlined in Basic Protocol 5. If you prefer to analyze the count table using a custom script, the annotation file provides functional annotation information that may be helpful for your analysis.
COMMENTARY
Background Information
As sequencing and mass spectrometry technologies continue to improve, more researchers are collecting these datasets, and the average dataset size and complexity are increasing. ExpressAnalyst is part of the Analyst tool suite, a collection of web-based platforms developed to allow researchers to easily analyze omics data through a user-friendly web interface, including MetaboAnalyst (metaboanalyst.ca) for metabolomics data analysis, NetworkAnalyst (networkanalyst.ca) for gene expression data analysis (before 2019), MicrobiomeAnalyst (microbiomeanalyst.ca) for microbiome data analysis, as well as OmicsAnalyst (omicsanalyst.ca) for multi-omics data integration (Lu et al., 2023; Pang et al., 2022; Zhou et al., 2021; Zhou et al., 2019). The core components of differential expression analysis and meta-analysis were previously published as NetworkAnalyst. In 2019, we decided to keep NetworkAnalyst as a dedicated platform for network analysis and visualization of gene lists, following the naming conventions of other network tools we developed for molecular signature analysis, such as miRNet (mirnet.ca) for miRNA lists and OmicsNet (omicsnet.ca) for multi-omics lists (Chang et al., 2020; Zhou & Xia, 2018). The core gene expression profiling and meta-analysis components were separated out to form a new platform, ExpressAnalyst, dedicated to transcriptomics and proteomics data profiling. The split allowed us to efficiently build out ExpressAnalyst to support more data formats, including raw data processing, and more complex experimental designs (Liu et al., 2023); it would have become too cumbersome to develop, and overwhelming for users to navigate, if all these components were included in the same platform. The core of ExpressAnalyst (published as NetworkAnalyst) was originally designed to accommodate one or two categorical metadata variables for a relatively small number of samples (typical dataset size of 6 to 20 samples covering 2 to 4 experimental conditions).
Updating ExpressAnalyst to accommodate larger and more complicated datasets required modifying nearly every page. We have made significant efforts to keep the interface and terminology consistent with previously published versions so that analyses can be reproduced, and the previously published protocols can still be followed.
Limitations
Despite its comprehensive support for statistical and functional analysis of gene expression data coupled with powerful interactive visualization, ExpressAnalyst currently does not support supervised machine learning methods for classification tasks, such as random forest or support vector machine (SVM) classifiers, or more advanced unsupervised clustering approaches, such as non-negative matrix factorization (NMF) or methods based on deep learning. ExpressAnalyst also does not yet support analysis of single-cell or spatial transcriptomics data, which have become increasingly common in recent years. We plan to implement functions to support these data types in the coming years.
Other Similar Tools
Gene expression data analysis is probably the most common omics data analysis task. Despite the tremendous progress made over the past two decades, it remains challenging for most clinicians and bench scientists. The community has taken two major approaches to address this gap: 1) the Bioconductor project (Gentleman et al., 2004), which encourages researchers to learn the R programming language to perform omics data analysis; and 2) the Galaxy project (Jalili et al., 2020), which offers a web interface for omics data processing. Both are very successful and are widely used by the research community. ExpressAnalyst couples well-established R packages with cutting-edge JavaScript libraries to provide streamlined gene expression analysis and visualization through its modern web interface. For machine learning approaches with graphical user interface support, we recommend the GenePattern platform (Kuehn et al., 2008).
Critical Parameters
One advantage of software-based protocols is that analyses can be performed again quickly when errors are made, without consuming any expensive materials. Also, by including many screenshots of results, readers should be able to identify accidental deviations from the protocol steps soon after they happen. With that said, there are two areas to which we draw attention: the protocols that are critical for understanding others, and the computing requirements for raw data processing.
Each of the Basic Protocols introduces a distinct concept; however, the protocols are not completely independent. Some explicitly depend on previous protocols; for example, Basic Protocols 1 to 4 are designed to be performed sequentially, as are Basic Protocols 7 and 8. More commonly, many steps are repeated in multiple protocols; however, the details on their statistical approaches and rationales are only outlined once, in the earliest protocol in which they appear. We have tried to indicate throughout the text where readers can go for more details; however, we highly recommend that readers perform Basic Protocols 1 to 3 first, as they introduce many fundamental concepts of transcriptomics analysis that are referred to throughout the other protocols. Basic Protocols 4 to 10 can be performed in any order, and readers can pick the topics that they are interested in.
Basic Protocol 11 requires users to have access to modern computing environments and to perform certain software configurations, as FASTQ file processing is a computationally intensive task. First, Docker must be installed locally and configured to find the FASTQ files and reference transcriptomes or ortholog databases on your computer. While we provide general guidelines and a protocol that should work for most people, it is not feasible to outline all possible issues that may occur on every operating system. Second, the reference genomes and ortholog databases are large files (several GB after decompression) and can take a long time to download, depending on the current file server load. Make sure to carefully read the computer specifications in the Basic Protocol 11 introduction before beginning the protocol.
Troubleshooting
Some common problems and their solutions are summarized in Table 1.
Problem | Possible cause | Possible solution |
---|---|---|
Empty results table | The analysis session has expired | Restart the analysis from the beginning; do not wait too long in between analysis steps (>20 min) |
Sudden errors on steps that previously worked | Running ExpressAnalyst on multiple tabs in the same browser caused the analyses to interfere with each other | Keep only one ExpressAnalyst tab open at a time |
ExpressAnalyst has a different interface on different browsers | If you ran ExpressAnalyst in the past, there may be some cached information in your browser that is preventing you from seeing the latest version | Clear your cookies and cache and refresh the tool |
The blue troubleshooting screen appears after data upload | The Xia Lab server has reached its capacity or may be temporarily down for maintenance | Check https://omicsforum.ca/ to see if other users are also reporting that the website is down; if not, open a post to notify the team |
Understanding Results
Basic Protocol 1
This protocol is designed to show users how to prepare an RNA-seq count table for downstream differential expression and functional analysis. This is typically the first step that most researchers encounter when analyzing transcriptomics data for the first time, hence we introduce the data and metadata formats. Figure 2 is a screenshot of the data; users can validate that they have uploaded the correct data by ensuring that their data matches the data used in the protocol. Figure 4 shows box plots and PCA plots both before and after normalization. The normalized data are the output of this protocol and are analyzed in Basic Protocols 2 to 4. If your results do not match Figure 4, go back and carefully check the settings in the data upload, metadata check, and filtering and normalization page.
Basic Protocol 2
This protocol introduces users to the statistical concepts behind using generalized linear models for differential expression analysis. Linear models are flexible and can be configured to accommodate almost any experimental design; however, the statistical concepts quickly become complex. In this protocol, we try to introduce new concepts in a gradual manner, starting with a simple comparison (steps 3 and 5), followed by accounting for covariates while comparing groups (step 7), then inclusion of continuous variables in addition to discrete variables (step 9), and finally considering interactions between metadata variables (step 12). After each step, the number of DEGs is reported in the protocol text, so that users can compare their results to ensure they are doing the analysis correctly. While this protocol does not cover every possible linear model configuration, it should provide users with a solid foundation to understand the approach, such that they can select the appropriate model configurations in the future.
Basic Protocol 3
Basic Protocol 3 carries on with the same dataset used in Basic Protocols 1 and 2. The objective of this protocol is to demonstrate functional analysis with both the overrepresentation analysis (ORA) and gene set enrichment analysis (GSEA) approaches. Functional analysis is essential for interpreting the potential biological processes that underlie the lists of significant features (for ORA), or the entire ranked genes (for GSEA). ORA is demonstrated with the volcano plot tool (steps 3 to 7), where we show how the analysis can be performed separately for up and downregulated genes, and the enrichment network tool (steps 9 to 11). GSEA is demonstrated with the ridgeline plot tool (steps 13 to 15). Together, these different visual analytics tools and analysis strategies provide complementary perspectives on the functional profiles within a transcriptomics dataset.
Basic Protocol 4
Basic Protocol 4 is the last one that uses the BPA exposure RNA-seq dataset. Here, we perform a more exploratory analysis, allowing unsupervised hierarchical clustering and visual pattern detection to guide targeted functional analysis. First, the concept of hierarchical clustering is explained in detail, and then we show how to identify, select, and interpret groups of genes with interesting patterns (steps 7 to 16). This protocol is the least deterministic, so users should not be too concerned if their results do not exactly match the figures in the text. Instead, we focus on understanding the general approach, and how to report the results in a way that is transparent and reproducible (step 12).
Basic Protocol 5
This protocol is designed to show users how to use the same methods explained in detail in Basic Protocols 1 to 4, but for a dataset from a non-model species that does not have a reference transcriptome. While navigating through the same standard RNA-seq count table analysis, we highlight how the Seq2Fun annotation and functional libraries are included to unlock powerful analytical methods for datasets that were previously very difficult to analyze. There are several new concepts introduced in Basic Protocol 5. We explain the concept of random effects, and how to use them in differential expression analysis (step 6). We also introduce the dimensionality reduction tool, which allows users to perform PCA and view the top three components and their loading scores (steps 12 to 14), and the GSEA heatmap tool (steps 16 to 17).
Basic Protocol 6
This protocol introduces users to the main filtering and normalization approaches for microarray and proteomics data. This protocol is designed to be interchangeable with Basic Protocol 1; once normalization is completed, the same steps in Basic Protocols 2 to 4 can be performed regardless of whether the input data are RNA-seq, microarray, or proteomics. The first half (steps 2 to 9) outlines microarray normalization and the second half (steps 10 to 16) outlines missing value imputation and normalization for proteomics data. We focus on showing the differences across the different normalization methods, to help users choose which one may be the most appropriate for their data. Figures 33 and 35 provide benchmarks for users to compare their results to ensure that they are using the correct methods.
Basic Protocol 7
This protocol extends the concepts related to filtering and normalization of a single table in Basic Protocol 1 and Basic Protocol 6 to situations where there are multiple tables. All the statistical concepts are the same when normalizing each individual table; the purpose of this protocol is mainly to introduce the more sophisticated interface for handling multiple tables. The only new concept is batch effect correction (step 9), which is performed after normalization of each table.
Basic Protocol 8
Basic Protocol 8 picks up where Basic Protocol 7 left off and introduces approaches to compare and integrate differential expression analysis results across multiple datasets. The first step is to perform differential expression analysis separately for each dataset (steps 1 to 2). Then, a range of strategies for integrating the differential expression analysis statistics is presented and a method is chosen (steps 3 to 5). Finally, we introduce two different visual analytics tools for performing an integrative functional analysis of the meta-analysis results. We start with the enrichment network (steps 7 to 10), and end with the upset diagram (steps 11 to 13). Again, many of the statistical concepts are the same as described in previous Basic Protocols, hence the steps in this protocol are mainly focused on showing how to manipulate a more complicated interface that is designed to handle multiple datasets.
Basic Protocol 9
Basic Protocol 9 is short, as uploading gene lists allows us to skip all the filtering, normalization, and differential analysis steps. The new concept introduced here is the adjusted heatmap format that is designed for visually comparing lists of features (steps 4 to 6).
Basic Protocol 10
This protocol is designed to analyze data from a very specific experimental design. Few published datasets meet these requirements unless they were specifically designed for this type of analysis. This analysis is most frequently performed in dose-response studies in toxicology; however, in theory, the same pipeline can be applied to any dataset with multiple replicates collected from groups along a continuous gradient. As the cost of acquiring transcriptomics data decreases, designing studies around this method for time-series or other continuous gradients is becoming feasible for many research groups. Thus, throughout the pipeline, we keep the terminology developed by the toxicology community (e.g., benchmark dose, point-of-departure) to maintain consistency with the literature but highlight how the method is applicable to other contexts wherever possible.
Basic Protocol 11
This protocol will have the most variable steps and duration for different users as it is the only one that depends heavily upon the local computing hardware and operating system. Users with low-end laptop computers may have trouble running the local Docker. We provide time estimates throughout the protocol; however, these will depend on the available RAM and CPU specifications of users’ computers. After guiding the user to install Docker and get the ExpressAnalystSA Docker running (steps 1 to 9), the protocol introduces how to process FASTQ files using two approaches: kallisto for species with a reference genome (steps 10 to 17), and Seq2Fun for species without (steps 18 to 26).
Time Considerations
Each of the protocols takes ∼20 min to complete, other than Basic Protocol 11, which may take 30 to 40 min. Together, it should take ∼3.5 hr to complete all protocols.
Acknowledgments
The authors thank the Canadian Institutes of Health Research (CIHR), the Natural Sciences and Engineering Research Council of Canada (NSERC), and the Canada Research Chairs (CRC) Program for funding support. The authors thank Xia Yang and Graciel Diamante for providing us with the individual measurements of the bodyweight, insulin secretion, and targeted lipids for the BPA mouse dataset (Basic Protocols 1 to 4).
Author Contributions
Jessica Ewald: Conceptualization; data curation; formal analysis; software; validation; visualization; writing original draft; writing review and editing. Guangyan Zhou: Data curation; software; validation; visualization. Yao Lu: Software; validation; visualization. Jianguo (Jeff) Xia: Conceptualization; funding acquisition; project administration; software; supervision; validation; writing original draft; writing review and editing.
Conflict of Interest
The authors declare the following competing interests: J.E., G.Z., and J.X. own shares of OmicSquare Analytics Inc. The remaining authors declare no competing interests.
Open Research
Data Availability Statement
All datasets required to perform the protocols are available as built-in example data throughout the ExpressAnalyst software modules or can be downloaded from the “Tutorials” tab on the ExpressAnalyst website (www.expressanalyst.ca).