DTE analysis with STAR + RSEM input

Load required libraries

library(data.table)
library(DESeq2)
library(apeglm)
library(ggplot2)
library(ggrepel)
library(EnhancedVolcano)

Import input data

Load clinical information and define file path

dir = "../../"
sampleData = paste0(dir, "clinical.txt")
sampleData = fread(sampleData)
rownames(sampleData) = sampleData$ENA_RUN
sampleData$individual = as.factor(sampleData$individual)
sampleData$paris_classification = as.factor(sampleData$paris_classification)

# Use relevel() to set adjacent normal samples as reference
sampleData$paris_classification = relevel(sampleData$paris_classification, ref = "normal")

files = list.files(paste0(dir, "rsem_star"), "*soforms.results$", full.names = T)

Import RSEM result file and keep the 5th column containing the expected_count values. Build a countData data.frame to store counts

Build a DESeqDataSet from countData with DESeqDataSetFromMatrix, providing also the sample information and a design formula

* Import input data (tximport)

Alternatively, tximport can be used to import expected_count from RSEM, as follows

Differential expression analysis

Run DESeq2 analysis using DESeq, which performs (1) estimation of size factors, (2) estimation of dispersion, then (3) Negative Binomial GLM fitting and Wald statistics. The results tables (log2 fold changes and p-values) can be generated using the results function

We set the adjusted p-value cutoff (FDR) to be 0.05, hence we change the default significance cutoff used for optimizing the independent filtering alpha from 0.1 to 0.05

At a FDR of 0.05, we have 2945 genes up-regulated and 2871 genes down-regulated in 0-IIa pre-lesion versus Normal

The results res object contains the follow columns

Read about authors' note on p-values and adjusted p-values set to NA here

Exploring results

MA-plot

We use MA-plot to show the log2 fold changes attributable to a given variable (i.e. pre-lesion) over the mean of normalized counts for all the samples. Points will be colored red if the adjusted p value is less than 0.05. Points which fall out of the window are plotted as open triangles pointing either up or down

MA-plot

Principal component plot of the samples

First, we perform count data transformation with regularized logarithm rlog or variance stabilizing transformations vst. You can read here about which transformation to choose

Perform sample PCA for transformed data using plotPCA, then plot with ggplot

PCA (rlog)
PCA (vst)

Volcano plots

We use EnhancedVolcano to generate volcano plot to visualise the results of differential expression analyses

Volcano plot

Exporting results

We construct a normData data.frame to store per-group normalised mean and normalised counts of all samples, and a deData data.frame to store the DESeq results. We merge both data.frame with a annoData data.frame that contain per-transcript annotation derived from GENCODE GTF file before exporting the results in tabular format

Session info

Last updated

Was this helpful?