DGE analysis with STAR input
Load required libraries
library(data.table)
library(DESeq2)
library(apeglm)
library(ggplot2)
library(ggrepel)
library(EnhancedVolcano)Import input data
Load clinical information and define file path
dir = "../../"
sampleData = paste0(dir, "clinical.txt")
sampleData = fread(sampleData)
rownames(sampleData) = sampleData$ENA_RUN
sampleData$individual = as.factor(sampleData$individual)
sampleData$paris_classification = as.factor(sampleData$paris_classification)
# Use relevel() to set adjacent normal samples as reference
sampleData$paris_classification = relevel(sampleData$paris_classification, ref = "normal")
files = list.files(paste0(dir, "star"), "*ReadsPerGene.out.tab$", full.names = T)> files
[1] "../../star/ERR2675454_ReadsPerGene.out.tab"
[2] "../../star/ERR2675455_ReadsPerGene.out.tab"
[3] "../../star/ERR2675458_ReadsPerGene.out.tab"
[4] "../../star/ERR2675459_ReadsPerGene.out.tab"
[5] "../../star/ERR2675460_ReadsPerGene.out.tab"
[6] "../../star/ERR2675461_ReadsPerGene.out.tab"
[7] "../../star/ERR2675464_ReadsPerGene.out.tab"
[8] "../../star/ERR2675465_ReadsPerGene.out.tab"
[9] "../../star/ERR2675468_ReadsPerGene.out.tab"
[10] "../../star/ERR2675469_ReadsPerGene.out.tab"
[11] "../../star/ERR2675472_ReadsPerGene.out.tab"
[12] "../../star/ERR2675473_ReadsPerGene.out.tab"
[13] "../../star/ERR2675476_ReadsPerGene.out.tab"
[14] "../../star/ERR2675477_ReadsPerGene.out.tab"
[15] "../../star/ERR2675478_ReadsPerGene.out.tab"
[16] "../../star/ERR2675479_ReadsPerGene.out.tab"
[17] "../../star/ERR2675480_ReadsPerGene.out.tab"
[18] "../../star/ERR2675481_ReadsPerGene.out.tab"
[19] "../../star/ERR2675484_ReadsPerGene.out.tab"
[20] "../../star/ERR2675485_ReadsPerGene.out.tab"Import STAR count file and keep the 4th column containing the reverse stranded counts. Build a countData data.frame to store counts
countData = data.frame(fread(files[1]))[c(1,4)]
# Loop and read the 4th column remaining files
for(i in 2:length(files)) {
countData = cbind(countData, data.frame(fread(files[i]))[4])
}
# Skip first 4 lines, count data starts on the 5th line
countData = countData[c(5:nrow(countData)),]
colnames(countData) = c("GeneID", gsub(paste0(dir,"star/"), "", files))
colnames(countData) = gsub("_ReadsPerGene.out.tab", "", colnames(countData))
rownames(countData) = countData$GeneID
countData = countData[,c(2:ncol(countData))]> countData[1:10,1:5]
ERR2675454 ERR2675455 ERR2675458 ERR2675459 ERR2675460
ENSG00000223972.5 0 0 0 0 0
ENSG00000227232.5 32 33 13 70 97
ENSG00000278267.1 0 0 2 3 4
ENSG00000243485.5 0 0 0 0 0
ENSG00000284332.1 0 0 0 0 0
ENSG00000237613.2 0 0 0 0 0
ENSG00000268020.3 0 0 0 0 0
ENSG00000240361.2 0 0 0 0 0
ENSG00000186092.6 0 0 0 0 0
ENSG00000238009.6 0 2 0 0 9Build a DESeqDataSet from countData with DESeqDataSetFromMatrix, providing also the sample information and a design formula
dds = DESeqDataSetFromMatrix(countData = countData,
colData = sampleData, design = ~ individual + paris_classification)> dds
class: DESeqDataSet
dim: 60662 20
metadata(1): version
assays(1): counts
rownames(60662): ENSG00000223972.5 ENSG00000227232.5 ... ENSG00000210195.2 ENSG00000210196.2
rowData names(0):
colnames(20): ERR2675454 ERR2675455 ... ERR2675484 ERR2675485
colData names(11): ENA_RUN individual ... braf_status kras_statusDifferential expression analysis
Run DESeq2 analysis using DESeq, which performs (1) estimation of size factors, (2) estimation of dispersion, then (3) Negative Binomial GLM fitting and Wald statistics. The results tables (log2 fold changes and p-values) can be generated using the results function
dds = DESeq(dds)> cbind(resultsNames(dds))
[,1]
[1,] "Intercept"
[2,] "individual_S11_vs_S1"
[3,] "individual_S12_vs_S1"
[4,] "individual_S15_vs_S1"
[5,] "individual_S17_vs_S1"
[6,] "individual_S3_vs_S1"
[7,] "individual_S5_vs_S1"
[8,] "individual_S6_vs_S1"
[9,] "individual_S7_vs_S1"
[10,] "individual_S9_vs_S1"
[11,] "paris_classification_0.IIa_vs_normal"We set the adjusted p-value cutoff (FDR) to be 0.05, hence we change the default significance cutoff used for optimizing the independent filtering alpha from 0.1 to 0.05
res <- results(dds, name = "paris_classification_0.IIa_vs_normal", alpha = 0.05)At a FDR of 0.05, we have 3129 genes up-regulated and 2916 genes down-regulated in 0-IIa pre-lesion versus Normal
> summary(res)
out of 39903 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 3129, 7.8%
LFC < 0 (down) : 2916, 7.3%
outliers [1] : 0, 0%
low counts [2] : 15801, 40%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?resultsThe results res object contains the follow columns
> mcols(res)$description
[1] "mean of normalized counts for all samples"
[2] "log2 fold change (MLE): paris classification 0.IIa vs normal"
[3] "standard error: paris classification 0.IIa vs normal"
[4] "Wald statistic: paris classification 0.IIa vs normal"
[5] "Wald test p-value: paris classification 0.IIa vs normal"
[6] "BH adjusted p-values"
> head(res)
log2 fold change (MLE): paris classification 0.IIa vs normal
Wald test p-value: paris classification 0.IIa vs normal
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000223972.5 0.196142797581091 -0.0510133328786562 3.01727024534903 -0.0169071142889142 0.986510717197277 NA
ENSG00000227232.5 65.2360917381934 -0.148867737983724 0.348724795003796 -0.426891749931643 0.66945817485052 0.816108488708069
ENSG00000278267.1 2.73563541954875 0.668374526475476 0.80813859554197 0.8270543322179 0.408206266889902 0.618298738745137
ENSG00000243485.5 0 NA NA NA NA NA
ENSG00000284332.1 0 NA NA NA NA NA
ENSG00000237613.2 0 NA NA NA NA NARead about authors' note on p-values and adjusted p-values set to NA here
Exploring results
MA-plot
We use MA-plot to show the log2 fold changes attributable to a given variable (i.e. pre-lesion) over the mean of normalized counts for all the samples. Points will be colored red if the adjusted p value is less than 0.05. Points which fall out of the window are plotted as open triangles pointing either up or down
resLFC = lfcShrink(dds, coef = "paris_classification_0.IIa_vs_normal",
type="apeglm")
png("DGE_MA-plot.STAR.png", width=7, height=5, units = "in", res = 300)
plotMA(resLFC, alpha = 0.05, ylim=c(-6,6),
main = "MA-plot for the shrunken log2 fold changes")
dev.off()
Principal component plot of the samples
First, we perform count data transformation with regularized logarithm rlog or variance stabilizing transformations vst. You can read here about which transformation to choose
rld = rlog(dds)
vsd = vst(dds)Perform sample PCA for transformed data using plotPCA, then plot with ggplot
# rlog
pcaData = plotPCA(rld, intgroup=c("individual","paris_classification"),
returnData=TRUE)
percentVar = round(100 * attr(pcaData, "percentVar"))
png("DGE_PCA-rlog.STAR.png", width=7, height=7, units = "in", res = 300)
ggplot(pcaData, aes(PC1, PC2, colour = paris_classification)) +
geom_point(size = 2) + theme_bw() +
scale_color_manual(values = c("blue", "red")) +
geom_text_repel(aes(label = individual), nudge_x = -1, nudge_y = 0.2, size = 3) +
ggtitle("Principal Component Analysis (PCA)", subtitle = "rlog transformation") +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance"))
dev.off()
# vst
pcaData = plotPCA(vsd, intgroup=c("individual","paris_classification"),
returnData=TRUE)
percentVar = round(100 * attr(pcaData, "percentVar"))
png("DGE_PCA-vst.STAR.png", width=7, height=7, units = "in", res = 300)
ggplot(pcaData, aes(PC1, PC2, colour = paris_classification)) +
geom_point(size = 2) + theme_bw() +
scale_color_manual(values = c("blue", "red")) +
geom_text_repel(aes(label = individual), nudge_x = -1, nudge_y = 0.2, size = 3) +
ggtitle("Principal Component Analysis (PCA)", subtitle = "vst transformation") +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance"))
dev.off()
Volcano plots
We use EnhancedVolcano to generate volcano plot to visualise the results of differential expression analyses
pCutoff = 0.05
FCcutoff = 1.0
p = EnhancedVolcano(data.frame(res), lab = NA, x = 'log2FoldChange', y = 'padj',
xlab = bquote(~Log[2]~ 'fold change'), ylab = bquote(~-Log[10]~adjusted~italic(P)),
pCutoff = pCutoff, FCcutoff = FCcutoff, pointSize = 1.0, labSize = 2.0,
title = "Volcano plot", subtitle = "SSA/P vs. Normal",
caption = paste0('log2 FC cutoff: ', FCcutoff, '; p-value cutoff: ', pCutoff, '\nTotal = ', nrow(res), ' variables'),
legend=c('NS','Log2 FC','Adjusted p-value', 'Adjusted p-value & Log2 FC'),
legendPosition = 'bottom', legendLabSize = 14, legendIconSize = 5.0)
png("DGE_VolcanoPlots.STAR.png", width=7, height=7, units = "in", res = 300)
print(p)
dev.off()
Exporting results
We construct a normData data.frame to store per-group normalised mean and normalised counts of all samples, and a deData data.frame to store the DESeq results. We merge both data.frame with a annoData data.frame that contain per-gene annotation derived from GENCODE GTF file before exporting the results in tabular format
annoData = "/home/USER/db/refanno/gencode.v33.annotation_genes.txt"
annoData = data.frame(fread(annoData))
normCounts = as.data.frame(counts(dds, normalized = TRUE))
baseMeans = as.data.frame(sapply( levels(dds$paris_classification),
function(lvl) rowMeans( counts(dds, normalized = TRUE)[, dds$paris_classification == lvl, drop = FALSE] ) ))
normData = merge(annoData, merge(baseMeans, normCounts, by.x = 'row.names', by.y = 'row.names'), by.x = 'GeneID', by.y = 'Row.names')
normData = normData[order(normData$Chromosome, normData$Start, normData$End),]
deData = data.frame(res[,c(1,2,5,6)])
colnames(deData) = c("baseMean","log2fc","pvalue","padj")
deData = merge(annoData, deData, by.x = 'GeneID', by.y = 'row.names')
deData = deData[order(deData$Chromosome, deData$Start, deData$End),]
write.table(normData, file="DGE_DESeq2_Means_and_NormalisedCount.STAR.txt",
sep = "\t", quote = F, row.names = F, col.names = T)
write.table(deData, file="DGE_DESeq2_DE_results.STAR.txt",
sep = "\t", quote = F, row.names = F, col.names = T)> normData[10000:10005,1:12]
GeneID GeneSymbol Chromosome Start End Class Strand Length normal 0-IIa ERR2675454 ERR2675455
9184 ENSG00000149257.14 SERPINH1 chr11 75562056 75572783 protein_coding + 10727 804.9909532 850.91390781 571.034788 817.82929
42777 ENSG00000255326.1 AP001922.5 chr11 75583196 75594665 lncRNA + 11469 0.3676455 1.86683323 0.000000 0.00000
13087 ENSG00000171533.11 MAP6 chr11 75586918 75669120 protein_coding - 82202 11.6539986 11.06950146 3.294431 18.99825
42876 ENSG00000255434.1 AP001922.6 chr11 75596144 75597270 lncRNA - 1126 0.0000000 0.07527231 0.000000 0.00000
42172 ENSG00000254630.1 AP001922.2 chr11 75635883 75638587 lncRNA - 2704 0.0000000 0.00000000 0.000000 0.00000
42735 ENSG00000255280.1 AP001922.4 chr11 75642600 75642889 processed_pseudogene + 289 0.0000000 0.00000000 0.000000 0.00000
> deData[10000:10005,]
GeneID GeneSymbol Chromosome Start End Class Strand Length baseMean log2fc pvalue padj
9184 ENSG00000149257.14 SERPINH1 chr11 75562056 75572783 protein_coding + 10727 827.95243051 0.081912786 0.6410465 0.7998397
42777 ENSG00000255326.1 AP001922.5 chr11 75583196 75594665 lncRNA + 11469 1.11723936 1.894908372 0.1279743 NA
13087 ENSG00000171533.11 MAP6 chr11 75586918 75669120 protein_coding - 82202 11.36175003 -0.004785778 0.9900707 0.9957721
42876 ENSG00000255434.1 AP001922.6 chr11 75596144 75597270 lncRNA - 1126 0.03763615 0.137222714 0.9638312 NA
42172 ENSG00000254630.1 AP001922.2 chr11 75635883 75638587 lncRNA - 2704 0.00000000 NA NA NA
42735 ENSG00000255280.1 AP001922.4 chr11 75642600 75642889 processed_pseudogene + 289 0.00000000 NA NA NASession info
> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS
Matrix products: default
BLAS/LAPACK: /home/USER/miniconda3/lib/libopenblasp-r0.3.9.so
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] EnhancedVolcano_1.4.0 ggrepel_0.8.2
[3] ggplot2_3.3.0 apeglm_1.8.0
[5] DESeq2_1.26.0 SummarizedExperiment_1.16.1
[7] DelayedArray_0.12.3 BiocParallel_1.20.1
[9] matrixStats_0.56.0 Biobase_2.46.0
[11] GenomicRanges_1.38.0 GenomeInfoDb_1.22.1
[13] IRanges_2.20.2 S4Vectors_0.24.4
[15] BiocGenerics_0.32.0 data.table_1.12.8
loaded via a namespace (and not attached):
[1] bit64_0.9-7 splines_3.6.2 Formula_1.2-3
[4] assertthat_0.2.1 latticeExtra_0.6-29 blob_1.2.1
[7] GenomeInfoDbData_1.2.2 numDeriv_2016.8-1.1 pillar_1.4.4
[10] RSQLite_2.2.0 backports_1.1.6 lattice_0.20-41
[13] glue_1.4.0 bbmle_1.0.23.1 digest_0.6.25
[16] RColorBrewer_1.1-2 XVector_0.26.0 checkmate_2.0.0
[19] colorspace_1.4-1 plyr_1.8.6 htmltools_0.4.0
[22] Matrix_1.2-18 XML_3.99-0.3 pkgconfig_2.0.3
[25] emdbook_1.3.12 genefilter_1.68.0 zlibbioc_1.32.0
[28] mvtnorm_1.1-0 purrr_0.3.4 xtable_1.8-4
[31] scales_1.1.0 jpeg_0.1-8.1 htmlTable_1.13.3
[34] tibble_3.0.1 annotate_1.64.0 farver_2.0.3
[37] ellipsis_0.3.0 withr_2.2.0 nnet_7.3-14
[40] survival_3.1-12 magrittr_1.5 crayon_1.3.4
[43] memoise_1.1.0 MASS_7.3-51.6 foreign_0.8-76
[46] tools_3.6.2 lifecycle_0.2.0 stringr_1.4.0
[49] locfit_1.5-9.4 munsell_0.5.0 cluster_2.1.0
[52] AnnotationDbi_1.48.0 compiler_3.6.2 rlang_0.4.6
[55] grid_3.6.2 RCurl_1.98-1.2 rstudioapi_0.11
[58] htmlwidgets_1.5.1 labeling_0.3 bitops_1.0-6
[61] base64enc_0.1-3 gtable_0.3.0 DBI_1.1.0
[64] R6_2.4.1 gridExtra_2.3 bdsmatrix_1.3-4
[67] knitr_1.28 dplyr_0.8.5 bit_1.1-15.2
[70] Hmisc_4.4-0 stringi_1.4.6 Rcpp_1.0.4.6
[73] geneplotter_1.64.0 vctrs_0.2.4 rpart_4.1-15
[76] acepack_1.4.1 png_0.1-7 coda_0.19-3
[79] tidyselect_1.0.0 xfun_0.13 Last updated
Was this helpful?