dir ="../../"sampleData =paste0(dir, "clinical.txt")sampleData =fread(sampleData)rownames(sampleData) = sampleData$ENA_RUNsampleData$individual =as.factor(sampleData$individual)sampleData$paris_classification =as.factor(sampleData$paris_classification)# Use relevel() to set adjacent normal samples as referencesampleData$paris_classification =relevel(sampleData$paris_classification, ref ="normal")files =file.path(paste0(dir, "salmon"), list.files(paste0(dir, "salmon")), "quant.sf")names(files) = list.files(paste0(dir, "salmon"))
Load TxDB
Load TxDB and construct a two column data.frame tx2gene with the transcript and gene identifiers
Import Salmon NumReads values with tximport which contains the transcript-level counts (txIn = TRUE), abundances and average transcript lengths, then output gene-level summarization (txOut = FALSE)
Build a DESeqDataSet from txi, providing also the sample information and a design formula. The DESeqDataSetFromTximport function will take the txi$counts values and use the round function to round them to nearest integer as documented here. The design indicates how to model the samples, here, that we want to measure the difference in expression between pre-lesion and adjacent normal tissue in a paired-sample design
Run DESeq2 analysis using DESeq, which performs (1) estimation of size factors, (2) estimation of dispersion, then (3) Negative Binomial GLM fitting and Wald statistics. The results tables (log2 fold changes and p-values) can be generated using the results function
We set the adjusted p-value cutoff (FDR) to be 0.05, hence we change the default significance cutoff used for optimizing the independent filtering alpha from 0.1 to 0.05
res <-results(dds, name ="paris_classification_0.IIa_vs_normal", alpha =0.05)
At a FDR of 0.05, we have 3274 genes up-regulated and 3342 genes down-regulated in 0-IIa pre-lesion versus Normal
>summary(res)out of 37313 with nonzero total read countadjusted p-value <0.05LFC >0 (up) :3274, 8.8%LFC <0 (down) :3342, 9%outliers [1] :0, 0%low counts [2] :15400, 41%(mean count <2)[1] see 'cooksCutoff' argument of ?results[2] see 'independentFiltering' argument of ?results
The results res object contains the follow columns
>mcols(res)$description[1] "mean of normalized counts for all samples"[2] "log2 fold change (MLE): paris classification 0.IIa vs normal"[3] "standard error: paris classification 0.IIa vs normal"[4] "Wald statistic: paris classification 0.IIa vs normal"[5] "Wald test p-value: paris classification 0.IIa vs normal"[6] "BH adjusted p-values">head(res)log2 fold change (MLE): paris classification 0.IIa vs normal Wald test p-value: paris classification 0.IIa vs normal DataFrame with 6 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj<numeric><numeric><numeric><numeric><numeric><numeric>ENSG00000000003.151546.35486175611-0.5813116195780960.174441213818185-3.332421317498870.0008609381121569930.00453612331129988ENSG00000000005.633.9170595345844-0.1273958016684890.36047271289215-0.3534131630834560.7237787190104410.832554491846499ENSG00000000419.12810.2693650792780.03303534449731760.1640820740112460.2013342694281970.8404372080282870.909277206454224ENSG00000000457.14433.817917089428-0.2065472720359160.12218101991675-1.690502110529530.09093193482336350.201842735796633ENSG00000000460.17134.7761009840460.1542471800400750.2212783707917850.6970730102908070.4857571038252570.65367203488841ENSG00000000938.1353.64317446420570.08913425305642190.2880150188552730.3094777953270960.7569581004409970.855339071823214
Read about authors' note on p-values and adjusted p-values set to NA here
Exploring results
MA-plot
We use MA-plot to show the log2 fold changes attributable to a given variable (i.e. pre-lesion) over the mean of normalized counts for all the samples. Points will be colored red if the adjusted p value is less than 0.05. Points which fall out of the window are plotted as open triangles pointing either up or down
resLFC =lfcShrink(dds, coef ="paris_classification_0.IIa_vs_normal", type="apeglm")png("DGE_MA-plot.Salmon.png", width=7, height=5, units ="in", res =300)plotMA(resLFC, alpha =0.05, ylim=c(-6,6), main ="MA-plot for the shrunken log2 fold changes")dev.off()
Principal component plot of the samples
First, we perform count data transformation with regularized logarithm rlog or variance stabilizing transformations vst. You can read here about which transformation to choose
rld =rlog(dds)vsd =vst(dds)
Perform sample PCA for transformed data using plotPCA, then plot with ggplot
We construct a normData data.frame to store per-group normalised mean and normalised counts of all samples, and a deData data.frame to store the DESeq results. We merge both data.frame with a annoData data.frame that contain per-gene annotation derived from GENCODE GTF file before exporting the results in tabular format