# DGE analysis with STAR input

## Load required libraries

```r
library(data.table)
library(DESeq2)
library(apeglm)
library(ggplot2)
library(ggrepel)
library(EnhancedVolcano)
```

## Import input data

Load clinical information and define file path

```r
dir = "../../"
sampleData = paste0(dir, "clinical.txt")
sampleData = fread(sampleData)
rownames(sampleData) = sampleData$ENA_RUN
sampleData$individual = as.factor(sampleData$individual)
sampleData$paris_classification = as.factor(sampleData$paris_classification)

# Use relevel() to set adjacent normal samples as reference
sampleData$paris_classification = relevel(sampleData$paris_classification, ref = "normal")

files = list.files(paste0(dir, "star"), "*ReadsPerGene.out.tab$", full.names = T)
```

```r
> files
 [1] "../../star/ERR2675454_ReadsPerGene.out.tab"
 [2] "../../star/ERR2675455_ReadsPerGene.out.tab"
 [3] "../../star/ERR2675458_ReadsPerGene.out.tab"
 [4] "../../star/ERR2675459_ReadsPerGene.out.tab"
 [5] "../../star/ERR2675460_ReadsPerGene.out.tab"
 [6] "../../star/ERR2675461_ReadsPerGene.out.tab"
 [7] "../../star/ERR2675464_ReadsPerGene.out.tab"
 [8] "../../star/ERR2675465_ReadsPerGene.out.tab"
 [9] "../../star/ERR2675468_ReadsPerGene.out.tab"
[10] "../../star/ERR2675469_ReadsPerGene.out.tab"
[11] "../../star/ERR2675472_ReadsPerGene.out.tab"
[12] "../../star/ERR2675473_ReadsPerGene.out.tab"
[13] "../../star/ERR2675476_ReadsPerGene.out.tab"
[14] "../../star/ERR2675477_ReadsPerGene.out.tab"
[15] "../../star/ERR2675478_ReadsPerGene.out.tab"
[16] "../../star/ERR2675479_ReadsPerGene.out.tab"
[17] "../../star/ERR2675480_ReadsPerGene.out.tab"
[18] "../../star/ERR2675481_ReadsPerGene.out.tab"
[19] "../../star/ERR2675484_ReadsPerGene.out.tab"
[20] "../../star/ERR2675485_ReadsPerGene.out.tab"
```

Import STAR count file and keep the 4th column containing the reverse stranded counts. Build a `countData` data.frame to store counts

```r
countData = data.frame(fread(files[1]))[c(1,4)]

# Loop and read the 4th column remaining files
for(i in 2:length(files)) {
        countData = cbind(countData, data.frame(fread(files[i]))[4])
}

# Skip first 4 lines, count data starts on the 5th line
countData = countData[c(5:nrow(countData)),]
colnames(countData) = c("GeneID", gsub(paste0(dir,"star/"), "", files))
colnames(countData) = gsub("_ReadsPerGene.out.tab", "", colnames(countData))
rownames(countData) = countData$GeneID

countData = countData[,c(2:ncol(countData))]
```

```r
> countData[1:10,1:5]
                  ERR2675454 ERR2675455 ERR2675458 ERR2675459 ERR2675460
ENSG00000223972.5          0          0          0          0          0
ENSG00000227232.5         32         33         13         70         97
ENSG00000278267.1          0          0          2          3          4
ENSG00000243485.5          0          0          0          0          0
ENSG00000284332.1          0          0          0          0          0
ENSG00000237613.2          0          0          0          0          0
ENSG00000268020.3          0          0          0          0          0
ENSG00000240361.2          0          0          0          0          0
ENSG00000186092.6          0          0          0          0          0
ENSG00000238009.6          0          2          0          0          9
```

Build a `DESeqDataSet` from `countData` with `DESeqDataSetFromMatrix`, providing also the sample information and a design formula

```r
dds = DESeqDataSetFromMatrix(countData = countData, 
colData = sampleData, design = ~ individual + paris_classification)
```

```r
> dds
class: DESeqDataSet 
dim: 60662 20 
metadata(1): version
assays(1): counts
rownames(60662): ENSG00000223972.5 ENSG00000227232.5 ... ENSG00000210195.2 ENSG00000210196.2
rowData names(0):
colnames(20): ERR2675454 ERR2675455 ... ERR2675484 ERR2675485
colData names(11): ENA_RUN individual ... braf_status kras_status
```

## Differential expression analysis

Run DESeq2 analysis using `DESeq`, which performs (1) estimation of size factors, (2) estimation of dispersion, then (3) Negative Binomial GLM fitting and Wald statistics. The results tables (log2 fold changes and p-values) can be generated using the `results` function

```r
dds = DESeq(dds)
```

```r
> cbind(resultsNames(dds))
      [,1]                                  
 [1,] "Intercept"                           
 [2,] "individual_S11_vs_S1"                
 [3,] "individual_S12_vs_S1"                
 [4,] "individual_S15_vs_S1"                
 [5,] "individual_S17_vs_S1"                
 [6,] "individual_S3_vs_S1"                 
 [7,] "individual_S5_vs_S1"                 
 [8,] "individual_S6_vs_S1"                 
 [9,] "individual_S7_vs_S1"                 
[10,] "individual_S9_vs_S1"                 
[11,] "paris_classification_0.IIa_vs_normal"
```

We set the adjusted p-value cutoff (FDR) to be **0.05**, hence we change the default significance cutoff used for optimizing the independent filtering `alpha` from 0.1 to 0.05

```r
res <- results(dds, name = "paris_classification_0.IIa_vs_normal", alpha = 0.05)
```

At a FDR of 0.05, we have 3129 genes up-regulated and 2916 genes down-regulated in 0-IIa pre-lesion versus Normal

```r
> summary(res)

out of 39903 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 3129, 7.8%
LFC < 0 (down)     : 2916, 7.3%
outliers [1]       : 0, 0%
low counts [2]     : 15801, 40%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
```

The results `res` object contains the follow columns

```r
> mcols(res)$description
[1] "mean of normalized counts for all samples"                   
[2] "log2 fold change (MLE): paris classification 0.IIa vs normal"
[3] "standard error: paris classification 0.IIa vs normal"        
[4] "Wald statistic: paris classification 0.IIa vs normal"        
[5] "Wald test p-value: paris classification 0.IIa vs normal"     
[6] "BH adjusted p-values"                                        

> head(res)
log2 fold change (MLE): paris classification 0.IIa vs normal 
Wald test p-value: paris classification 0.IIa vs normal 
DataFrame with 6 rows and 6 columns
                           baseMean      log2FoldChange             lfcSE                stat            pvalue              padj
                          <numeric>           <numeric>         <numeric>           <numeric>         <numeric>         <numeric>
ENSG00000223972.5 0.196142797581091 -0.0510133328786562  3.01727024534903 -0.0169071142889142 0.986510717197277                NA
ENSG00000227232.5  65.2360917381934  -0.148867737983724 0.348724795003796  -0.426891749931643  0.66945817485052 0.816108488708069
ENSG00000278267.1  2.73563541954875   0.668374526475476  0.80813859554197     0.8270543322179 0.408206266889902 0.618298738745137
ENSG00000243485.5                 0                  NA                NA                  NA                NA                NA
ENSG00000284332.1                 0                  NA                NA                  NA                NA                NA
ENSG00000237613.2                 0                  NA                NA                  NA                NA                NA
```

Read about authors' note on p-values and adjusted p-values set to NA [here](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#more-information-on-results-columns)

## Exploring results

#### MA-plot

We use MA-plot to show the log2 fold changes attributable to a given variable (i.e. pre-lesion) over the mean of normalized counts for all the samples. Points will be colored red if the adjusted p value is less than 0.05. Points which fall out of the window are plotted as open triangles pointing either up or down

```r
resLFC = lfcShrink(dds, coef = "paris_classification_0.IIa_vs_normal", 
type="apeglm")

png("DGE_MA-plot.STAR.png", width=7, height=5, units = "in", res = 300)
plotMA(resLFC, alpha = 0.05, ylim=c(-6,6), 
main = "MA-plot for the shrunken log2 fold changes")
dev.off()
```

![MA-plot](https://808500276-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M5W-32iOTUXPKbX19Gv%2F-M6YkphnXRAfrMlQKSTA%2F-M6_JaW17a7qWvQ2PAfM%2FDGE_MA-plot.STAR.png?alt=media\&token=04501b18-3fca-45c9-8f3e-1d24c0b051d0)

#### Principal component plot of the samples

First, we perform count data transformation with regularized logarithm `rlog` or variance stabilizing transformations `vst`. You can read [here](http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#the-variance-stabilizing-transformation-and-the-rlog) about which transformation to choose

```r
rld = rlog(dds)
vsd = vst(dds)
```

Perform sample PCA for transformed data using `plotPCA`, then plot with `ggplot`

```r
# rlog
pcaData = plotPCA(rld, intgroup=c("individual","paris_classification"), 
returnData=TRUE)
percentVar = round(100 * attr(pcaData, "percentVar"))

png("DGE_PCA-rlog.STAR.png", width=7, height=7, units = "in", res = 300)
ggplot(pcaData, aes(PC1, PC2, colour = paris_classification)) + 
geom_point(size = 2) + theme_bw() + 
scale_color_manual(values = c("blue", "red")) +
geom_text_repel(aes(label = individual), nudge_x = -1, nudge_y = 0.2, size = 3) +
ggtitle("Principal Component Analysis (PCA)", subtitle = "rlog transformation") +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance"))
dev.off()
```

![PCA plot (rlog)](https://808500276-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M5W-32iOTUXPKbX19Gv%2F-M6YkphnXRAfrMlQKSTA%2F-M6_JbkKvAyC7_pdVsF9%2FDGE_PCA-rlog.STAR.png?alt=media\&token=e364eca2-c6e8-434b-8179-4939ec9ace7c)

```r
# vst
pcaData = plotPCA(vsd, intgroup=c("individual","paris_classification"), 
returnData=TRUE)
percentVar = round(100 * attr(pcaData, "percentVar"))

png("DGE_PCA-vst.STAR.png", width=7, height=7, units = "in", res = 300)
ggplot(pcaData, aes(PC1, PC2, colour = paris_classification)) + 
geom_point(size = 2) + theme_bw() + 
scale_color_manual(values = c("blue", "red")) +
geom_text_repel(aes(label = individual), nudge_x = -1, nudge_y = 0.2, size = 3) +
ggtitle("Principal Component Analysis (PCA)", subtitle = "vst transformation") +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance"))
dev.off()
```

![PCA plot (vst)](https://808500276-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M5W-32iOTUXPKbX19Gv%2F-M6YkphnXRAfrMlQKSTA%2F-M6_JbkLoiyDdbvgH904%2FDGE_PCA-vst.STAR.png?alt=media\&token=06ab9cdf-0dbd-445b-b140-b066ae208bed)

#### Volcano plots

We use `EnhancedVolcano` to generate volcano plot to visualise the results of differential expression analyses

```r
pCutoff = 0.05
FCcutoff = 1.0

p = EnhancedVolcano(data.frame(res), lab = NA, x = 'log2FoldChange', y = 'padj',
xlab = bquote(~Log[2]~ 'fold change'), ylab = bquote(~-Log[10]~adjusted~italic(P)),
pCutoff = pCutoff, FCcutoff = FCcutoff, pointSize = 1.0, labSize = 2.0,
title = "Volcano plot", subtitle = "SSA/P vs. Normal",
caption = paste0('log2 FC cutoff: ', FCcutoff, '; p-value cutoff: ', pCutoff, '\nTotal = ', nrow(res), ' variables'),
legend=c('NS','Log2 FC','Adjusted p-value', 'Adjusted p-value & Log2 FC'),
legendPosition = 'bottom', legendLabSize = 14, legendIconSize = 5.0)

png("DGE_VolcanoPlots.STAR.png", width=7, height=7, units = "in", res = 300)
print(p)
dev.off()
```

![Volcano plot](https://808500276-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M5W-32iOTUXPKbX19Gv%2F-M6YkphnXRAfrMlQKSTA%2F-M6_KgkGWz_aRZhTgMqD%2FDGE_VolcanoPlots.STAR.png?alt=media\&token=bbab6692-6fa1-4e94-a15e-6f4497a38f84)

## Exporting results

We construct a `normData` data.frame to store per-group normalised mean and normalised counts of all samples, and a `deData` data.frame to store the DESeq results. We merge both data.frame with a `annoData` data.frame that contain per-gene annotation derived from GENCODE GTF file before exporting the results in tabular format&#x20;

```r
annoData = "/home/USER/db/refanno/gencode.v33.annotation_genes.txt"
annoData = data.frame(fread(annoData))

normCounts = as.data.frame(counts(dds, normalized = TRUE))
baseMeans = as.data.frame(sapply( levels(dds$paris_classification), 
function(lvl) rowMeans( counts(dds, normalized = TRUE)[, dds$paris_classification == lvl, drop = FALSE] ) ))

normData = merge(annoData, merge(baseMeans, normCounts, by.x = 'row.names', by.y = 'row.names'), by.x = 'GeneID', by.y = 'Row.names')
normData = normData[order(normData$Chromosome, normData$Start, normData$End),]

deData = data.frame(res[,c(1,2,5,6)])
colnames(deData) = c("baseMean","log2fc","pvalue","padj")
deData = merge(annoData, deData, by.x = 'GeneID', by.y = 'row.names')
deData = deData[order(deData$Chromosome, deData$Start, deData$End),]

write.table(normData, file="DGE_DESeq2_Means_and_NormalisedCount.STAR.txt", 
sep = "\t", quote = F, row.names = F, col.names = T)
write.table(deData, file="DGE_DESeq2_DE_results.STAR.txt", 
sep = "\t", quote = F, row.names = F, col.names = T)
```

```r
> normData[10000:10005,1:12]
                  GeneID GeneSymbol Chromosome    Start      End                Class Strand Length      normal        0-IIa ERR2675454 ERR2675455
9184  ENSG00000149257.14   SERPINH1      chr11 75562056 75572783       protein_coding      +  10727 804.9909532 850.91390781 571.034788  817.82929
42777  ENSG00000255326.1 AP001922.5      chr11 75583196 75594665               lncRNA      +  11469   0.3676455   1.86683323   0.000000    0.00000
13087 ENSG00000171533.11       MAP6      chr11 75586918 75669120       protein_coding      -  82202  11.6539986  11.06950146   3.294431   18.99825
42876  ENSG00000255434.1 AP001922.6      chr11 75596144 75597270               lncRNA      -   1126   0.0000000   0.07527231   0.000000    0.00000
42172  ENSG00000254630.1 AP001922.2      chr11 75635883 75638587               lncRNA      -   2704   0.0000000   0.00000000   0.000000    0.00000
42735  ENSG00000255280.1 AP001922.4      chr11 75642600 75642889 processed_pseudogene      +    289   0.0000000   0.00000000   0.000000    0.00000

> deData[10000:10005,]
                  GeneID GeneSymbol Chromosome    Start      End                Class Strand Length     baseMean       log2fc    pvalue      padj
9184  ENSG00000149257.14   SERPINH1      chr11 75562056 75572783       protein_coding      +  10727 827.95243051  0.081912786 0.6410465 0.7998397
42777  ENSG00000255326.1 AP001922.5      chr11 75583196 75594665               lncRNA      +  11469   1.11723936  1.894908372 0.1279743        NA
13087 ENSG00000171533.11       MAP6      chr11 75586918 75669120       protein_coding      -  82202  11.36175003 -0.004785778 0.9900707 0.9957721
42876  ENSG00000255434.1 AP001922.6      chr11 75596144 75597270               lncRNA      -   1126   0.03763615  0.137222714 0.9638312        NA
42172  ENSG00000254630.1 AP001922.2      chr11 75635883 75638587               lncRNA      -   2704   0.00000000           NA        NA        NA
42735  ENSG00000255280.1 AP001922.4      chr11 75642600 75642889 processed_pseudogene      +    289   0.00000000           NA        NA        NA
```

## Session info

```r
> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS/LAPACK: /home/USER/miniconda3/lib/libopenblasp-r0.3.9.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] EnhancedVolcano_1.4.0       ggrepel_0.8.2              
 [3] ggplot2_3.3.0               apeglm_1.8.0               
 [5] DESeq2_1.26.0               SummarizedExperiment_1.16.1
 [7] DelayedArray_0.12.3         BiocParallel_1.20.1        
 [9] matrixStats_0.56.0          Biobase_2.46.0             
[11] GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        
[13] IRanges_2.20.2              S4Vectors_0.24.4           
[15] BiocGenerics_0.32.0         data.table_1.12.8          

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.6.2          Formula_1.2-3         
 [4] assertthat_0.2.1       latticeExtra_0.6-29    blob_1.2.1            
 [7] GenomeInfoDbData_1.2.2 numDeriv_2016.8-1.1    pillar_1.4.4          
[10] RSQLite_2.2.0          backports_1.1.6        lattice_0.20-41       
[13] glue_1.4.0             bbmle_1.0.23.1         digest_0.6.25         
[16] RColorBrewer_1.1-2     XVector_0.26.0         checkmate_2.0.0       
[19] colorspace_1.4-1       plyr_1.8.6             htmltools_0.4.0       
[22] Matrix_1.2-18          XML_3.99-0.3           pkgconfig_2.0.3       
[25] emdbook_1.3.12         genefilter_1.68.0      zlibbioc_1.32.0       
[28] mvtnorm_1.1-0          purrr_0.3.4            xtable_1.8-4          
[31] scales_1.1.0           jpeg_0.1-8.1           htmlTable_1.13.3      
[34] tibble_3.0.1           annotate_1.64.0        farver_2.0.3          
[37] ellipsis_0.3.0         withr_2.2.0            nnet_7.3-14           
[40] survival_3.1-12        magrittr_1.5           crayon_1.3.4          
[43] memoise_1.1.0          MASS_7.3-51.6          foreign_0.8-76        
[46] tools_3.6.2            lifecycle_0.2.0        stringr_1.4.0         
[49] locfit_1.5-9.4         munsell_0.5.0          cluster_2.1.0         
[52] AnnotationDbi_1.48.0   compiler_3.6.2         rlang_0.4.6           
[55] grid_3.6.2             RCurl_1.98-1.2         rstudioapi_0.11       
[58] htmlwidgets_1.5.1      labeling_0.3           bitops_1.0-6          
[61] base64enc_0.1-3        gtable_0.3.0           DBI_1.1.0             
[64] R6_2.4.1               gridExtra_2.3          bdsmatrix_1.3-4       
[67] knitr_1.28             dplyr_0.8.5            bit_1.1-15.2          
[70] Hmisc_4.4-0            stringi_1.4.6          Rcpp_1.0.4.6          
[73] geneplotter_1.64.0     vctrs_0.2.4            rpart_4.1-15          
[76] acepack_1.4.1          png_0.1-7              coda_0.19-3           
[79] tidyselect_1.0.0       xfun_0.13             
```
