DTU analysis using DEXSeq

Load required libraries

library(data.table)
library(GenomicFeatures)
library(tximport)
library(DEXSeq)
library(stageR)
library(reshape2)
library(ggplot2)
library(ggbeeswarm)

Import input data

Load clinical information and create a simple sample data.frame samps

dir = "../../"
sampleData = paste0(dir, "clinical.txt")
sampleData = fread(sampleData)
rownames(sampleData) = sampleData$ENA_RUN
sampleData$individual = as.factor(sampleData$individual)
sampleData$paris_classification = as.factor(sampleData$paris_classification)

# Use relevel() to set adjacent normal samples as reference
sampleData$paris_classification = relevel(sampleData$paris_classification, ref = "normal")
samps = data.frame(sample_id = sampleData$ENA_RUN, group = sampleData$paris_classification)

Load TxDB and construct a two column data.frame txdf with the transcript and gene identifiers

Define file path and import Salmon NumReads values with tximport which contains the transcript-level counts (txIn = TRUE), abundances and average transcript lengths, then generate transcript-level estimated counts (txOut = TRUE) using the dtuScaledTPM method

Buld a data.frame cts containing the estimated counts and remove rows that has zero counts across all samples

The experiments have between 17.6 and 36.2 million paired-end reads that were mapped to the transcriptome using Salmon

Pre-filter with DRIMSeq dmFilter

We will make use of the dmFilter in DRIMSeq to remove genes which had expression too low to have very much statistical power for detecting DTU, and transcripts which were very lowly expressed in both conditions, and so not contributing useful information for DTU

Create a data.frame counts containing the gene_id (Gene ID) and feature_id (Transcript ID) information from subsetted txdf and estimated counts from cts

Then create a dmDSdata object d

We will filter the object using dmFilter before running DEXSeq workflow

DEXSeq procedure

Create a count table from the filtered d and build a DEXSeqDataSet from countData, providing the sample information, a design formula, transcript ID and gene ID

Performs DEXSeq analysis using the following 3 functions: (1) estimation of size factors, (2) estimation of dispersion, then (3) perform a likelihood ratio test for differential exon usage (in this case, transcripts). The results tables (log2 fold changes and p-values) can be generated using the DEXSeqResults function

Compute per-gene adjusted p-value using perGeneQValue, which aggregates evidence from multiple tests within a gene to a single p-value for the gene and then corrects for multiple testing across genes. Construct gene-level and transcript-level statistics

DEXSeq test identified 92 genes showing evidence of isoform switching involving 186 transcripts

stageR procedure

Set up a function to strip away the version numbers in the Ensembl gene and transcript IDs

Construct a vector of per-gene p-values for the screening stage. Per-gene adjusted p-value qval is used here

Construct a one column matrix of the per-transcript confirmation p-values

The dxr.t is used twice to construct a 4-column data.frame that contain both original IDs and IDs without version numbers

Construct a stageRTx object and perform the stageR analysis. We specify alpha to be 0.05

The data.frame dex.padj summarizes the adjusted p-values from the two-stage analysis. Only genes that passed the filter are included in the table. There are 92 screened genes in this dataset, and 149 transcripts pass the confirmation stage on a target 5% overall false discovery rate (OFDR). This means that, in expectation, no more than 5% of the genes that pass screening will either (1) not contain any DTU, so be falsely screened genes, or (2) contain a falsely confirmed transcript.

Exporting results

We construct a dexData data.frame to store per-group normalised mean and normalised counts of all samples, and a dexDTU data.frame to store the DEXSeq+stageR results. We merge both data.frame with a annoData data.frame that contain per-transcript annotation derived from GENCODE GTF file before exporting the results in tabular format

Write results to files

Exploring results

We will use the plotExpression to plot the transcript expression of a given gene. In the example below, we select a gene from the data.frame dex.padj that has smallest transcript and gene adjusted p-values, and with data from dex.norm

Expression switches among two isoforms of ENSG00000167460.17

Session info

Last updated

Was this helpful?