DTU analysis using DRIMSeq
Load required libraries
library(data.table)
library(GenomicFeatures)
library(tximport)
library(DRIMSeq)
library(stageR)
library(dplyr)
library(reshape2)
library(ggplot2)
library(ggbeeswarm)Import input data
Load clinical information and create a simple sample data.frame samps
dir = "../../"
sampleData = paste0(dir, "clinical.txt")
sampleData = fread(sampleData)
rownames(sampleData) = sampleData$ENA_RUN
sampleData$individual = as.factor(sampleData$individual)
sampleData$paris_classification = as.factor(sampleData$paris_classification)
# Use relevel() to set adjacent normal samples as reference
sampleData$paris_classification = relevel(sampleData$paris_classification, ref = "normal")
samps = data.frame(sample_id = sampleData$ENA_RUN, group = sampleData$paris_classification)Load TxDB and construct a two column data.frame txdf with the transcript and gene identifiers
Define file path and import Salmon NumReads values with tximport which contains the transcript-level counts (txIn = TRUE), abundances and average transcript lengths, then generate transcript-level estimated counts (txOut = TRUE) using the dtuScaledTPM method
Buld a data.frame cts containing the estimated counts and remove rows that has zero counts across all samples
The experiments have between 17.6 and 36.2 million paired-end reads that were mapped to the transcriptome using Salmon
DRIMSeq procedure
Create a data.frame counts containing the gene_id (Gene ID) and feature_id (Transcript ID) information from subsetted txdf and estimated counts from cts
Then create a dmDSdata object d
We will filter the object using dmFilter before running procedures to estimate model parameters. This speeds up the fitting and removes transcripts that may be troublesome for parameter estimation. We first define n to be the total number of samples, and n.small to be the sample size of the smallest group (in this dataset, both groups have 10 samples). All three of the possible filters are implemented.
A transcript has a count of at least 10 in at least
n.smallsamplesA transcript has a relative abundance proportion of at least 0.1 in at least
n.smallsamplesThe total count of the corresponding gene is at least 10 in all
n samples
The number of genes in d reduced from 37333 to 7467. After filtering, the object d only contains genes that have more than one isoform, which makes sense as we are testing for differential transcript usage. We can find out how many of the remaining genes have N isoforms by tabulating the number of times we see a gene ID, then tabulating the output again
Create a design matrix, using a design formula and sample information from samps. We then use the following three functions to estimate the model parameters and test for DTU.
dmPrecision: estimate the precision (higher dispersion is associated with lower precision)
dmFit: fit regression coefficients
dmTest: perform null hypothesis testing on the coefficient of interest.
Here, we test the coefficient associated with the difference between group 0-IIa and normal
The system.time was used to time the analysis and it takes about 15 minutes
Use results function to build test result tables. The res.g contains a single p-value per gene, which tests whether there is any differential transcript usage within the gene, and res.t contains a single p-value per transcript, which tests whether the proportions for this transcript changed within the gene
Use is.na to check if the pvalue column contain NA values, in this case, there are none
DRIMSeq test identified 125 genes showing evidence of isoform switching involving 199 transcripts
Post-hoc filtering on the standard deviation in proportions
This post-hoc filtering step is not part of the DRIMSeq modeling steps but suggested in Michael Love's DTU workflow to improved the false discovery rate (FDR) and overall false discovery rate (OFDR) control
The smallProportionSD set the p-values and adjusted p-values for transcripts with small per-sample proportion SD to 1
Of the 19179 transcripts, 6504 have small per-sample proportion
There are 199 transcripts found to be involved in switching before filtering and 193 after filtering
stageR procedure
Set up a function to strip away the version numbers in the Ensembl gene and transcript IDs
Construct a vector of per-gene p-values for the screening stage
Construct a one column matrix of the per-transcript confirmation p-values
The res.t is used twice to construct a 4-column data.frame that contain both original IDs and IDs without version numbers
Construct a stageRTx object and perform the stageR analysis. We specify alpha to be 0.05
The data.frame drim.padj summarizes the adjusted p-values from the two-stage analysis. Only genes that passed the filter are included in the table. There are 125 screened genes in this dataset, and 187 transcripts pass the confirmation stage on a target 5% overall false discovery rate (OFDR). This means that, in expectation, no more than 5% of the genes that pass screening will either (1) not contain any DTU, so be falsely screened genes, or (2) contain a falsely confirmed transcript.
Exporting results
We construct a drimData data.frame to store per-group average proportion and per-sample proportion of transcripts, and a drimDTU data.frame to store the DRIMSeq+stageR results. We merge both data.frame with a annoData data.frame that contain per-transcript annotation derived from GENCODE GTF file before exporting the results in tabular format
Write results to files
Exploring results
We can use the DRIMSeq's plotProportions to plot the transcript proportions of a given gene. In the example below, we select a gene from the data.frame drim.padj that has smallest transcript and gene adjusted p-values

We can also use ggplot2 to plot similar figure with data from drim.prop

Session info
Last updated
Was this helpful?