Load clinical information and create a simple sample data.frame samps
dir ="../../"sampleData =paste0(dir, "clinical.txt")sampleData =fread(sampleData)rownames(sampleData) = sampleData$ENA_RUNsampleData$individual =as.factor(sampleData$individual)sampleData$paris_classification =as.factor(sampleData$paris_classification)# Use relevel() to set adjacent normal samples as referencesampleData$paris_classification =relevel(sampleData$paris_classification, ref ="normal")samps =data.frame(sample_id = sampleData$ENA_RUN, group = sampleData$paris_classification)
Define file path and import Salmon NumReads values with tximport which contains the transcript-level counts (txIn = TRUE), abundances and average transcript lengths, then generate transcript-level estimated counts (txOut = TRUE) using the dtuScaledTPM method
Create a data.frame counts containing the gene_id (Gene ID) and feature_id (Transcript ID) information from subsetted txdf and estimated counts from cts
> dAn object of class dmDSdata with 37333 genes and 20 samples* data accessors:counts(), samples()
We will filter the object using dmFilter before running procedures to estimate model parameters. This speeds up the fitting and removes transcripts that may be troublesome for parameter estimation. We first define n to be the total number of samples, and n.small to be the sample size of the smallest group (in this dataset, both groups have 10 samples). All three of the possible filters are implemented.
A transcript has a count of at least 10 in at least n.small samples
A transcript has a relative abundance proportion of at least 0.1 in at least n.small samples
The total count of the corresponding gene is at least 10 in all n samples
n =nrow(samps)n.small =min(table(samps$group))d =dmFilter(d, min_samps_feature_expr = n.small, min_feature_expr =10, min_samps_feature_prop = n.small, min_feature_prop =0.1, min_samps_gene_expr = n, min_gene_expr =10)
The number of genes in d reduced from 37333 to 7467. After filtering, the object d only contains genes that have more than one isoform, which makes sense as we are testing for differential transcript usage. We can find out how many of the remaining genes have N isoforms by tabulating the number of times we see a gene ID, then tabulating the output again
> dAn object of class dmDSdata with 7467 genes and 20 samples* data accessors:counts(), samples()>table(table(counts(d)$gene_id))234564281228675213711
Create a design matrix, using a design formula and sample information from samps. We then use the following three functions to estimate the model parameters and test for DTU.
dmPrecision: estimate the precision (higher dispersion is associated with lower precision)
dmFit: fit regression coefficients
dmTest: perform null hypothesis testing on the coefficient of interest.
Here, we test the coefficient associated with the difference between group 0-IIa and normal
design =model.matrix(~group, data = samps)set.seed(12345)system.time({ d <-dmPrecision(d, design = design)# Estimate the precision (Higher dispersion is associated with lower precision) d <-dmFit(d, design = design)# Fit regression coefficients d <-dmTest(d, coef ="group0-IIa")# Perform null hypothesis testing on the coefficient of interest})
The system.time was used to time the analysis and it takes about 15 minutes
! Using a subset of 0.1 genes to estimate common precision !! Using common_precision =4.7973 as prec_init !! Using 1.150196 as a shrinkage factor ! user system elapsed 900.0638.770862.508
Use results function to build test result tables. The res.g contains a single p-value per gene, which tests whether there is any differential transcript usage within the gene, and res.t contains a single p-value per transcript, which tests whether the proportions for this transcript changed within the gene
# Single p-value per generes.g = DRIMSeq::results(d)# Single p-value per transcriptres.t = DRIMSeq::results(d, level ="feature")
Use is.na to check if the pvalue column contain NA values, in this case, there are none
Post-hoc filtering on the standard deviation in proportions
This post-hoc filtering step is not part of the DRIMSeq modeling steps but suggested in Michael Love's DTU workflow to improved the false discovery rate (FDR) and overall false discovery rate (OFDR) control
The smallProportionSD set the p-values and adjusted p-values for transcripts with small per-sample proportion SD to 1
smallProportionSD<-function(d, filter =0.1) {# Generate count table cts =as.matrix(subset(counts(d), select =-c(gene_id, feature_id)))# Summarise count total per gene gene.cts =rowsum(cts, counts(d)$gene_id)# Use total count per gene as count per transcript total.cts = gene.cts[match(counts(d)$gene_id, rownames(gene.cts)),]# Calculate proportion of transcript in gene props = cts/total.ctsrownames(props) = rownames(total.cts)# Calculate standard deviation propSD =sqrt(rowVars(props))# Check if standard deviation of per-sample proportions is < 0.1 propSD < filter}filt =smallProportionSD(d)res.t.filt = DRIMSeq::results(d, level ="feature")res.t.filt$pvalue[filt] = 1res.t.filt$adj_pvalue[filt] = 1
Of the 19179 transcripts, 6504 have small per-sample proportion
>table(filt)filtFALSETRUE122906504
There are 199 transcripts found to be involved in switching before filtering and 193 after filtering
The data.frame drim.padj summarizes the adjusted p-values from the two-stage analysis. Only genes that passed the filter are included in the table. There are 125 screened genes in this dataset, and 187 transcripts pass the confirmation stage on a target 5% overall false discovery rate (OFDR). This means that, in expectation, no more than 5% of the genes that pass screening will either (1) not contain any DTU, so be falsely screened genes, or (2) contain a falsely confirmed transcript.
We construct a drimData data.frame to store per-group average proportion and per-sample proportion of transcripts, and a drimDTU data.frame to store the DRIMSeq+stageR results. We merge both data.frame with a annoData data.frame that contain per-transcript annotation derived from GENCODE GTF file before exporting the results in tabular format
annoData ="/home/USER/db/refanno/gencode.v33.annotation_transcripts.txt"annoData =data.frame(fread(annoData))# Create a data.frame containing counts in long-format data with reshape2::meltdrim.prop = reshape2::melt(counts[counts$feature_id %in%proportions(d)$feature_id,], id =c("gene_id", "feature_id"))drim.prop = drim.prop[order(drim.prop$gene_id, drim.prop$variable, drim.prop$feature_id),]# Calculate proportions from countssystem.time({drim.prop = drim.prop %>%group_by(gene_id, variable)%>%mutate(total =sum(value)) %>%group_by(variable, add=TRUE)%>%mutate(prop = value/total)})# Convert the data.frame to wide-format data with reshape2::dcastdrim.prop = reshape2::dcast(drim.prop[,c(1,2,3,6)], gene_id + feature_id ~ variable)
We can use the DRIMSeq's plotProportions to plot the transcript proportions of a given gene. In the example below, we select a gene from the data.frame drim.padj that has smallest transcript and gene adjusted p-values
# Plot the estimated proportions for one of the significant genes, where we can see evidence of switchinggene_id =unique(drim.padj[order(drim.padj$transcript, drim.padj$gene),]$gene_id.1)[1]png("plotProportions.DRIMSeq-stageR.1.png", width=6, height=6, units ="in", res =300)plotProportions(d, gene_id = gene_id, group_variable ="group", plot_type ="boxplot1")dev.off()
We can also use ggplot2 to plot similar figure with data from drim.prop
# We will use this function in both DRIMSeq & DEXSeq workflowsplotExpression<-function(expData =NULL, geneID =NULL, samps =NULL, isProportion =FALSE) {colnames(expData)[1:2] = c("gid","tid") sub =subset(expData, gid == geneID) sub = reshape2::melt(sub, id =c("gid", "tid")) sub =merge(samps, sub, by.x ="sample_id", by.y ="variable")if(!isProportion) { sub$value =log(sub$value) } clrs =c("dodgerblue3", "maroon2", "forestgreen", "darkorange1", "blueviolet", "firebrick2","deepskyblue", "orchid2", "chartreuse3", "gold", "slateblue1", "tomato" , "blue", "magenta", "green3","yellow", "purple3", "red" ,"darkslategray1", "lightpink1", "lightgreen", "khaki1", "plum3", "salmon") p =ggplot(sub, aes(tid, value, color = group, fill = group))+geom_boxplot(alpha =0.4, outlier.shape =NA, width =0.8, lwd =0.5)+stat_summary(fun = mean, geom ="point", color ="black", shape =5, size =3, position=position_dodge(width =0.8))+scale_color_manual(values = clrs)+scale_fill_manual(values = clrs)+geom_quasirandom(color ="black", size =1, dodge.width =0.8)+theme_bw()+ggtitle(geneID)+xlab("Transcripts")if(!isProportion) { p = p +ylab("log(Expression)") } else { p = p +ylab("Proportions") } p}
gene_id =unique(drim.padj[order(drim.padj$transcript, drim.padj$gene),]$gene_id.1)[1]png("plotProportions.DRIMSeq-stageR.2.png", width=6, height=6, units ="in", res =300)plotExpression(drim.prop, gene_id, samps, isProportion =TRUE)dev.off()