Quantification using RSEM

In this tutorial, we use RSEM to quantify the expression of genes and transcript. In the previous step, we instruct STAR to output genomic alignments in transcriptomic coordinates (i.e. Aligned.toTranscriptome.out.bam). We input this file to RSEM to produce gene and transcript expression levels.

Usage

rsem-calculate-expression [options] upstream_read_file(s) reference_name sample_name
rsem-calculate-expression [options] --paired-end upstream_read_file(s) downstream_read_file(s) reference_name sample_name
rsem-calculate-expression [options] --sam/--bam [--paired-end] input reference_name sample_name

Execute

cd ~/LSLNGS2015/
mkdir RNASEQ_data/rsem_GM12878_rep1 RNASEQ_data/rsem_GM12878_rep2

/home/ycl6/tools/RSEM/rsem-calculate-expression --bam --no-bam-output -p 10 \
--paired-end --forward-prob 0 \
RNASEQ_data/star_GM12878_rep1/Aligned.toTranscriptome.out.bam \
GENOME_data/rsem/GRCh38 RNASEQ_data/rsem_GM12878_rep1/rsem >& \
RNASEQ_data/rsem_GM12878_rep1/rsem.log

/home/ycl6/tools/RSEM/rsem-calculate-expression --bam --no-bam-output -p 10 \
--paired-end --forward-prob 0 \
RNASEQ_data/star_GM12878_rep2/Aligned.toTranscriptome.out.bam \
GENOME_data/rsem/GRCh38 RNASEQ_data/rsem_GM12878_rep2/rsem >& \
RNASEQ_data/rsem_GM12878_rep2/rsem.log

mkdir RNASEQ_data/rsem_K562_rep1 RNASEQ_data/rsem_K562_rep2

/home/ycl6/tools/RSEM/rsem-calculate-expression --bam --no-bam-output -p 10 \
--paired-end --forward-prob 0 \
RNASEQ_data/star_K562_rep1/Aligned.toTranscriptome.out.bam \
GENOME_data/rsem/GRCh38 RNASEQ_data/rsem_K562_rep1/rsem >& \
RNASEQ_data/rsem_K562_rep1/rsem.log

/home/ycl6/tools/RSEM/rsem-calculate-expression --bam --no-bam-output -p 10 \
--paired-end --forward-prob 0 \
RNASEQ_data/star_K562_rep2/Aligned.toTranscriptome.out.bam \
GENOME_data/rsem/GRCh38 RNASEQ_data/rsem_K562_rep2/rsem >& \
RNASEQ_data/rsem_K562_rep2/rsem.log

Resource usage

Options

--bam Input file is in BAM format.

--no-bam-output Do not output any BAM file.

-p Number of threads to use.

--paired-end Input reads are paired-end reads.

--forward-prob Probability of generating a read from the forward strand of a transcript. 1: strand-specific protocol where all (upstream) reads are derived from the forward strand; 0: strand-specific protocol where all (upstream) read are derived from the reverse strand; 0.5: non-strand-specific protocol.

Output

RSEM generates 2 result files: 1. rsem.genes.results 2. rsem.isoforms.results.

We can use head to examine these files.

head ~/LSLNGS2015/RNASEQ_data/rsem_GM12878_rep1/rsem.genes.results

gene_id transcript_id(s)        length  effective_length        expected_count  TPM     FPKM
ENSG00000000003 ENST00000373020,ENST00000494424,ENST00000496771,ENST00000612152,ENST00000614008 2206.00 1997.20 1.00    0.00    0.01
ENSG00000000005 ENST00000373031,ENST00000485971 940.50  731.73  0.00    0.00    0.00
ENSG00000000419 ENST00000371582,ENST00000371584,ENST00000371588,ENST00000413082,ENST00000466152,ENST00000494752 977.15  768.35  1865.00 14.27   37.82
ENSG00000000457 ENST00000367770,ENST00000367771,ENST00000367772,ENST00000423670,ENST00000470238 3779.11 3570.31 1521.00 2.50    6.64
ENSG00000000460 ENST00000286031,ENST00000359326,ENST00000413811,ENST00000459772,ENST00000466580,ENST00000472795,ENST00000481744,ENST00000496973,ENST00000498289 1936.74 1727.94 1860.00 6.33    16.77
ENSG00000000938 ENST00000374003,ENST00000374004,ENST00000374005,ENST00000399173,ENST00000457296,ENST00000468038,ENST00000475472 2020.10 1811.30 6846.00 22.22   58.90
ENSG00000000971 ENST00000359637,ENST00000367429,ENST00000466229,ENST00000470918,ENST00000496761,ENST00000630130 2587.83 2379.04 0.00    0.00    0.00
ENSG00000001036 ENST00000002165,ENST00000367585,ENST00000451668 1912.64 1703.85 1358.00 4.69    12.42
ENSG00000001084 ENST00000229416,ENST00000504353,ENST00000504525,ENST00000505197,ENST00000505294,ENST00000509541,ENST00000510837,ENST00000513939,ENST00000514004,ENST00000514373,ENST00000514933,ENST00000515580,ENST00000616923      2333.50 2124.73 1178.00 3.26    8.64

head ~/LSLNGS2015/RNASEQ_data/rsem_GM12878_rep1/rsem.isoforms.results

transcript_id   gene_id length  effective_length        expected_count  TPM     FPKM    IsoPct
ENST00000373020 ENSG00000000003 2206    1997.20 1.00    0.00    0.01    100.00
ENST00000494424 ENSG00000000003 820     611.20  0.00    0.00    0.00    0.00
ENST00000496771 ENSG00000000003 1025    816.20  0.00    0.00    0.00    0.00
ENST00000612152 ENSG00000000003 3796    3587.20 0.00    0.00    0.00    0.00
ENST00000614008 ENSG00000000003 900     691.20  0.00    0.00    0.00    0.00
ENST00000373031 ENSG00000000005 1339    1130.20 0.00    0.00    0.00    0.00
ENST00000485971 ENSG00000000005 542     333.26  0.00    0.00    0.00    0.00
ENST00000371582 ENSG00000000419 1161    952.20  16.56   0.10    0.27    0.72
ENST00000371584 ENSG00000000419 1073    864.20  8.54    0.06    0.15    0.41

Check number of lines in all result files with wc -l

wc -l RNASEQ_data/rsem_*/*results

   58052 RNASEQ_data/rsem_GM12878_rep1/rsem.genes.results
  198003 RNASEQ_data/rsem_GM12878_rep1/rsem.isoforms.results
   58052 RNASEQ_data/rsem_GM12878_rep2/rsem.genes.results
  198003 RNASEQ_data/rsem_GM12878_rep2/rsem.isoforms.results
   58052 RNASEQ_data/rsem_K562_rep1/rsem.genes.results
  198003 RNASEQ_data/rsem_K562_rep1/rsem.isoforms.results
   58052 RNASEQ_data/rsem_K562_rep2/rsem.genes.results
  198003 RNASEQ_data/rsem_K562_rep2/rsem.isoforms.results

Prepare gene-level and transcript-level expression matrices

We use paste command to join the rsem.genes.results files side-by-side, then use cut to select the columns containing the expected_count information, and place them into a final output file. Repeat the same step for isoforms.

This one-line command assumes the genes (and transcripts) in each files are in the same order. If they are not, you will have to sort the files before joining them together.

cd ~/LSLNGS2015/

paste RNASEQ_data/rsem_*/rsem.genes.results | tail -n+2 | \
cut -f1,5,12,19,26 > RNASEQ_data/edgeR.genes.rsem.txt

paste RNASEQ_data/rsem_*/rsem.isoforms.results | tail -n+2 | \
cut -f1,5,13,21,29 > RNASEQ_data/edgeR.isoforms.rsem.txt

Let us look at the content of the created data matrices

Gene-level: head RNASEQ_data/edgeR.genes.rsem.txt

ENSG00000000003 1.00    0.00    3.00    2.00
ENSG00000000005 0.00    0.00    0.00    0.00
ENSG00000000419 1865.00 1951.00 5909.00 8163.00
ENSG00000000457 1521.00 1488.00 849.00  1400.00
ENSG00000000460 1860.00 1616.00 2577.00 2715.00
ENSG00000000938 6846.00 5298.00 1.00    2.00
ENSG00000000971 0.00    0.00    6159.00 7069.00
ENSG00000001036 1358.00 1186.00 6196.00 7009.00
ENSG00000001084 1178.00 1186.00 631.00  1293.00
ENSG00000001167 2247.00 1835.00 2887.00 3155.00

Transcript-level: head RNASEQ_data/edgeR.isoforms.rsem.txt

ENST00000373020 1.00    0.00    2.88    2.00
ENST00000494424 0.00    0.00    0.12    0.00
ENST00000496771 0.00    0.00    0.00    0.00
ENST00000612152 0.00    0.00    0.00    0.00
ENST00000614008 0.00    0.00    0.00    0.00
ENST00000373031 0.00    0.00    0.00    0.00
ENST00000485971 0.00    0.00    0.00    0.00
ENST00000371582 16.56   53.45   530.00  620.89
ENST00000371584 8.54    44.25   234.93  368.14
ENST00000371588 1562.96 1853.30 5069.81 7081.71

Last updated