In this tutorial, we use RSEM to quantify the expression of genes and transcript. In the previous step, we instruct STAR to output genomic alignments in transcriptomic coordinates (i.e. Aligned.toTranscriptome.out.bam
). We input this file to RSEM to produce gene and transcript expression levels.
Usage
Copy rsem-calculate-expression [options] upstream_read_file(s) reference_name sample_name
rsem-calculate-expression [options] --paired-end upstream_read_file(s) downstream_read_file(s) reference_name sample_name
rsem-calculate-expression [options] --sam/--bam [--paired-end] input reference_name sample_name
Execute
Copy cd ~/LSLNGS2015/
mkdir RNASEQ_data/rsem_GM12878_rep1 RNASEQ_data/rsem_GM12878_rep2
/home/ycl6/tools/RSEM/rsem-calculate-expression --bam --no-bam-output -p 10 \
--paired-end --forward-prob 0 \
RNASEQ_data/star_GM12878_rep1/Aligned.toTranscriptome.out.bam \
GENOME_data/rsem/GRCh38 RNASEQ_data/rsem_GM12878_rep1/rsem >& \
RNASEQ_data/rsem_GM12878_rep1/rsem.log
/home/ycl6/tools/RSEM/rsem-calculate-expression --bam --no-bam-output -p 10 \
--paired-end --forward-prob 0 \
RNASEQ_data/star_GM12878_rep2/Aligned.toTranscriptome.out.bam \
GENOME_data/rsem/GRCh38 RNASEQ_data/rsem_GM12878_rep2/rsem >& \
RNASEQ_data/rsem_GM12878_rep2/rsem.log
mkdir RNASEQ_data/rsem_K562_rep1 RNASEQ_data/rsem_K562_rep2
/home/ycl6/tools/RSEM/rsem-calculate-expression --bam --no-bam-output -p 10 \
--paired-end --forward-prob 0 \
RNASEQ_data/star_K562_rep1/Aligned.toTranscriptome.out.bam \
GENOME_data/rsem/GRCh38 RNASEQ_data/rsem_K562_rep1/rsem >& \
RNASEQ_data/rsem_K562_rep1/rsem.log
/home/ycl6/tools/RSEM/rsem-calculate-expression --bam --no-bam-output -p 10 \
--paired-end --forward-prob 0 \
RNASEQ_data/star_K562_rep2/Aligned.toTranscriptome.out.bam \
GENOME_data/rsem/GRCh38 RNASEQ_data/rsem_K562_rep2/rsem >& \
RNASEQ_data/rsem_K562_rep2/rsem.log
Resource usage
6 hours 21 minutes 55 seconds
6 hours 41 minutes 16 seconds
Options
--bam
Input file is in BAM format.
--no-bam-output
Do not output any BAM file.
-p
Number of threads to use.
--paired-end
Input reads are paired-end reads.
--forward-prob
Probability of generating a read from the forward strand of a transcript. 1: strand-specific protocol where all (upstream) reads are derived from the forward strand; 0: strand-specific protocol where all (upstream) read are derived from the reverse strand; 0.5: non-strand-specific protocol.
Output
RSEM generates 2 result files: 1. rsem.genes.results 2. rsem.isoforms.results.
We can use head
to examine these files.
head ~/LSLNGS2015/RNASEQ_data/rsem_GM12878_rep1/rsem.genes.results
Copy gene_id transcript_id(s) length effective_length expected_count TPM FPKM
ENSG00000000003 ENST00000373020,ENST00000494424,ENST00000496771,ENST00000612152,ENST00000614008 2206.00 1997.20 1.00 0.00 0.01
ENSG00000000005 ENST00000373031,ENST00000485971 940.50 731.73 0.00 0.00 0.00
ENSG00000000419 ENST00000371582,ENST00000371584,ENST00000371588,ENST00000413082,ENST00000466152,ENST00000494752 977.15 768.35 1865.00 14.27 37.82
ENSG00000000457 ENST00000367770,ENST00000367771,ENST00000367772,ENST00000423670,ENST00000470238 3779.11 3570.31 1521.00 2.50 6.64
ENSG00000000460 ENST00000286031,ENST00000359326,ENST00000413811,ENST00000459772,ENST00000466580,ENST00000472795,ENST00000481744,ENST00000496973,ENST00000498289 1936.74 1727.94 1860.00 6.33 16.77
ENSG00000000938 ENST00000374003,ENST00000374004,ENST00000374005,ENST00000399173,ENST00000457296,ENST00000468038,ENST00000475472 2020.10 1811.30 6846.00 22.22 58.90
ENSG00000000971 ENST00000359637,ENST00000367429,ENST00000466229,ENST00000470918,ENST00000496761,ENST00000630130 2587.83 2379.04 0.00 0.00 0.00
ENSG00000001036 ENST00000002165,ENST00000367585,ENST00000451668 1912.64 1703.85 1358.00 4.69 12.42
ENSG00000001084 ENST00000229416,ENST00000504353,ENST00000504525,ENST00000505197,ENST00000505294,ENST00000509541,ENST00000510837,ENST00000513939,ENST00000514004,ENST00000514373,ENST00000514933,ENST00000515580,ENST00000616923 2333.50 2124.73 1178.00 3.26 8.64
head ~/LSLNGS2015/RNASEQ_data/rsem_GM12878_rep1/rsem.isoforms.results
Copy transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct
ENST00000373020 ENSG00000000003 2206 1997.20 1.00 0.00 0.01 100.00
ENST00000494424 ENSG00000000003 820 611.20 0.00 0.00 0.00 0.00
ENST00000496771 ENSG00000000003 1025 816.20 0.00 0.00 0.00 0.00
ENST00000612152 ENSG00000000003 3796 3587.20 0.00 0.00 0.00 0.00
ENST00000614008 ENSG00000000003 900 691.20 0.00 0.00 0.00 0.00
ENST00000373031 ENSG00000000005 1339 1130.20 0.00 0.00 0.00 0.00
ENST00000485971 ENSG00000000005 542 333.26 0.00 0.00 0.00 0.00
ENST00000371582 ENSG00000000419 1161 952.20 16.56 0.10 0.27 0.72
ENST00000371584 ENSG00000000419 1073 864.20 8.54 0.06 0.15 0.41
Check number of lines in all result files with wc -l
wc -l RNASEQ_data/rsem_*/*results
Copy 58052 RNASEQ_data/rsem_GM12878_rep1/rsem.genes.results
198003 RNASEQ_data/rsem_GM12878_rep1/rsem.isoforms.results
58052 RNASEQ_data/rsem_GM12878_rep2/rsem.genes.results
198003 RNASEQ_data/rsem_GM12878_rep2/rsem.isoforms.results
58052 RNASEQ_data/rsem_K562_rep1/rsem.genes.results
198003 RNASEQ_data/rsem_K562_rep1/rsem.isoforms.results
58052 RNASEQ_data/rsem_K562_rep2/rsem.genes.results
198003 RNASEQ_data/rsem_K562_rep2/rsem.isoforms.results
Prepare gene-level and transcript-level expression matrices
We use paste
command to join the rsem.genes.results files side-by-side, then use cut
to select the columns containing the expected_count information, and place them into a final output file. Repeat the same step for isoforms.
This one-line command assumes the genes (and transcripts) in each files are in the same order. If they are not, you will have to sort the files before joining them together.
Copy cd ~/LSLNGS2015/
paste RNASEQ_data/rsem_*/rsem.genes.results | tail -n+2 | \
cut -f1,5,12,19,26 > RNASEQ_data/edgeR.genes.rsem.txt
paste RNASEQ_data/rsem_*/rsem.isoforms.results | tail -n+2 | \
cut -f1,5,13,21,29 > RNASEQ_data/edgeR.isoforms.rsem.txt
Let us look at the content of the created data matrices
Gene-level: head RNASEQ_data/edgeR.genes.rsem.txt
Copy ENSG00000000003 1.00 0.00 3.00 2.00
ENSG00000000005 0.00 0.00 0.00 0.00
ENSG00000000419 1865.00 1951.00 5909.00 8163.00
ENSG00000000457 1521.00 1488.00 849.00 1400.00
ENSG00000000460 1860.00 1616.00 2577.00 2715.00
ENSG00000000938 6846.00 5298.00 1.00 2.00
ENSG00000000971 0.00 0.00 6159.00 7069.00
ENSG00000001036 1358.00 1186.00 6196.00 7009.00
ENSG00000001084 1178.00 1186.00 631.00 1293.00
ENSG00000001167 2247.00 1835.00 2887.00 3155.00
Transcript-level: head RNASEQ_data/edgeR.isoforms.rsem.txt
Copy ENST00000373020 1.00 0.00 2.88 2.00
ENST00000494424 0.00 0.00 0.12 0.00
ENST00000496771 0.00 0.00 0.00 0.00
ENST00000612152 0.00 0.00 0.00 0.00
ENST00000614008 0.00 0.00 0.00 0.00
ENST00000373031 0.00 0.00 0.00 0.00
ENST00000485971 0.00 0.00 0.00 0.00
ENST00000371582 16.56 53.45 530.00 620.89
ENST00000371584 8.54 44.25 234.93 368.14
ENST00000371588 1562.96 1853.30 5069.81 7081.71