Gene-level RNA-Seq Data Analysis (lagacy)
  • Introduction
  • RNA-Seq Analysis Workflow
    • Login to server
    • Obtain data and software
    • Create mapping indices
    • Mapping with STAR
    • Quantification using RSEM
  • De novo assembly using Trinity
    • De novo assembly of RNA-Seq reads
    • Compare de novo reconstructed transcripts to reference annotations
    • Quantification using RSEM
  • Differential expression analysis using R
    • Install R libraries
    • Perform DE analysis
    • Perform DE analysis (Trinity)
    • Extracting DE transcripts and generating heatmaps (Trinity)
  • Visualization
Powered by GitBook
On this page
  • Genome sequence and annotation
  • RNA-seq raw reads
  • Software
  • STAR (Spliced Transcripts Alignment to a Reference)
  • RSEM (RNA-Seq by Expectation-Maximization)
  • SAMtools

Was this helpful?

  1. RNA-Seq Analysis Workflow

Obtain data and software

PreviousLogin to serverNextCreate mapping indices

Last updated 5 years ago

Was this helpful?

We will not run through the commands in this section LIVE on ALPS1 in this class, because (1) there is not enough time and (2) each user account has only 10 Gb. However, in theory, you can run the whole thing on /work3 when you create your folder in it. Please try at your own time before your test account expires. You can also try the whole tutorial on your own Linux system with the required softwares installed.

Also, if you are using ALPS1, remember to use the bsub command to submit your jobs to the LSF system!

Genome sequence and annotation

We download the human genome FASTA sequences and annotation GTF file from the Ensembl FTP.

Filename

Size

Homo_sapiens.GRCh38.86.gtf.gz

44M

Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

841M

cd ~/
mkdir LSLNGS2015
mkdir LSLNGS2015/GENOME_data LSLNGS2015/RNASEQ_data

cd ~/LSLNGS2015/GENOME_data

wget http://ftp.ensembl.org/pub/release-86/gtf/homo_sapiens/Homo_sapiens.GRCh38.86.gtf.gz
gunzip Homo_sapiens.GRCh38.86.gtf.gz

wget http://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

Create annotations in BED format

Below, we use a combination of commands to convert annotations recorded in the GTF file into BED format.

cd ~/LSLNGS2015/GENOME_data

grep -P "\tgene\t" Homo_sapiens.GRCh38.86.gtf | cut -f1,4,5,7,9 | \
sed 's/[[:space:]]/\t/g' | sed 's/[;|"]//g' | \
awk -F $'\t' 'BEGIN { OFS=FS } { print $1,$2-1,$3,$6,".",$4,$10,$12,$14 }' | \
sort -k1,1 -k2,2n > Homo_sapiens.GRCh38.86.gene.bed

grep -P "\ttranscript\t" Homo_sapiens.GRCh38.86.gtf | cut -f1,4,5,7,9 | \
sed 's/[[:space:]]/\t/g' | sed 's/[;|"]//g' | \
awk -F $'\t' 'BEGIN { OFS=FS } { print $1,$2-1,$3,$10,".",$4,$14,$16,$18 }' | \
sort -k1,1 -k2,2n > Homo_sapiens.GRCh38.86.transcript.bed

Use head to view the BED files

head Homo_sapiens.GRCh38.86.gene.bed

1       11868   14409   ENSG00000223972 .       +       DDX11L1 havana  transcribed_unprocessed_pseudogene
1       14403   29570   ENSG00000227232 .       -       WASH7P  havana  unprocessed_pseudogene
1       17368   17436   ENSG00000278267 .       -       MIR6859-1       mirbase miRNA
1       29553   31109   ENSG00000243485 .       +       MIR1302-2       ensembl_havana  lincRNA
1       34553   36081   ENSG00000237613 .       -       FAM138A havana  lincRNA
1       52472   53312   ENSG00000268020 .       +       OR4G4P  havana  unprocessed_pseudogene
1       62947   63887   ENSG00000240361 .       +       OR4G11P havana  unprocessed_pseudogene
1       69090   70008   ENSG00000186092 .       +       OR4F5   ensembl_havana  protein_coding
1       89294   133723  ENSG00000238009 .       -       RP11-34P13.7    ensembl_havana  lincRNA
1       89550   91105   ENSG00000239945 .       -       RP11-34P13.8    havana  lincRNA

head Homo_sapiens.GRCh38.86.transcript.bed

1       11868   14409   ENST00000456328 .       +       DDX11L1 havana  transcribed_unprocessed_pseudogene
1       12009   13670   ENST00000450305 .       +       DDX11L1 havana  transcribed_unprocessed_pseudogene
1       14403   29570   ENST00000488147 .       -       WASH7P  havana  unprocessed_pseudogene
1       17368   17436   ENST00000619216 .       -       MIR6859-1       mirbase miRNA
1       29553   31097   ENST00000473358 .       +       MIR1302-2       ensembl_havana  lincRNA
1       30266   31109   ENST00000469289 .       +       MIR1302-2       ensembl_havana  lincRNA
1       30365   30503   ENST00000607096 .       +       MIR1302-2       ensembl_havana  lincRNA
1       34553   36081   ENST00000417324 .       -       FAM138A havana  lincRNA
1       35244   36073   ENST00000461467 .       -       FAM138A havana  lincRNA
1       52472   53312   ENST00000606857 .       +       OR4G4P  havana  unprocessed_pseudogene

RNA-seq raw reads

Filename

Sample ID

Size

GM12878.rep1.R1.fastq.gz

ENCFF001RFH

7.9G

GM12878.rep1.R2.fastq.gz

ENCFF001RFG

8.0G

GM12878.rep2.R1.fastq.gz

ENCFF001RFB

6.9G

GM12878.rep2.R2.fastq.gz

ENCFF001RFA

7.1G

K562.rep1.R1.fastq.gz

ENCFF001RED

7.2G

K562.rep1.R2.fastq.gz

ENCFF001RDZ

7.4G

K562.rep2.R1.fastq.gz

ENCFF001REG

8.8G

K562.rep2.R2.fastq.gz

ENCFF001REF

9.1G

cd ~/LSLNGS2015/RNASEQ_data

wget https://www.encodeproject.org/files/ENCFF001RFH/@@download/ENCFF001RFH.fastq.gz -O GM12878.rep1.R1.fastq.gz
wget https://www.encodeproject.org/files/ENCFF001RFG/@@download/ENCFF001RFG.fastq.gz -O GM12878.rep1.R2.fastq.gz

wget https://www.encodeproject.org/files/ENCFF001RFB/@@download/ENCFF001RFB.fastq.gz -O GM12878.rep2.R1.fastq.gz
wget https://www.encodeproject.org/files/ENCFF001RFA/@@download/ENCFF001RFA.fastq.gz -O GM12878.rep2.R2.fastq.gz

wget https://www.encodeproject.org/files/ENCFF001RED/@@download/ENCFF001RED.fastq.gz -O K562.rep1.R1.fastq.gz
wget https://www.encodeproject.org/files/ENCFF001RDZ/@@download/ENCFF001RDZ.fastq.gz -O K562.rep1.R2.fastq.gz

wget https://www.encodeproject.org/files/ENCFF001REG/@@download/ENCFF001REG.fastq.gz -O K562.rep2.R1.fastq.gz
wget https://www.encodeproject.org/files/ENCFF001REF/@@download/ENCFF001REF.fastq.gz -O K562.rep2.R2.fastq.gz

Software

  • v2.5.2b [20 Aug 2016] - Latest version available at the time of writing and used in this exercise

  • v2.3.0e [14 Feb 2013] - Latest version available on ALPS1

  • v1.3.0 [02 Oct 2016] - Latest version available at the time of writing

  • v1.2.31 [04 Jun 2016] - Version used in this exercise

  • v1.2.19 [05 Nov 2014] - Latest version available on ALPS1

  • v1.3.1 [22 Apr 2016] - Latest version available at the time of writing and used in this exercise

  • v1.2 [02 Feb 2015] - Latest version available on ALPS1

Next, we download the RNA-Seq data of two adult female cell lines, GM12878 () and K562 (), from the ENCODE website. The experiment were performed with 2 replicates and they are stranded PE101 Illumina Hi-Seq RNA-Seq libraries from rRNA-depleted Poly-A+ RNA.

(Spliced Transcripts Alignment to a Reference)

(RNA-Seq by Expectation-Maximization)

ENCSR000AEC
ENCSR000AEM
STAR
RSEM
SAMtools
ALPS1 work3 folder