Obtain data and software

We will not run through the commands in this section LIVE on ALPS1 in this class, because (1) there is not enough time and (2) each user account has only 10 Gb. However, in theory, you can run the whole thing on /work3 when you create your folder in it. Please try at your own time before your test account expires. You can also try the whole tutorial on your own Linux system with the required softwares installed.

Also, if you are using ALPS1, remember to use the bsub command to submit your jobs to the LSF system!

Genome sequence and annotation

We download the human genome FASTA sequences and annotation GTF file from the Ensembl FTP.

Filename

Size

Homo_sapiens.GRCh38.86.gtf.gz

44M

Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

841M

cd ~/
mkdir LSLNGS2015
mkdir LSLNGS2015/GENOME_data LSLNGS2015/RNASEQ_data

cd ~/LSLNGS2015/GENOME_data

wget http://ftp.ensembl.org/pub/release-86/gtf/homo_sapiens/Homo_sapiens.GRCh38.86.gtf.gz
gunzip Homo_sapiens.GRCh38.86.gtf.gz

wget http://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

Create annotations in BED format

Below, we use a combination of commands to convert annotations recorded in the GTF file into BED format.

cd ~/LSLNGS2015/GENOME_data

grep -P "\tgene\t" Homo_sapiens.GRCh38.86.gtf | cut -f1,4,5,7,9 | \
sed 's/[[:space:]]/\t/g' | sed 's/[;|"]//g' | \
awk -F $'\t' 'BEGIN { OFS=FS } { print $1,$2-1,$3,$6,".",$4,$10,$12,$14 }' | \
sort -k1,1 -k2,2n > Homo_sapiens.GRCh38.86.gene.bed

grep -P "\ttranscript\t" Homo_sapiens.GRCh38.86.gtf | cut -f1,4,5,7,9 | \
sed 's/[[:space:]]/\t/g' | sed 's/[;|"]//g' | \
awk -F $'\t' 'BEGIN { OFS=FS } { print $1,$2-1,$3,$10,".",$4,$14,$16,$18 }' | \
sort -k1,1 -k2,2n > Homo_sapiens.GRCh38.86.transcript.bed

Use head to view the BED files

head Homo_sapiens.GRCh38.86.gene.bed

1       11868   14409   ENSG00000223972 .       +       DDX11L1 havana  transcribed_unprocessed_pseudogene
1       14403   29570   ENSG00000227232 .       -       WASH7P  havana  unprocessed_pseudogene
1       17368   17436   ENSG00000278267 .       -       MIR6859-1       mirbase miRNA
1       29553   31109   ENSG00000243485 .       +       MIR1302-2       ensembl_havana  lincRNA
1       34553   36081   ENSG00000237613 .       -       FAM138A havana  lincRNA
1       52472   53312   ENSG00000268020 .       +       OR4G4P  havana  unprocessed_pseudogene
1       62947   63887   ENSG00000240361 .       +       OR4G11P havana  unprocessed_pseudogene
1       69090   70008   ENSG00000186092 .       +       OR4F5   ensembl_havana  protein_coding
1       89294   133723  ENSG00000238009 .       -       RP11-34P13.7    ensembl_havana  lincRNA
1       89550   91105   ENSG00000239945 .       -       RP11-34P13.8    havana  lincRNA

head Homo_sapiens.GRCh38.86.transcript.bed

1       11868   14409   ENST00000456328 .       +       DDX11L1 havana  transcribed_unprocessed_pseudogene
1       12009   13670   ENST00000450305 .       +       DDX11L1 havana  transcribed_unprocessed_pseudogene
1       14403   29570   ENST00000488147 .       -       WASH7P  havana  unprocessed_pseudogene
1       17368   17436   ENST00000619216 .       -       MIR6859-1       mirbase miRNA
1       29553   31097   ENST00000473358 .       +       MIR1302-2       ensembl_havana  lincRNA
1       30266   31109   ENST00000469289 .       +       MIR1302-2       ensembl_havana  lincRNA
1       30365   30503   ENST00000607096 .       +       MIR1302-2       ensembl_havana  lincRNA
1       34553   36081   ENST00000417324 .       -       FAM138A havana  lincRNA
1       35244   36073   ENST00000461467 .       -       FAM138A havana  lincRNA
1       52472   53312   ENST00000606857 .       +       OR4G4P  havana  unprocessed_pseudogene

RNA-seq raw reads

Next, we download the RNA-Seq data of two adult female cell lines, GM12878 (ENCSR000AEC) and K562 (ENCSR000AEM), from the ENCODE website. The experiment were performed with 2 replicates and they are stranded PE101 Illumina Hi-Seq RNA-Seq libraries from rRNA-depleted Poly-A+ RNA.

Filename

Sample ID

Size

GM12878.rep1.R1.fastq.gz

ENCFF001RFH

7.9G

GM12878.rep1.R2.fastq.gz

ENCFF001RFG

8.0G

GM12878.rep2.R1.fastq.gz

ENCFF001RFB

6.9G

GM12878.rep2.R2.fastq.gz

ENCFF001RFA

7.1G

K562.rep1.R1.fastq.gz

ENCFF001RED

7.2G

K562.rep1.R2.fastq.gz

ENCFF001RDZ

7.4G

K562.rep2.R1.fastq.gz

ENCFF001REG

8.8G

K562.rep2.R2.fastq.gz

ENCFF001REF

9.1G

cd ~/LSLNGS2015/RNASEQ_data

wget https://www.encodeproject.org/files/ENCFF001RFH/@@download/ENCFF001RFH.fastq.gz -O GM12878.rep1.R1.fastq.gz
wget https://www.encodeproject.org/files/ENCFF001RFG/@@download/ENCFF001RFG.fastq.gz -O GM12878.rep1.R2.fastq.gz

wget https://www.encodeproject.org/files/ENCFF001RFB/@@download/ENCFF001RFB.fastq.gz -O GM12878.rep2.R1.fastq.gz
wget https://www.encodeproject.org/files/ENCFF001RFA/@@download/ENCFF001RFA.fastq.gz -O GM12878.rep2.R2.fastq.gz

wget https://www.encodeproject.org/files/ENCFF001RED/@@download/ENCFF001RED.fastq.gz -O K562.rep1.R1.fastq.gz
wget https://www.encodeproject.org/files/ENCFF001RDZ/@@download/ENCFF001RDZ.fastq.gz -O K562.rep1.R2.fastq.gz

wget https://www.encodeproject.org/files/ENCFF001REG/@@download/ENCFF001REG.fastq.gz -O K562.rep2.R1.fastq.gz
wget https://www.encodeproject.org/files/ENCFF001REF/@@download/ENCFF001REF.fastq.gz -O K562.rep2.R2.fastq.gz

Software

STAR (Spliced Transcripts Alignment to a Reference)

  • v2.5.2b [20 Aug 2016] - Latest version available at the time of writing and used in this exercise

  • v2.3.0e [14 Feb 2013] - Latest version available on ALPS1

RSEM (RNA-Seq by Expectation-Maximization)

  • v1.3.0 [02 Oct 2016] - Latest version available at the time of writing

  • v1.2.31 [04 Jun 2016] - Version used in this exercise

  • v1.2.19 [05 Nov 2014] - Latest version available on ALPS1

  • v1.3.1 [22 Apr 2016] - Latest version available at the time of writing and used in this exercise

  • v1.2 [02 Feb 2015] - Latest version available on ALPS1

Last updated