Obtain data and software
We will not run through the commands in this section LIVE on ALPS1 in this class, because (1) there is not enough time and (2) each user account has only 10 Gb. However, in theory, you can run the whole thing on /work3 when you create your folder in it. Please try at your own time before your test account expires. You can also try the whole tutorial on your own Linux system with the required softwares installed.
Also, if you are using ALPS1, remember to use the bsub command to submit your jobs to the LSF system!
ALPS1 work3 folder

Genome sequence and annotation

We download the human genome FASTA sequences and annotation GTF file from the Ensembl FTP.
Filename
Size
Homo_sapiens.GRCh38.86.gtf.gz
44M
Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
841M
1
cd ~/
2
mkdir LSLNGS2015
3
mkdir LSLNGS2015/GENOME_data LSLNGS2015/RNASEQ_data
4
5
cd ~/LSLNGS2015/GENOME_data
6
7
wget http://ftp.ensembl.org/pub/release-86/gtf/homo_sapiens/Homo_sapiens.GRCh38.86.gtf.gz
8
gunzip Homo_sapiens.GRCh38.86.gtf.gz
9
10
wget http://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
11
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Copied!

Create annotations in BED format

Below, we use a combination of commands to convert annotations recorded in the GTF file into BED format.
1
cd ~/LSLNGS2015/GENOME_data
2
3
grep -P "\tgene\t" Homo_sapiens.GRCh38.86.gtf | cut -f1,4,5,7,9 | \
4
sed 's/[[:space:]]/\t/g' | sed 's/[;|"]//g' | \
5
awk -F #x27;\t' 'BEGIN { OFS=FS } { print $1,$2-1,$3,$6,".",$4,$10,$12,$14 }' | \
6
sort -k1,1 -k2,2n > Homo_sapiens.GRCh38.86.gene.bed
7
8
grep -P "\ttranscript\t" Homo_sapiens.GRCh38.86.gtf | cut -f1,4,5,7,9 | \
9
sed 's/[[:space:]]/\t/g' | sed 's/[;|"]//g' | \
10
awk -F #x27;\t' 'BEGIN { OFS=FS } { print $1,$2-1,$3,$10,".",$4,$14,$16,$18 }' | \
11
sort -k1,1 -k2,2n > Homo_sapiens.GRCh38.86.transcript.bed
Copied!
Use head to view the BED files
head Homo_sapiens.GRCh38.86.gene.bed
1
1 11868 14409 ENSG00000223972 . + DDX11L1 havana transcribed_unprocessed_pseudogene
2
1 14403 29570 ENSG00000227232 . - WASH7P havana unprocessed_pseudogene
3
1 17368 17436 ENSG00000278267 . - MIR6859-1 mirbase miRNA
4
1 29553 31109 ENSG00000243485 . + MIR1302-2 ensembl_havana lincRNA
5
1 34553 36081 ENSG00000237613 . - FAM138A havana lincRNA
6
1 52472 53312 ENSG00000268020 . + OR4G4P havana unprocessed_pseudogene
7
1 62947 63887 ENSG00000240361 . + OR4G11P havana unprocessed_pseudogene
8
1 69090 70008 ENSG00000186092 . + OR4F5 ensembl_havana protein_coding
9
1 89294 133723 ENSG00000238009 . - RP11-34P13.7 ensembl_havana lincRNA
10
1 89550 91105 ENSG00000239945 . - RP11-34P13.8 havana lincRNA
Copied!
head Homo_sapiens.GRCh38.86.transcript.bed
1
1 11868 14409 ENST00000456328 . + DDX11L1 havana transcribed_unprocessed_pseudogene
2
1 12009 13670 ENST00000450305 . + DDX11L1 havana transcribed_unprocessed_pseudogene
3
1 14403 29570 ENST00000488147 . - WASH7P havana unprocessed_pseudogene
4
1 17368 17436 ENST00000619216 . - MIR6859-1 mirbase miRNA
5
1 29553 31097 ENST00000473358 . + MIR1302-2 ensembl_havana lincRNA
6
1 30266 31109 ENST00000469289 . + MIR1302-2 ensembl_havana lincRNA
7
1 30365 30503 ENST00000607096 . + MIR1302-2 ensembl_havana lincRNA
8
1 34553 36081 ENST00000417324 . - FAM138A havana lincRNA
9
1 35244 36073 ENST00000461467 . - FAM138A havana lincRNA
10
1 52472 53312 ENST00000606857 . + OR4G4P havana unprocessed_pseudogene
Copied!

RNA-seq raw reads

Next, we download the RNA-Seq data of two adult female cell lines, GM12878 (ENCSR000AEC) and K562 (ENCSR000AEM), from the ENCODE website. The experiment were performed with 2 replicates and they are stranded PE101 Illumina Hi-Seq RNA-Seq libraries from rRNA-depleted Poly-A+ RNA.
Filename
Sample ID
Size
GM12878.rep1.R1.fastq.gz
ENCFF001RFH
7.9G
GM12878.rep1.R2.fastq.gz
ENCFF001RFG
8.0G
GM12878.rep2.R1.fastq.gz
ENCFF001RFB
6.9G
GM12878.rep2.R2.fastq.gz
ENCFF001RFA
7.1G
K562.rep1.R1.fastq.gz
ENCFF001RED
7.2G
K562.rep1.R2.fastq.gz
ENCFF001RDZ
7.4G
K562.rep2.R1.fastq.gz
ENCFF001REG
8.8G
K562.rep2.R2.fastq.gz
ENCFF001REF
9.1G
1
cd ~/LSLNGS2015/RNASEQ_data
2
3
wget https://www.encodeproject.org/files/ENCFF001RFH/@@download/ENCFF001RFH.fastq.gz -O GM12878.rep1.R1.fastq.gz
4
wget https://www.encodeproject.org/files/ENCFF001RFG/@@download/ENCFF001RFG.fastq.gz -O GM12878.rep1.R2.fastq.gz
5
6
wget https://www.encodeproject.org/files/ENCFF001RFB/@@download/ENCFF001RFB.fastq.gz -O GM12878.rep2.R1.fastq.gz
7
wget https://www.encodeproject.org/files/ENCFF001RFA/@@download/ENCFF001RFA.fastq.gz -O GM12878.rep2.R2.fastq.gz
8
9
wget https://www.encodeproject.org/files/ENCFF001RED/@@download/ENCFF001RED.fastq.gz -O K562.rep1.R1.fastq.gz
10
wget https://www.encodeproject.org/files/ENCFF001RDZ/@@download/ENCFF001RDZ.fastq.gz -O K562.rep1.R2.fastq.gz
11
12
wget https://www.encodeproject.org/files/ENCFF001REG/@@download/ENCFF001REG.fastq.gz -O K562.rep2.R1.fastq.gz
13
wget https://www.encodeproject.org/files/ENCFF001REF/@@download/ENCFF001REF.fastq.gz -O K562.rep2.R2.fastq.gz
Copied!

Software

STAR (Spliced Transcripts Alignment to a Reference)

    v2.5.2b [20 Aug 2016] - Latest version available at the time of writing and used in this exercise
    v2.3.0e [14 Feb 2013] - Latest version available on ALPS1

RSEM (RNA-Seq by Expectation-Maximization)

    v1.3.0 [02 Oct 2016] - Latest version available at the time of writing
    v1.2.31 [04 Jun 2016] - Version used in this exercise
    v1.2.19 [05 Nov 2014] - Latest version available on ALPS1

SAMtools

    v1.3.1 [22 Apr 2016] - Latest version available at the time of writing and used in this exercise
    v1.2 [02 Feb 2015] - Latest version available on ALPS1
Last modified 1yr ago