Obtain data and software
Last updated
Was this helpful?
Last updated
Was this helpful?
We will not run through the commands in this section LIVE on ALPS1 in this class, because (1) there is not enough time and (2) each user account has only 10 Gb. However, in theory, you can run the whole thing on /work3
when you create your folder in it. Please try at your own time before your test account expires. You can also try the whole tutorial on your own Linux system with the required softwares installed.
Also, if you are using ALPS1, remember to use the bsub
command to submit your jobs to the LSF system!
We download the human genome FASTA sequences and annotation GTF file from the Ensembl FTP.
Filename
Size
Homo_sapiens.GRCh38.86.gtf.gz
44M
Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
841M
cd ~/
mkdir LSLNGS2015
mkdir LSLNGS2015/GENOME_data LSLNGS2015/RNASEQ_data
cd ~/LSLNGS2015/GENOME_data
wget http://ftp.ensembl.org/pub/release-86/gtf/homo_sapiens/Homo_sapiens.GRCh38.86.gtf.gz
gunzip Homo_sapiens.GRCh38.86.gtf.gz
wget http://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Below, we use a combination of commands to convert annotations recorded in the GTF file into BED format.
cd ~/LSLNGS2015/GENOME_data
grep -P "\tgene\t" Homo_sapiens.GRCh38.86.gtf | cut -f1,4,5,7,9 | \
sed 's/[[:space:]]/\t/g' | sed 's/[;|"]//g' | \
awk -F $'\t' 'BEGIN { OFS=FS } { print $1,$2-1,$3,$6,".",$4,$10,$12,$14 }' | \
sort -k1,1 -k2,2n > Homo_sapiens.GRCh38.86.gene.bed
grep -P "\ttranscript\t" Homo_sapiens.GRCh38.86.gtf | cut -f1,4,5,7,9 | \
sed 's/[[:space:]]/\t/g' | sed 's/[;|"]//g' | \
awk -F $'\t' 'BEGIN { OFS=FS } { print $1,$2-1,$3,$10,".",$4,$14,$16,$18 }' | \
sort -k1,1 -k2,2n > Homo_sapiens.GRCh38.86.transcript.bed
Use head
to view the BED files
head Homo_sapiens.GRCh38.86.gene.bed
1 11868 14409 ENSG00000223972 . + DDX11L1 havana transcribed_unprocessed_pseudogene
1 14403 29570 ENSG00000227232 . - WASH7P havana unprocessed_pseudogene
1 17368 17436 ENSG00000278267 . - MIR6859-1 mirbase miRNA
1 29553 31109 ENSG00000243485 . + MIR1302-2 ensembl_havana lincRNA
1 34553 36081 ENSG00000237613 . - FAM138A havana lincRNA
1 52472 53312 ENSG00000268020 . + OR4G4P havana unprocessed_pseudogene
1 62947 63887 ENSG00000240361 . + OR4G11P havana unprocessed_pseudogene
1 69090 70008 ENSG00000186092 . + OR4F5 ensembl_havana protein_coding
1 89294 133723 ENSG00000238009 . - RP11-34P13.7 ensembl_havana lincRNA
1 89550 91105 ENSG00000239945 . - RP11-34P13.8 havana lincRNA
head Homo_sapiens.GRCh38.86.transcript.bed
1 11868 14409 ENST00000456328 . + DDX11L1 havana transcribed_unprocessed_pseudogene
1 12009 13670 ENST00000450305 . + DDX11L1 havana transcribed_unprocessed_pseudogene
1 14403 29570 ENST00000488147 . - WASH7P havana unprocessed_pseudogene
1 17368 17436 ENST00000619216 . - MIR6859-1 mirbase miRNA
1 29553 31097 ENST00000473358 . + MIR1302-2 ensembl_havana lincRNA
1 30266 31109 ENST00000469289 . + MIR1302-2 ensembl_havana lincRNA
1 30365 30503 ENST00000607096 . + MIR1302-2 ensembl_havana lincRNA
1 34553 36081 ENST00000417324 . - FAM138A havana lincRNA
1 35244 36073 ENST00000461467 . - FAM138A havana lincRNA
1 52472 53312 ENST00000606857 . + OR4G4P havana unprocessed_pseudogene
Next, we download the RNA-Seq data of two adult female cell lines, GM12878 (ENCSR000AEC) and K562 (ENCSR000AEM), from the ENCODE website. The experiment were performed with 2 replicates and they are stranded PE101 Illumina Hi-Seq RNA-Seq libraries from rRNA-depleted Poly-A+ RNA.
Filename
Sample ID
Size
GM12878.rep1.R1.fastq.gz
ENCFF001RFH
7.9G
GM12878.rep1.R2.fastq.gz
ENCFF001RFG
8.0G
GM12878.rep2.R1.fastq.gz
ENCFF001RFB
6.9G
GM12878.rep2.R2.fastq.gz
ENCFF001RFA
7.1G
K562.rep1.R1.fastq.gz
ENCFF001RED
7.2G
K562.rep1.R2.fastq.gz
ENCFF001RDZ
7.4G
K562.rep2.R1.fastq.gz
ENCFF001REG
8.8G
K562.rep2.R2.fastq.gz
ENCFF001REF
9.1G
cd ~/LSLNGS2015/RNASEQ_data
wget https://www.encodeproject.org/files/ENCFF001RFH/@@download/ENCFF001RFH.fastq.gz -O GM12878.rep1.R1.fastq.gz
wget https://www.encodeproject.org/files/ENCFF001RFG/@@download/ENCFF001RFG.fastq.gz -O GM12878.rep1.R2.fastq.gz
wget https://www.encodeproject.org/files/ENCFF001RFB/@@download/ENCFF001RFB.fastq.gz -O GM12878.rep2.R1.fastq.gz
wget https://www.encodeproject.org/files/ENCFF001RFA/@@download/ENCFF001RFA.fastq.gz -O GM12878.rep2.R2.fastq.gz
wget https://www.encodeproject.org/files/ENCFF001RED/@@download/ENCFF001RED.fastq.gz -O K562.rep1.R1.fastq.gz
wget https://www.encodeproject.org/files/ENCFF001RDZ/@@download/ENCFF001RDZ.fastq.gz -O K562.rep1.R2.fastq.gz
wget https://www.encodeproject.org/files/ENCFF001REG/@@download/ENCFF001REG.fastq.gz -O K562.rep2.R1.fastq.gz
wget https://www.encodeproject.org/files/ENCFF001REF/@@download/ENCFF001REF.fastq.gz -O K562.rep2.R2.fastq.gz
v2.5.2b [20 Aug 2016] - Latest version available at the time of writing and used in this exercise
v2.3.0e [14 Feb 2013] - Latest version available on ALPS1
v1.3.0 [02 Oct 2016] - Latest version available at the time of writing
v1.2.31 [04 Jun 2016] - Version used in this exercise
v1.2.19 [05 Nov 2014] - Latest version available on ALPS1
v1.3.1 [22 Apr 2016] - Latest version available at the time of writing and used in this exercise
v1.2 [02 Feb 2015] - Latest version available on ALPS1