Obtain data and software
We will not run the commands in this section live on ALPS1 during this class, because (1) there is not enough time and (2) each user account has only 10 GB. In principle, however, you can run everything on /work3 once you have created your own folder there. Please try it in your own time before your test account expires. You can also work through the whole tutorial on your own Linux system with the required software installed.
Also, if you are using ALPS1, remember to use the bsub command to submit your jobs to the LSF system!
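As a rough illustration of the bsub reminder above, the sketch below wraps a command line in an LSF submission. The helper name `lsf_submit`, the `BSUB` override variable, and the log file naming are assumptions made for this example; queue names and resource limits on ALPS1 are not covered here and should be taken from the cluster documentation.

```shell
# LSF submission command. Override it (for example with a stub) for a dry run.
BSUB="${BSUB:-bsub}"

# lsf_submit JOB_NAME CORES COMMAND [ARGS...]
# Hypothetical helper: submits COMMAND as a named job with CORES slots and
# writes stdout/stderr to files tagged with the LSF job ID (%J).
lsf_submit() {
    job="$1"
    cores="$2"
    shift 2
    "$BSUB" -J "$job" -n "$cores" -o "$job.%J.out" -e "$job.%J.err" "$@"
}

# Usage (commented out so that sourcing this file submits nothing):
# lsf_submit view_bed 1 head Homo_sapiens.GRCh38.86.gene.bed
```

Keeping the submission options in one helper makes it easier to submit every later step of the tutorial with the same logging convention.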
We download the human genome FASTA sequences and annotation GTF file from the Ensembl FTP.
| Filename | Size |
| --- | --- |
| Homo_sapiens.GRCh38.86.gtf.gz | 44M |
| Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz | 841M |
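The two files listed above could be fetched along the following lines. This is a minimal sketch: the base URL and directory layout are assumed from the usual structure of Ensembl release 86 on the FTP site and should be checked before use, and the helper names `ensembl_url` and `fetch_ensembl` are made up for this example.

```shell
# Assumed location of Ensembl release 86; verify against the Ensembl FTP site.
ENSEMBL_FTP="ftp://ftp.ensembl.org/pub/release-86"

# Print the assumed download URL for one of the two files listed above.
ensembl_url() {
    case "$1" in
        *.gtf.gz) echo "${ENSEMBL_FTP}/gtf/homo_sapiens/$1" ;;
        *.fa.gz)  echo "${ENSEMBL_FTP}/fasta/homo_sapiens/dna/$1" ;;
        *)        echo "unknown file type: $1" >&2; return 1 ;;
    esac
}

# Download one file, resuming a partial download if one exists (-c).
fetch_ensembl() {
    wget -c "$(ensembl_url "$1")"
}

# Usage (commented out so that sourcing this file downloads nothing):
# fetch_ensembl Homo_sapiens.GRCh38.86.gtf.gz
# fetch_ensembl Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
```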
Below, we use a combination of commands to convert annotations recorded in the GTF file into BED format.
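One way such a conversion could look is sketched below. It is not necessarily the exact command combination used in this tutorial: the function name `gtf_to_bed` is hypothetical, and the choice of a six-column BED layout named by `gene_id` or `transcript_id` is an assumption. The key detail is the coordinate shift, since GTF start positions are 1-based while BED start positions are 0-based.

```shell
# gtf_to_bed FEATURE ATTRIBUTE
# Reads GTF on stdin and writes 6-column BED on stdout for rows whose third
# column equals FEATURE, naming each interval by the value of ATTRIBUTE.
gtf_to_bed() {
    awk -F'\t' -v feature="$1" -v key="$2" '
        BEGIN { OFS = "\t" }
        $3 == feature && match($9, key " \"[^\"]+\"") {
            # Cut the quoted value out of the matched attribute text.
            name = substr($9, RSTART + length(key) + 2, RLENGTH - length(key) - 3)
            # GTF starts are 1-based and inclusive, BED starts are 0-based.
            print $1, $4 - 1, $5, name, ".", $7
        }'
}

# Usage (commented out so that sourcing this file reads nothing):
# zcat Homo_sapiens.GRCh38.86.gtf.gz | gtf_to_bed gene gene_id > Homo_sapiens.GRCh38.86.gene.bed
# zcat Homo_sapiens.GRCh38.86.gtf.gz | gtf_to_bed transcript transcript_id > Homo_sapiens.GRCh38.86.transcript.bed
```

Header lines starting with `#` have no third column and are therefore skipped without special handling.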
Use head to view the BED files:
head Homo_sapiens.GRCh38.86.gene.bed
head Homo_sapiens.GRCh38.86.transcript.bed
| Filename | Sample ID | Size |
| --- | --- | --- |
| GM12878.rep1.R1.fastq.gz | ENCFF001RFH | 7.9G |
| GM12878.rep1.R2.fastq.gz | ENCFF001RFG | 8.0G |
| GM12878.rep2.R1.fastq.gz | ENCFF001RFB | 6.9G |
| GM12878.rep2.R2.fastq.gz | ENCFF001RFA | 7.1G |
| K562.rep1.R1.fastq.gz | ENCFF001RED | 7.2G |
| K562.rep1.R2.fastq.gz | ENCFF001RDZ | 7.4G |
| K562.rep2.R1.fastq.gz | ENCFF001REG | 8.8G |
| K562.rep2.R2.fastq.gz | ENCFF001REF | 9.1G |
v2.5.2b [20 Aug 2016] - Latest version available at the time of writing and used in this exercise
v2.3.0e [14 Feb 2013] - Latest version available on ALPS1
v1.3.0 [02 Oct 2016] - Latest version available at the time of writing
v1.2.31 [04 Jun 2016] - Version used in this exercise
v1.2.19 [05 Nov 2014] - Latest version available on ALPS1
v1.3.1 [22 Apr 2016] - Latest version available at the time of writing and used in this exercise
v1.2 [02 Feb 2015] - Latest version available on ALPS1
Next, we download the RNA-Seq data of two adult female cell lines, GM12878 and K562, from the ENCODE website. The experiments were performed with 2 replicates each, and the libraries are stranded PE101 Illumina HiSeq RNA-Seq libraries prepared from rRNA-depleted poly-A+ RNA.
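The download of the eight FASTQ files could be scripted roughly as follows, renaming each file from its ENCODE accession to the descriptive filename used in this tutorial. The URL pattern in `encode_url` is an assumption based on how ENCODE file downloads are usually addressed, and the helper names are made up for this sketch. Note that the files total about 62G, far more than the 10 GB per-account limit mentioned earlier.

```shell
# Print "accession local_filename" pairs for the eight FASTQ files.
encode_manifest() {
    cat <<'EOF'
ENCFF001RFH GM12878.rep1.R1.fastq.gz
ENCFF001RFG GM12878.rep1.R2.fastq.gz
ENCFF001RFB GM12878.rep2.R1.fastq.gz
ENCFF001RFA GM12878.rep2.R2.fastq.gz
ENCFF001RED K562.rep1.R1.fastq.gz
ENCFF001RDZ K562.rep1.R2.fastq.gz
ENCFF001REG K562.rep2.R1.fastq.gz
ENCFF001REF K562.rep2.R2.fastq.gz
EOF
}

# Assumed ENCODE download URL pattern for a FASTQ accession.
encode_url() {
    echo "https://www.encodeproject.org/files/$1/@@download/$1.fastq.gz"
}

# Download every file in the manifest under its descriptive name.
fetch_encode() {
    encode_manifest | while read -r acc name; do
        wget -O "$name" "$(encode_url "$acc")"
    done
}

# Usage (commented out so that sourcing this file downloads nothing):
# fetch_encode
```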