Software and databases
Reference genome
$ cd /home/USER/db/refanno
Obtain from GENCODE
https://www.gencodegenes.org/human/release_33.html; Human v33 (GRCh38)
# Genome sequence, primary assembly
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/GRCh38.primary_assembly.genome.fa.gz
$ gunzip GRCh38.primary_assembly.genome.fa.gz
# Comprehensive gene annotation
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.annotation.gtf.gz
$ gunzip gencode.v33.annotation.gtf.gz
# Transcript sequences
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.transcripts.fa.gz
$ gunzip gencode.v33.transcripts.fa.gz
You can also retrieve mouse data from https://www.gencodegenes.org/mouse/; Mouse vM24 (GRCm38)
Clean up FASTA headers
$ sed 's/|ENSG.*//' gencode.v33.transcripts.fa > gencode.v33.transcripts.clean.fa
Create tabular info from GTF
We will use awk and sed to convert the gene-level and transcript-level information stored in the GTF file into tabular files for easy access.
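One possible way to do the conversion is sketched below. It uses awk alone rather than awk plus sed, and the function name gtf_table, the choice of columns, and the output file names in the usage lines are assumptions made for illustration, not part of the original tutorial.

```shell
# Print one tab-separated row per GTF record of the given feature type.
# Usage: gtf_table FEATURE KEY [KEY...] < annotation.gtf
# Each KEY is an attribute name from column 9 (e.g. gene_id, gene_name).
gtf_table() {
  feature=$1; shift
  awk -F'\t' -v OFS='\t' -v feature="$feature" -v keys="$*" '
    # Return the quoted value that follows "key" in the attribute string,
    # or NA when the attribute is absent.
    function attr(key, s,   m) {
      if (match(s, key " \"[^\"]+\"")) {
        m = substr(s, RSTART, RLENGTH)
        sub("^" key " \"", "", m)
        sub("\"$", "", m)
        return m
      }
      return "NA"
    }
    BEGIN { n = split(keys, k, " ") }
    # Skip "##" header lines; keep only rows whose 3rd column is the feature.
    !/^#/ && $3 == feature {
      row = attr(k[1], $9)
      for (i = 2; i <= n; i++) row = row OFS attr(k[i], $9)
      print row
    }'
}
```

With the function defined, gene-level and transcript-level tables could be written like this (output names are hypothetical):
$ gtf_table gene gene_id gene_name gene_type < gencode.v33.annotation.gtf > gencode.v33.gene_info.tsv
$ gtf_table transcript transcript_id gene_id transcript_name transcript_type < gencode.v33.annotation.gtf > gencode.v33.transcript_info.tsv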
Obtaining software
FastQC
https://www.bioinformatics.babraham.ac.uk/projects/fastqc; v0.11.9
Download and unzip the file. Change the permission of the fastqc script from 664 (-rw-rw-r--) to 755 (-rwxr-xr-x) so that it can be executed.
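The permission change can be sketched as a small helper. The function name make_fastqc_executable is hypothetical, and the path in the usage line assumes the archive unpacked into a FastQC directory under the current location.

```shell
# Set mode 755 on the fastqc wrapper script and show the resulting permissions.
# $1: path to the fastqc script inside the unpacked archive.
make_fastqc_executable() {
  chmod 755 "$1" && ls -l "$1"
}
```

$ make_fastqc_executable FastQC/fastqc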
BBTools
https://jgi.doe.gov/data-and-tools/bbtools; v38.82
Salmon
https://github.com/COMBINE-lab/salmon; v1.2.0
kallisto
https://github.com/pachterlab/kallisto; v0.46.2
STAR
https://github.com/alexdobin/STAR; v2.7.3a
Download and unpack the file. Use make to compile STAR
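A minimal sketch of the compile step follows, assuming the unpacked archive contains a source subdirectory with the Makefile, as the STAR repository does. The function name build_star and the directory name in the usage line are assumptions.

```shell
# Compile STAR from an unpacked source tree.
# $1: path to the unpacked directory (the Makefile lives in its source/ subdirectory).
build_star() {
  # Run in a subshell so the caller's working directory is unchanged.
  ( cd "$1/source" && make STAR )
}
```

$ build_star STAR-2.7.3a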
RSEM
https://github.com/deweylab/RSEM; v1.3.3
By default, RSEM executables are installed to /usr/local/bin. You can change the installation location by setting DESTDIR and/or prefix variables. The RSEM executables will be installed to ${DESTDIR}${prefix}/bin. The default values of DESTDIR and prefix are DESTDIR= and prefix=/usr/local.
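As a sketch of how DESTDIR and prefix combine, the helper below builds RSEM and installs it to a custom location. The function name install_rsem and the example paths are hypothetical; the make variables are the ones described above.

```shell
# Build RSEM and install its executables to ${DESTDIR}${prefix}/bin.
# $1: unpacked RSEM source directory, $2: DESTDIR, $3: prefix
install_rsem() {
  src=$1; destdir=$2; prefix=$3
  ( cd "$src" && make && make install DESTDIR="$destdir" prefix="$prefix" ) &&
    echo "RSEM executables installed to ${destdir}${prefix}/bin"
}
```

For example, the following would place the executables in /home/USER/software/bin:
$ install_rsem RSEM-1.3.3 /home/USER /software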
Preparing indices
Salmon index
SAF genome index (recommended)
The full genome is used as a decoy. This gives a large index, but one that is least prone to spurious alignments.
Indexing takes under 40 minutes running on 6 threads, and the index size is 17 GB
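The decoy-aware index could be built along the following lines, following the procedure in the Salmon documentation: list the genome sequence names as decoys, concatenate transcripts and genome (transcripts first), then index. The function names, the intermediate file names decoys.txt and gentrome.fa, and the index directory salmon_saf_index are assumptions for illustration.

```shell
GENOME=GRCh38.primary_assembly.genome.fa
TRANSCRIPTS=gencode.v33.transcripts.fa

# Print the sequence names of a FASTA file: header lines without the
# leading ">" and without anything after the first space.
make_decoys() {
  grep '^>' "$1" | cut -d ' ' -f 1 | sed 's/^>//'
}

build_salmon_saf_index() {
  make_decoys "$GENOME" > decoys.txt &&
    # Transcripts must come before the genome (decoy) sequences.
    cat "$TRANSCRIPTS" "$GENOME" > gentrome.fa &&
    # --gencode trims GENCODE transcript names at the first "|".
    salmon index -t gentrome.fa -d decoys.txt -i salmon_saf_index --gencode -p 6
}
```

$ build_salmon_saf_index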
Alternatively, cDNA-only index
No decoy is used. This gives a small index, but one that is most prone to spurious alignments.
Indexing takes under 3 minutes running on 6 threads, and the index size is 782 MB
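A sketch of the cDNA-only build is given here. The function name and the default index directory salmon_cdna_index are hypothetical; the input is the GENCODE transcript file downloaded earlier.

```shell
# Build a Salmon index from transcript sequences only (no decoys).
# $1: transcript FASTA (optional), $2: output index directory (optional)
build_salmon_cdna_index() {
  salmon index -t "${1:-gencode.v33.transcripts.fa}" \
    -i "${2:-salmon_cdna_index}" --gencode -p 6
}
```

$ build_salmon_cdna_index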
kallisto index
Use the header-cleaned transcript file to create the index.
Indexing takes under 6 minutes running on a single thread, and the index size is 3.0 GB
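The kallisto build might look like the sketch below, which takes the header-cleaned FASTA produced earlier. The function name and the default index file name gencode.v33.transcripts.idx are assumptions.

```shell
# Build a kallisto index from the header-cleaned transcript FASTA.
# $1: transcript FASTA (optional), $2: output index file (optional)
build_kallisto_index() {
  kallisto index -i "${2:-gencode.v33.transcripts.idx}" \
    "${1:-gencode.v33.transcripts.clean.fa}"
}
```

$ build_kallisto_index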
STAR index
The --sjdbOverhang option is set to 150 in this tutorial. This value should equal ReadLength-1 for the FASTQ files. For reads of varying length, the STAR manual suggests max(ReadLength)-1 as the ideal value. In most cases, the default value of 100 will work as well as the ideal value.
Indexing takes about 70 minutes running on 6 threads, and the index size is 28 GB
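The two steps described above, picking an overhang and generating the genome index, can be sketched as follows. The function names, the index directory star_index, and the FASTQ name in the usage lines are hypothetical; the STAR options are those of its genomeGenerate run mode.

```shell
# Print max(ReadLength) - 1 for an uncompressed FASTQ read from stdin.
# Sequence lines are every 4th line starting at line 2.
sjdb_overhang() {
  awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
       END { print max - 1 }'
}

# Generate a STAR genome index.
# $1: output index directory, $2: --sjdbOverhang value (defaults to 150)
build_star_index() {
  mkdir -p "$1" &&
    STAR --runMode genomeGenerate --runThreadN 6 --genomeDir "$1" \
      --genomeFastaFiles GRCh38.primary_assembly.genome.fa \
      --sjdbGTFfile gencode.v33.annotation.gtf --sjdbOverhang "${2:-150}"
}
```

$ gunzip -c sample_R1.fastq.gz | sjdb_overhang
$ build_star_index star_index 150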
RSEM index
Indexing takes under 2 minutes running on 6 threads, and the index size is 1.6 GB
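A sketch of the RSEM reference build is shown below. It assumes the reference is prepared from the genome FASTA and the GTF downloaded earlier, without building an aligner index; the function name and the reference prefix rsem_index/GRCh38_v33 are hypothetical.

```shell
# Prepare an RSEM reference from the genome FASTA and GTF annotation.
# $1: reference name prefix (optional); its parent directory is created if needed.
build_rsem_index() {
  ref=${1:-rsem_index/GRCh38_v33}
  mkdir -p "$(dirname "$ref")" &&
    rsem-prepare-reference --gtf gencode.v33.annotation.gtf --num-threads 6 \
      GRCh38.primary_assembly.genome.fa "$ref"
}
```

$ build_rsem_index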