Software and databases

Reference genome

$ cd /home/USER/db/refanno

Obtain from GENCODE

* Latest stable version available at the time of writing

https://www.gencodegenes.org/human/release_33.html; Human v33 (GRCh38)

# Genome sequence, primary assembly
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/GRCh38.primary_assembly.genome.fa.gz
$ gunzip GRCh38.primary_assembly.genome.fa.gz
# Comprehensive gene annotation
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.annotation.gtf.gz
$ gunzip gencode.v33.annotation.gtf.gz
# Transcript sequences
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.transcripts.fa.gz
$ gunzip gencode.v33.transcripts.fa.gz

You can also retrieve mouse data from https://www.gencodegenes.org/mouse/; Mouse vM24 (GRCm38)

Clean up FASTA headers

$ sed 's/|ENSG.*//' gencode.v33.transcripts.fa > gencode.v33.transcripts.clean.fa
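To make the effect of this substitution concrete, the snippet below applies the same sed expression to a single sample header written in the GENCODE `|`-delimited layout. The header is only an illustrative input; the variable name `clean` is an assumption of this sketch.

```shell
# A sample transcript header in the GENCODE "|"-delimited layout.
header='>ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|'

# Same expression as above: drop everything from the first "|ENSG" onwards,
# leaving only the transcript ID.
clean=$(printf '%s\n' "$header" | sed 's/|ENSG.*//')
echo "$clean"   # prints: >ENST00000456328.2
```

Only header lines contain `|ENSG`, so sequence lines pass through unchanged.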

Create tabular info from GTF

We will use awk and sed to convert the gene-level and transcript-level information stored in the GTF file into tabular files for easy access.
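The conversion could be sketched as below. This is a minimal sketch, not necessarily the exact commands used in this tutorial: it uses awk alone, assumes the usual GENCODE attribute layout (`key "value";` pairs in column 9), and the function name `gtf_to_table`, the chosen columns, and the output file names are all hypothetical.

```shell
# Print one tab-separated row per feature of a given type.
# $1 = GTF file, $2 = feature type (gene or transcript),
# remaining arguments = attribute keys to extract from column 9.
gtf_to_table() {
    gtf=$1; feature=$2; shift 2
    awk -F '\t' -v feature="$feature" -v keys="$*" '
        BEGIN { OFS = "\t"; n = split(keys, k, " ") }
        $0 !~ /^#/ && $3 == feature {
            # chromosome, start, end, strand
            row = $1 OFS $4 OFS $5 OFS $7
            for (i = 1; i <= n; i++) {
                value = "NA"
                # attributes are written as: key "value";
                if (match($9, "(^|; )" k[i] " \"[^\"]*\"")) {
                    value = substr($9, RSTART, RLENGTH)
                    sub(/^[^"]*"/, "", value)   # strip up to the opening quote
                    sub(/"$/, "", value)        # strip the closing quote
                }
                row = row OFS value
            }
            print row
        }' "$gtf"
}
```

With the function defined, gene-level and transcript-level tables could be written like this:

$ gtf_to_table gencode.v33.annotation.gtf gene gene_id gene_type gene_name > gencode.v33.gene_info.tsv
$ gtf_to_table gencode.v33.annotation.gtf transcript transcript_id gene_id transcript_type transcript_name > gencode.v33.transcript_info.tsv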

Obtaining software

You can place the executables, or symbolic links to them, in a directory listed in $PATH, or add their location to $PATH, so that you do not need to specify full paths when running the programs. All code provided assumes the executables can be found in $PATH.
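One way to set this up is sketched below. The directory `$HOME/bin` and the example link target are assumptions for illustration; any directory you own will do.

```shell
# Hypothetical directory for executables and symbolic links.
BIN_DIR="$HOME/bin"
mkdir -p "$BIN_DIR"

# Prepend it to PATH for the current session unless it is already listed.
case ":$PATH:" in
    *":$BIN_DIR:"*) ;;
    *) PATH="$BIN_DIR:$PATH"; export PATH ;;
esac

# Example link; the source path is a placeholder for wherever you unpacked the tool.
# ln -s /home/USER/tools/FastQC/fastqc "$BIN_DIR/fastqc"
```

To keep the change across sessions, the same PATH line can be added to your shell start-up file (for example ~/.bashrc when using bash).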

* Latest stable version available at the time of writing

FastQC

https://www.bioinformatics.babraham.ac.uk/projects/fastqc; v0.11.9

Download and unzip the file. Change the permissions of the fastqc script from 664 (-rw-rw-r--) to 755 (-rwxr-xr-x), for example with chmod 755 fastqc, so that it can be executed.

BBTools

https://jgi.doe.gov/data-and-tools/bbtools; v38.82

Salmon

https://github.com/COMBINE-lab/salmon; v1.2.0

kallisto

https://github.com/pachterlab/kallisto; v0.46.2

kallisto is phasing out HDF5 at the time of writing; the binaries for this release (v0.46.2) are compiled with HDF5 built in.

STAR

https://github.com/alexdobin/STAR; v2.7.3a

Download and unpack the file, then run make STAR in the source directory to compile STAR.

RSEM

https://github.com/deweylab/RSEM; v1.3.3

By default, RSEM executables are installed to /usr/local/bin. You can change the installation location by setting DESTDIR and/or prefix variables. The RSEM executables will be installed to ${DESTDIR}${prefix}/bin. The default values of DESTDIR and prefix are DESTDIR= and prefix=/usr/local.
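As a sketch of how the two variables combine, the snippet below installs RSEM under a user-owned location. The values of `DESTDIR` and `prefix`, and the names `RSEM_BIN` and `install_rsem`, are assumptions for illustration.

```shell
# Hypothetical install location; pick directories you can write to.
DESTDIR="$HOME"
prefix=/software/rsem

# Where the executables end up, following ${DESTDIR}${prefix}/bin.
RSEM_BIN="${DESTDIR}${prefix}/bin"
echo "RSEM executables will be installed to $RSEM_BIN"

# Compile and install; run this from inside the unpacked RSEM source directory.
install_rsem() {
    make && make install DESTDIR="$DESTDIR" prefix="$prefix"
}
```

After running install_rsem, $RSEM_BIN needs to be in $PATH for the later commands to find the RSEM executables.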

Preparing indices

Salmon index

SAF genome index (recommended)

  • Full genome is used as decoy; large index but least prone to spurious alignments

  • Indexing takes under 40 minutes running on 6 threads, and the index size is 17 GB

Alternatively, cDNA-only index

  • No decoy used; small index but most prone to spurious alignments

  • Indexing takes under 3 minutes running on 6 threads, and the index size is 782 MB
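The two Salmon indices described above could be built roughly as follows. This is a sketch based on the decoy-aware workflow in the Salmon documentation, not necessarily the exact commands used here; the function names and output names are hypothetical. The `--gencode` flag tells Salmon to keep only the transcript ID from GENCODE-style headers.

```shell
# Prepare inputs for the decoy-aware (full genome) index.
# $1 = genome FASTA, $2 = transcript FASTA, $3 = output directory
prepare_salmon_decoys() {
    mkdir -p "$3"
    # Decoy names are the genome sequence IDs: header text up to the first space, without ">".
    grep '^>' "$1" | cut -d ' ' -f 1 | sed 's/^>//' > "$3/decoys.txt"
    # Transcripts must come before the decoy (genome) sequences.
    cat "$2" "$1" > "$3/gentrome.fa"
}

# Decoy-aware index.
# $1 = directory written by prepare_salmon_decoys, $2 = index directory, $3 = threads
build_salmon_saf_index() {
    salmon index -t "$1/gentrome.fa" -d "$1/decoys.txt" -i "$2" -p "$3" --gencode
}

# cDNA-only alternative (no decoys).
# $1 = transcript FASTA, $2 = index directory, $3 = threads
build_salmon_cdna_index() {
    salmon index -t "$1" -i "$2" -p "$3" --gencode
}
```

For example, using the files downloaded earlier:

$ prepare_salmon_decoys GRCh38.primary_assembly.genome.fa gencode.v33.transcripts.fa salmon_saf_input
$ build_salmon_saf_index salmon_saf_input salmon_index_saf 6
$ build_salmon_cdna_index gencode.v33.transcripts.fa salmon_index_cdna 6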

kallisto index

  • Use header-cleaned transcript file to create index

  • Indexing takes under 6 minutes running on a single thread, and the index size is 3.0 GB
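A possible way to build the kallisto index is sketched below. The wrapper name `build_kallisto_index` and the index file name are hypothetical; the wrapper also refuses to run if the headers still contain `|`, which would suggest the clean-up step was skipped.

```shell
# $1 = index file to create, $2 = header-cleaned transcript FASTA
build_kallisto_index() {
    # Cleaned headers contain only the transcript ID, so no "|" should remain.
    if grep -q '^>.*|' "$2"; then
        echo "Headers in $2 still contain '|'; run the header clean-up first." >&2
        return 1
    fi
    kallisto index -i "$1" "$2"
}
```

For example:

$ build_kallisto_index gencode.v33.transcripts.idx gencode.v33.transcripts.clean.fa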

STAR index

The --sjdbOverhang option is set to 150 in this tutorial. This value should equal ReadLength-1 for the FASTQ files. The STAR manual suggests max(ReadLength)-1 as the ideal value when reads vary in length. In most cases, the default value of 100 will work as well as the ideal value.

Indexing takes about 70 minutes running on 6 threads, and the index size is 28 GB
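The index generation could look like the sketch below, which also includes a small helper for finding the longest read in an uncompressed FASTQ file so that --sjdbOverhang can be derived from it. The function names and the index directory are hypothetical; the genome and annotation file names are those downloaded earlier in this tutorial.

```shell
# Longest read length in uncompressed FASTQ input (file arguments or stdin).
# Sequence lines are the 2nd line of every 4-line record.
max_read_length() {
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
         END { print max + 0 }' "$@"
}

# Build a STAR genome index.
# $1 = index directory, $2 = value for --sjdbOverhang, $3 = threads
build_star_index() {
    mkdir -p "$1"
    STAR --runMode genomeGenerate \
         --genomeDir "$1" \
         --genomeFastaFiles GRCh38.primary_assembly.genome.fa \
         --sjdbGTFfile gencode.v33.annotation.gtf \
         --sjdbOverhang "$2" \
         --runThreadN "$3"
}
```

For example, with a placeholder FASTQ file name:

$ overhang=$(( $(max_read_length sample_R1.fastq) - 1 ))
$ build_star_index star_index "$overhang" 6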

RSEM index

Indexing takes under 2 minutes running on 6 threads, and the index size is 1.6 GB
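A sketch of building the RSEM reference from the genome and annotation downloaded earlier is given below. The wrapper name, output directory, and reference name are hypothetical, and this sketch builds the RSEM reference only, without any aligner index.

```shell
# $1 = output directory, $2 = reference name (file prefix), $3 = threads
build_rsem_index() {
    mkdir -p "$1"
    # Transcripts are extracted from the genome using the GTF annotation.
    rsem-prepare-reference --gtf gencode.v33.annotation.gtf -p "$3" \
        GRCh38.primary_assembly.genome.fa "$1/$2"
}
```

For example:

$ build_rsem_index rsem_index GRCh38_v33 6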
