Create mapping indices
Before we can map NGS reads, we need to create the genome indices, using the genome FASTA file as input. You can re-use these indices for all your future RNA-seq mapping. However, if you wish to map to a different genome build/assembly, you have to re-run this step with the new genome sequences and save the indices in a different directory. You might also need to build new indices when using a newer version of the software; always check the relevant README or CHANGELOG.
Here, we will create indices for STAR and RSEM.
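One way to notice when indices were built with an older software version is to store the version string next to them. The sketch below is only an illustration: the file name INDEX_VERSION.txt and the helper names record_index_version and index_is_current are hypothetical and are not part of STAR or RSEM.

```shell
# Hypothetical helpers: remember which tool version built an index directory.

# record_index_version DIR VERSION
# Writes VERSION into DIR/INDEX_VERSION.txt (a file name assumed for this sketch).
record_index_version() {
    local dir="$1" version="$2"
    printf '%s\n' "$version" > "$dir/INDEX_VERSION.txt"
}

# index_is_current DIR VERSION
# Succeeds (exit status 0) only if DIR was recorded as built with exactly VERSION.
index_is_current() {
    local dir="$1" version="$2"
    [ -f "$dir/INDEX_VERSION.txt" ] && [ "$(cat "$dir/INDEX_VERSION.txt")" = "$version" ]
}
```

Once an index directory exists, record_index_version GENOME_data/star "$(STAR --version)" would store the version, and index_is_current GENOME_data/star "$(STAR --version)" could be run before mapping. A version mismatch does not always mean the indices must be rebuilt, so the README or CHANGELOG remains the authority.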

STAR

Usage

STAR --runMode genomeGenerate --genomeDir path_to_genomedir --genomeFastaFiles reference_fasta_file(s)

Execute

cd ~/LSLNGS2015
mkdir GENOME_data/star

STAR --runThreadN 40 --runMode genomeGenerate --genomeDir GENOME_data/star \
  --genomeFastaFiles GENOME_data/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
  --sjdbGTFfile GENOME_data/Homo_sapiens.GRCh38.86.gtf

Resource usage

ALPS Queue Name    CPU Time         Max Memory    Duration
128G               51610.02 sec.    31 GB         1 hour 15 minutes 6 seconds
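Dividing the CPU time by the wall-clock duration gives a rough idea of how many cores were busy on average. The helper below is a small sketch of that arithmetic; the function name avg_cores is made up for this example.

```shell
# avg_cores CPU_SECONDS HOURS MINUTES SECONDS
# Prints CPU time divided by wall-clock time, i.e. the average number of busy cores.
avg_cores() {
    awk -v cpu="$1" -v h="$2" -v m="$3" -v s="$4" \
        'BEGIN { printf "%.2f\n", cpu / (h * 3600 + m * 60 + s) }'
}

# STAR index build: 51610.02 CPU seconds over 1 hour 15 minutes 6 seconds (4506 s).
avg_cores 51610.02 1 15 6   # prints 11.45
```

For this run, roughly 11 cores were busy on average even though 40 threads were requested with --runThreadN, which suggests that parts of the index generation do not run fully in parallel.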

Options

--runThreadN defines the number of threads to be used for genome generation.
--runMode genomeGenerate directs STAR to run the genome index generation job.
--genomeDir path to the directory where the genome indices are stored. This directory has to be created (with mkdir) before the STAR run and needs to have write permissions. The file system needs to have at least 100GB of disk space available for a typical mammalian genome.
--genomeFastaFiles one or more FASTA files with the genome reference sequences.
--sjdbGTFfile path to the transcript annotation in the standard GTF format.
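The --genomeDir requirements above (the directory exists, is writable, and has enough free disk space) can be checked before starting a long job. The function below is a minimal sketch; check_genome_dir is a hypothetical helper and not a STAR command, and the 100 GB default simply mirrors the figure quoted above.

```shell
# check_genome_dir DIR [MIN_FREE_GB]
# Hypothetical pre-flight check for --genomeDir: DIR must exist, be writable,
# and sit on a file system with at least MIN_FREE_GB (default 100) GB free.
check_genome_dir() {
    local dir="$1" min_gb="${2:-100}" free_kb
    [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }
    [ -w "$dir" ] || { echo "no write permission: $dir" >&2; return 1; }
    # df -Pk prints one POSIX-format line per file system; column 4 is the free space in KB.
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    if [ "$free_kb" -lt $(( min_gb * 1024 * 1024 )) ]; then
        echo "less than ${min_gb} GB free under $dir" >&2
        return 1
    fi
}
```

For example, check_genome_dir GENOME_data/star && STAR --runMode genomeGenerate ... would only start the index generation when all three conditions hold.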

Take a look at the STAR indices generated

ls -la ~/LSLNGS2015/GENOME_data/star
-rw-rw-r-- 1 ycl6 ycl6 1200 Oct 25 10:32 chrLength.txt
-rw-rw-r-- 1 ycl6 ycl6 3123 Oct 25 10:32 chrNameLength.txt
-rw-rw-r-- 1 ycl6 ycl6 1923 Oct 25 10:32 chrName.txt
-rw-rw-r-- 1 ycl6 ycl6 2129 Oct 25 10:32 chrStart.txt
-rw-rw-r-- 1 ycl6 ycl6 41854837 Oct 25 11:22 exonGeTrInfo.tab
-rw-rw-r-- 1 ycl6 ycl6 16985258 Oct 25 11:22 exonInfo.tab
-rw-rw-r-- 1 ycl6 ycl6 928822 Oct 25 11:22 geneInfo.tab
-rw-rw-r-- 1 ycl6 ycl6 3208868819 Oct 25 11:27 Genome
-rw-rw-r-- 1 ycl6 ycl6 645 Oct 25 10:32 genomeParameters.txt
-rw-rw-r-- 1 ycl6 ycl6 24881828956 Oct 25 11:29 SA
-rw-rw-r-- 1 ycl6 ycl6 1565873619 Oct 25 11:29 SAindex
-rw-rw-r-- 1 ycl6 ycl6 10235563 Oct 25 11:22 sjdbInfo.txt
-rw-rw-r-- 1 ycl6 ycl6 8020752 Oct 25 11:22 sjdbList.fromGTF.out.tab
-rw-rw-r-- 1 ycl6 ycl6 8019182 Oct 25 11:22 sjdbList.out.tab
-rw-rw-r-- 1 ycl6 ycl6 11688566 Oct 25 11:22 transcriptInfo.tab
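To see how much disk space the finished indices occupy, the size column of the listing can be summed. The following is a small sketch; sum_ls_bytes is a made-up helper name, and it assumes the usual ls -l layout in which the fifth column holds the size in bytes.

```shell
# sum_ls_bytes: read `ls -l` output on stdin and print the total size in bytes.
# Only regular files (lines whose mode string starts with '-') are counted.
sum_ls_bytes() {
    awk '$1 ~ /^-/ { total += $5 } END { printf "%.0f\n", total }'
}
```

Running ls -l ~/LSLNGS2015/GENOME_data/star | sum_ls_bytes on the files listed above would give just under 30 GB, most of which is the SA file (24881828956 bytes).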

RSEM

Usage

rsem-prepare-reference [options] reference_fasta_file(s) reference_name

Execute

cd ~/LSLNGS2015
mkdir GENOME_data/rsem

rsem-prepare-reference --gtf GENOME_data/Homo_sapiens.GRCh38.86.gtf \
  GENOME_data/Homo_sapiens.GRCh38.dna.primary_assembly.fa GENOME_data/rsem/GRCh38

Resource usage

ALPS Queue Name    CPU Time       Max Memory    Duration
48G                140.03 sec.    2 GB          2 minutes and 26 seconds

Options

--gtf specifies the path to the gene annotations (in GTF format). When this option is given, RSEM assumes the FASTA file contains the sequence of a genome. If this option is off, RSEM assumes the FASTA file contains the reference transcripts; in that case, the name of each sequence in the Multi-FASTA file is taken as its transcript_id.
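When a transcript FASTA is supplied without --gtf, it can be worth listing the sequence names first to confirm that they look like transcript IDs. The helper below is a sketch that assumes the sequence name is the first whitespace-delimited word of each header line; fasta_ids is a hypothetical name, not an RSEM tool.

```shell
# fasta_ids FILE
# Prints the name of every sequence in FILE: the first word after '>' on each header line.
fasta_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}
```

For instance, fasta_ids transcripts.fa | head (where transcripts.fa stands for your own file) shows the first few names that would be used as transcript_id values.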

Take a look at the RSEM indices generated

ls -la ~/LSLNGS2015/GENOME_data/rsem
-rw-rw-r-- 1 ycl6 ycl6 696 Oct 25 08:12 GRCh38.chrlist
-rw-rw-r-- 1 ycl6 ycl6 393402 Oct 25 08:12 GRCh38.grp
-rw-rw-r-- 1 ycl6 ycl6 297970512 Oct 25 08:12 GRCh38.idx.fa
-rw-rw-r-- 1 ycl6 ycl6 297970512 Oct 25 08:12 GRCh38.n2g.idx.fa
-rw-rw-r-- 1 ycl6 ycl6 318104358 Oct 25 08:12 GRCh38.seq
-rw-rw-r-- 1 ycl6 ycl6 135696331 Oct 25 08:12 GRCh38.ti
-rw-rw-r-- 1 ycl6 ycl6 297970512 Oct 25 08:12 GRCh38.transcripts.fa
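As the listing shows, the reference_name GENOME_data/rsem/GRCh38 is a file-name prefix rather than a directory: every output file is named GRCh38.* inside GENOME_data/rsem. The sketch below splits such a prefix into its directory part and its file-name part; ensure_ref_parent is a hypothetical helper written for this illustration.

```shell
# ensure_ref_parent PREFIX
# Creates the directory part of PREFIX if it does not exist yet,
# then prints the file-name prefix that the reference files will share.
ensure_ref_parent() {
    local prefix="$1"
    mkdir -p "$(dirname "$prefix")" && basename "$prefix"
}
```

Calling ensure_ref_parent GENOME_data/rsem/GRCh38 creates GENOME_data/rsem when needed and prints GRCh38.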