Create mapping indices
Before we can map NGS reads, we need to create genome indices, using the genome FASTA file as input. You can re-use these indices for all your future RNA-seq mapping. However, if you wish to map to a different genome build or assembly, you have to re-run this step with the new genome sequences and save the indices in a different directory. You might also need to build new indices when using a newer version of the software; always check the relevant README or CHANGELOG.
Here, we will create indices for STAR and RSEM.
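Because indices are tied to a genome build and may need rebuilding after a software upgrade, it can help to record what each index directory was built from. The following is a minimal sketch and not part of STAR or RSEM: the function name write_index_manifest and the file name BUILD_INFO.txt are hypothetical choices made for illustration.

```shell
# Hypothetical helper: record what an index directory was built from.
# Arguments: index directory, tool version string, FASTA path, GTF path.
write_index_manifest() {
    index_dir=$1
    tool_version=$2
    fasta=$3
    gtf=$4
    {
        printf 'tool_version\t%s\n' "$tool_version"
        printf 'fasta\t%s\n' "$(basename "$fasta")"
        printf 'gtf\t%s\n' "$(basename "$gtf")"
        printf 'built_on\t%s\n' "$(date +%Y-%m-%d)"
    } > "$index_dir/BUILD_INFO.txt"
}
```

Once an index has been built, the function could be called with the index directory, the output of STAR --version as the version string, and the FASTA and GTF paths used for the build. Comparing that file against the currently installed version is one way to decide whether a rebuild is worth checking.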
STAR
Usage
STAR --runMode genomeGenerate --genomeDir path_to_genomedir --genomeFastaFiles reference_fasta_file(s)
Execute
cd ~/LSLNGS2015
mkdir GENOME_data/star
STAR --runThreadN 40 --runMode genomeGenerate --genomeDir GENOME_data/star \
--genomeFastaFiles GENOME_data/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--sjdbGTFfile GENOME_data/Homo_sapiens.GRCh38.86.gtf
Resource usage
ALPS Queue Name: 128G
CPU Time: 51610.02 sec.
Max Memory: 31 GB
Duration: 1 hour 15 minutes 6 seconds
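As a rough reading of the STAR resource figures, dividing the CPU time by the wall-clock duration gives the average number of cores that were kept busy. The sketch below does that arithmetic with the reported values; the function name avg_cores is a hypothetical helper, and the interpretation assumes the reported CPU time is summed over all threads.

```shell
# Average number of busy cores = CPU time / wall-clock duration.
# Both arguments are in seconds.
avg_cores() {
    awk -v cpu="$1" -v wall="$2" 'BEGIN { printf "%.2f\n", cpu / wall }'
}

# Reported duration: 1 hour 15 minutes 6 seconds.
wall=$((1 * 3600 + 15 * 60 + 6))   # 4506 seconds
avg_cores 51610.02 "$wall"          # prints 11.45
```

Under that assumption, the job averaged roughly 11 busy cores even though 40 threads were requested with --runThreadN, so requesting more threads does not necessarily shorten the run in proportion.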
Options
--runThreadN
defines the number of threads to be used for genome generation.
--runMode genomeGenerate
directs STAR to run a genome index generation job.
--genomeDir
path to the directory where the genome indices are stored. This directory has to be created (with mkdir) before the STAR run and needs write permissions. The file system needs at least 100 GB of free disk space for a typical mammalian genome.
--genomeFastaFiles
one or more FASTA files with the genome reference sequences.
--sjdbGTFfile
path to the transcript annotation in the standard GTF format.
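The requirements on --genomeDir (the directory exists, is writable, and sits on a file system with enough free space) can be checked before starting a job that runs for over an hour. This is a minimal sketch: check_genome_dir is a hypothetical helper, not a STAR feature, and the 100 GB default simply mirrors the figure quoted above.

```shell
# Hypothetical pre-flight check for the --genomeDir directory.
# Arguments: directory path, required free space in GB (default 100).
check_genome_dir() {
    dir=$1
    need_gb=${2:-100}
    if [ ! -d "$dir" ]; then
        echo "missing: $dir (create it with mkdir)" >&2
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "not writable: $dir" >&2
        return 1
    fi
    # df -Pk reports available space in 1024-byte blocks in column 4.
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    need_kb=$((need_gb * 1024 * 1024))
    if [ "$free_kb" -lt "$need_kb" ]; then
        echo "only $((free_kb / 1024 / 1024)) GB free, need $need_gb GB" >&2
        return 1
    fi
    echo "ok: $dir"
}
```

For the directory used in this tutorial, the call would be check_genome_dir GENOME_data/star, run from ~/LSLNGS2015 after the mkdir step.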
Take a look at the STAR indices generated
ls -la ~/LSLNGS2015/GENOME_data/star
-rw-rw-r-- 1 ycl6 ycl6 1200 Oct 25 10:32 chrLength.txt
-rw-rw-r-- 1 ycl6 ycl6 3123 Oct 25 10:32 chrNameLength.txt
-rw-rw-r-- 1 ycl6 ycl6 1923 Oct 25 10:32 chrName.txt
-rw-rw-r-- 1 ycl6 ycl6 2129 Oct 25 10:32 chrStart.txt
-rw-rw-r-- 1 ycl6 ycl6 41854837 Oct 25 11:22 exonGeTrInfo.tab
-rw-rw-r-- 1 ycl6 ycl6 16985258 Oct 25 11:22 exonInfo.tab
-rw-rw-r-- 1 ycl6 ycl6 928822 Oct 25 11:22 geneInfo.tab
-rw-rw-r-- 1 ycl6 ycl6 3208868819 Oct 25 11:27 Genome
-rw-rw-r-- 1 ycl6 ycl6 645 Oct 25 10:32 genomeParameters.txt
-rw-rw-r-- 1 ycl6 ycl6 24881828956 Oct 25 11:29 SA
-rw-rw-r-- 1 ycl6 ycl6 1565873619 Oct 25 11:29 SAindex
-rw-rw-r-- 1 ycl6 ycl6 10235563 Oct 25 11:22 sjdbInfo.txt
-rw-rw-r-- 1 ycl6 ycl6 8020752 Oct 25 11:22 sjdbList.fromGTF.out.tab
-rw-rw-r-- 1 ycl6 ycl6 8019182 Oct 25 11:22 sjdbList.out.tab
-rw-rw-r-- 1 ycl6 ycl6 11688566 Oct 25 11:22 transcriptInfo.tab
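A quick way to confirm that a genome generation run finished is to check that the main files from the listing are present and non-empty. The sketch below is an assumption-laden helper: check_star_index is a hypothetical name, and the set of files it checks is a subset picked from the listing above, not an official list of what STAR requires.

```shell
# Hypothetical sanity check: report index files that are missing or empty.
# Argument: path to the STAR index directory.
check_star_index() {
    dir=$1
    status=0
    # File names taken from the directory listing above (assumed subset).
    for f in Genome SA SAindex genomeParameters.txt chrName.txt chrLength.txt; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return $status
}
```

Calling check_star_index ~/LSLNGS2015/GENOME_data/star returns a non-zero status and names the offending files if any of them is absent or has zero size.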
RSEM
Usage
rsem-prepare-reference [options] reference_fasta_file(s) reference_name
Execute
cd ~/LSLNGS2015
mkdir GENOME_data/rsem
rsem-prepare-reference --gtf GENOME_data/Homo_sapiens.GRCh38.86.gtf \
GENOME_data/Homo_sapiens.GRCh38.dna.primary_assembly.fa GENOME_data/rsem/GRCh38
Resource usage
ALPS Queue Name: 48G
CPU Time: 140.03 sec.
Max Memory: 2 GB
Duration: 2 minutes and 26 seconds
Options
--gtf
specifies the path to the gene annotations (in GTF format); with this option, RSEM assumes the FASTA file contains the sequence of a genome. If this option is not set, RSEM assumes the FASTA file contains the reference transcripts, and the name of each sequence in the multi-FASTA file is taken as its transcript_id.
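Since sequence names double as transcript IDs when --gtf is not given, it can be useful to see which names a FASTA file actually carries. The sketch below prints them; fasta_names is a hypothetical helper, and it assumes the name is the text between ">" and the first whitespace on each header line.

```shell
# Print the name of every sequence in a FASTA file:
# the text after ">" up to the first whitespace.
# Argument: path to the FASTA file.
fasta_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}
```

Piping the output through head or wc -l gives a quick look at the names or a count of sequences, for example on GENOME_data/rsem/GRCh38.transcripts.fa once the reference has been prepared.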
Take a look at the RSEM indices generated
ls -la ~/LSLNGS2015/GENOME_data/rsem
-rw-rw-r-- 1 ycl6 ycl6 696 Oct 25 08:12 GRCh38.chrlist
-rw-rw-r-- 1 ycl6 ycl6 393402 Oct 25 08:12 GRCh38.grp
-rw-rw-r-- 1 ycl6 ycl6 297970512 Oct 25 08:12 GRCh38.idx.fa
-rw-rw-r-- 1 ycl6 ycl6 297970512 Oct 25 08:12 GRCh38.n2g.idx.fa
-rw-rw-r-- 1 ycl6 ycl6 318104358 Oct 25 08:12 GRCh38.seq
-rw-rw-r-- 1 ycl6 ycl6 135696331 Oct 25 08:12 GRCh38.ti
-rw-rw-r-- 1 ycl6 ycl6 297970512 Oct 25 08:12 GRCh38.transcripts.fa