# Create mapping indices

Before we can map NGS reads, we need to create the genome indices, using the genome FASTA file as input. You can re-use these indices for all your future RNA-seq mapping against the same genome. However, if you wish to map to a different genome build/assembly, you have to re-run this step with the new genome sequences and save the indices in a different directory. You might also need to build new indices when using a newer version of the software; always check the relevant README or CHANGELOG.
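One way to keep indices for different genome builds and software versions apart is to encode both in the directory name. The helper below is only a sketch: the function name `index_dir` and the directory layout are assumptions for illustration and are not part of STAR or RSEM. The version string is passed in by hand; for STAR it could be taken from the output of `STAR --version`.

```shell
# Hypothetical helper (not part of STAR or RSEM): build a directory name that
# records the tool, the genome build, and the tool version, so that indices
# for different builds or versions never overwrite each other.
index_dir() {
    local base="$1" tool="$2" build="$3" version="$4"
    printf '%s/%s/%s_v%s\n' "$base" "$tool" "$build" "$version"
}
```

For example, `index_dir GENOME_data star GRCh38 2.5.2b` prints `GENOME_data/star/GRCh38_v2.5.2b` (the version number here is just a placeholder). The rest of this page keeps the simpler `GENOME_data/star` and `GENOME_data/rsem` layout.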

Here, we will create indices for [STAR](https://github.com/alexdobin/STAR) and [RSEM](https://github.com/deweylab/RSEM).

## STAR

### Usage

`STAR --runMode genomeGenerate --genomeDir path_to_genomedir --genomeFastaFiles reference_fasta_file(s)`

### Execute

```
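# Work from the course directory and create the output directory for the indices
cd ~/LSLNGS2015
mkdir -p GENOME_data/star

# Build the STAR indices from the genome FASTA file and the GTF annotation
STAR --runThreadN 40 --runMode genomeGenerate --genomeDir GENOME_data/star \
--genomeFastaFiles GENOME_data/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--sjdbGTFfile GENOME_data/Homo_sapiens.GRCh38.86.gtf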
```

### Resource usage

| ALPS Queue Name | CPU Time      | Max Memory | Duration                    |
| --------------- | ------------- | ---------- | --------------------------- |
| 128G            | 51610.02 sec. | 31 GB      | 1 hour 15 minutes 6 seconds |

### Options

`--runThreadN` defines the number of threads to be used for genome generation.

`--runMode genomeGenerate` directs STAR to run the genome index generation job.

`--genomeDir` path to the directory where the genome indices are stored. This directory has to be created (with `mkdir`) before the STAR run, and you need write permission for it. The file system needs to have at least 100 GB of disk space available for a typical mammalian genome.

`--genomeFastaFiles` one or more FASTA files with the genome reference sequences.

`--sjdbGTFfile` path to the transcript annotation in the standard GTF format.
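The requirements on `--genomeDir` (the directory exists, is writable, and has enough free disk space) can be checked before launching a long job. The following is a minimal sketch; `check_genome_dir` is a hypothetical helper written for this page, not a STAR command, and it relies on the POSIX-format output of `df -Pk`.

```shell
# Hypothetical pre-flight check: check_genome_dir DIR MIN_GB
# Returns 0 if DIR exists, is writable, and has at least MIN_GB of free space.
check_genome_dir() {
    local dir="$1" min_gb="$2"
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "no write permission: $dir" >&2
        return 1
    fi
    # df -P gives one line per file system; -k reports 1024-byte blocks.
    # Column 4 of the second line is the available space.
    local avail_kb need_kb
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    need_kb=$(( min_gb * 1024 * 1024 ))
    if [ "$avail_kb" -lt "$need_kb" ]; then
        echo "only $(( avail_kb / 1024 / 1024 )) GB free in $dir, need $min_gb GB" >&2
        return 1
    fi
    return 0
}
```

A possible use is `check_genome_dir GENOME_data/star 100 && STAR --runMode genomeGenerate ...`, so that STAR only starts when the check passes.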

### Take a look at the STAR indices generated

`ls -la ~/LSLNGS2015/GENOME_data/star`

```
-rw-rw-r-- 1 ycl6 ycl6        1200 Oct 25 10:32 chrLength.txt
-rw-rw-r-- 1 ycl6 ycl6        3123 Oct 25 10:32 chrNameLength.txt
-rw-rw-r-- 1 ycl6 ycl6        1923 Oct 25 10:32 chrName.txt
-rw-rw-r-- 1 ycl6 ycl6        2129 Oct 25 10:32 chrStart.txt
-rw-rw-r-- 1 ycl6 ycl6    41854837 Oct 25 11:22 exonGeTrInfo.tab
-rw-rw-r-- 1 ycl6 ycl6    16985258 Oct 25 11:22 exonInfo.tab
-rw-rw-r-- 1 ycl6 ycl6      928822 Oct 25 11:22 geneInfo.tab
-rw-rw-r-- 1 ycl6 ycl6  3208868819 Oct 25 11:27 Genome
-rw-rw-r-- 1 ycl6 ycl6         645 Oct 25 10:32 genomeParameters.txt
-rw-rw-r-- 1 ycl6 ycl6 24881828956 Oct 25 11:29 SA
-rw-rw-r-- 1 ycl6 ycl6  1565873619 Oct 25 11:29 SAindex
-rw-rw-r-- 1 ycl6 ycl6    10235563 Oct 25 11:22 sjdbInfo.txt
-rw-rw-r-- 1 ycl6 ycl6     8020752 Oct 25 11:22 sjdbList.fromGTF.out.tab
-rw-rw-r-- 1 ycl6 ycl6     8019182 Oct 25 11:22 sjdbList.out.tab
-rw-rw-r-- 1 ycl6 ycl6    11688566 Oct 25 11:22 transcriptInfo.tab
```
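Instead of inspecting the listing by eye, a script can confirm that the main index files are present and non-empty. This is a sketch only: `missing_files` is a hypothetical helper, and the file names passed to it are simply taken from the listing above.

```shell
# Hypothetical helper: missing_files DIR FILE...
# Prints the name of every FILE that is absent or empty in DIR.
missing_files() {
    local dir="$1" f
    shift
    for f in "$@"; do
        # -s is true only if the file exists and has a size greater than zero
        [ -s "$dir/$f" ] || echo "$f"
    done
    return 0
}
```

For example, `missing_files ~/LSLNGS2015/GENOME_data/star Genome SA SAindex genomeParameters.txt` prints nothing when all four files are in place.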

## RSEM

### Usage

`rsem-prepare-reference [options] reference_fasta_file(s) reference_name`

### Execute

```
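# Work from the course directory and create the output directory for the indices
cd ~/LSLNGS2015
mkdir -p GENOME_data/rsem

# Build the RSEM reference; the last argument is the reference name (path prefix)
rsem-prepare-reference --gtf GENOME_data/Homo_sapiens.GRCh38.86.gtf \
GENOME_data/Homo_sapiens.GRCh38.dna.primary_assembly.fa GENOME_data/rsem/GRCh38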
```

### Resource usage

| ALPS Queue Name | CPU Time    | Max Memory | Duration                 |
| --------------- | ----------- | ---------- | ------------------------ |
| 48G             | 140.03 sec. | 2 GB       | 2 minutes and 26 seconds |

### Options

`--gtf` specifies the path to the gene annotations (in GTF format). When this option is given, RSEM assumes the FASTA file contains the sequence of a genome and extracts the transcript sequences using the annotation. Without this option, RSEM assumes the FASTA file contains the reference transcripts, and takes the name of each sequence in the multi-FASTA file as its transcript\_id.
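The two modes differ only in whether `--gtf` is present on the command line. The sketch below prints the command for either case without running it; `rsem_prepare_cmd` is a hypothetical wrapper made up for illustration, and it follows the usage line shown above.

```shell
# Hypothetical dry-run helper: rsem_prepare_cmd FASTA REF_NAME [GTF]
# Prints the rsem-prepare-reference command line without executing it.
rsem_prepare_cmd() {
    local fasta="$1" ref_name="$2" gtf="${3:-}"
    if [ -n "$gtf" ]; then
        # Genome mode: FASTA holds the genome, transcripts come from the GTF
        printf 'rsem-prepare-reference --gtf %s %s %s\n' "$gtf" "$fasta" "$ref_name"
    else
        # Transcript mode: FASTA already holds one sequence per transcript
        printf 'rsem-prepare-reference %s %s\n' "$fasta" "$ref_name"
    fi
}
```

Calling `rsem_prepare_cmd transcripts.fa GENOME_data/rsem/mytx` (both names are placeholders) prints the transcript-mode command, while adding a GTF path as a third argument prints the genome-mode command used on this page.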

### Take a look at the RSEM indices generated

`ls -la ~/LSLNGS2015/GENOME_data/rsem`

```
-rw-rw-r-- 1 ycl6 ycl6       696 Oct 25 08:12 GRCh38.chrlist
-rw-rw-r-- 1 ycl6 ycl6    393402 Oct 25 08:12 GRCh38.grp
-rw-rw-r-- 1 ycl6 ycl6 297970512 Oct 25 08:12 GRCh38.idx.fa
-rw-rw-r-- 1 ycl6 ycl6 297970512 Oct 25 08:12 GRCh38.n2g.idx.fa
-rw-rw-r-- 1 ycl6 ycl6 318104358 Oct 25 08:12 GRCh38.seq
-rw-rw-r-- 1 ycl6 ycl6 135696331 Oct 25 08:12 GRCh38.ti
-rw-rw-r-- 1 ycl6 ycl6 297970512 Oct 25 08:12 GRCh38.transcripts.fa
```
