# Create mapping indices

Before we can perform NGS read mapping, we need to create the genome indices using the genome FASTA file as input. You can re-use these indices in all your future RNA-seq mapping. However, if you wish to map to a different genome build/assembly, you have to re-run this step using the corresponding genome sequences and save the indices in a different directory. You might also need to build new indices when using a newer version of the software; always check the relevant README or CHANGELOG.

Here, we will create indices for [STAR](https://github.com/alexdobin/STAR) and [RSEM](https://github.com/deweylab/RSEM).

## STAR

### Usage

`STAR --runMode genomeGenerate --genomeDir path_to_genomedir --genomeFastaFiles reference_fasta_file(s)`

### Execute

```
cd ~/LSLNGS2015
mkdir GENOME_data/star

STAR --runThreadN 40 --runMode genomeGenerate --genomeDir GENOME_data/star \
--genomeFastaFiles GENOME_data/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--sjdbGTFfile GENOME_data/Homo_sapiens.GRCh38.86.gtf
```

### Resource usage

| ALPS Queue Name | CPU Time      | Max Memory | Duration                    |
| --------------- | ------------- | ---------- | --------------------------- |
| 128G            | 51610.02 sec. | 31 GB      | 1 hour 15 minutes 6 seconds |

### Options

`--runThreadN` defines the number of threads to be used for genome generation.

`--runMode genomeGenerate` directs STAR to run the genome index generation job.

`--genomeDir` path to the directory where the genome indices are stored. This directory has to be created (with `mkdir`) before the STAR run and needs to have write permission. The file system needs to have at least 100 GB of disk space available for a typical mammalian genome.

`--genomeFastaFiles` one or more FASTA files with the genome reference sequences.

`--sjdbGTFfile` path to the transcript annotation in the standard GTF format.
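
The `--genomeDir` requirements above (directory exists, is writable, and has enough free space) can be checked before a long-running job is submitted. The following is a minimal sketch, not part of STAR: `check_genome_dir` is a hypothetical helper name, and the free-space threshold is passed in by the caller.

```shell
# Hypothetical helper (not part of STAR): verify that an index directory
# exists, is writable, and sits on a file system with enough free space.
# Usage: check_genome_dir DIR MIN_FREE_GB
check_genome_dir() {
    dir=$1
    min_gb=$2
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "directory not writable: $dir" >&2
        return 1
    fi
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # field 4 of the second line is the available space.
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 {print $4}')
    free_gb=$((free_kb / 1024 / 1024))
    if [ "$free_gb" -lt "$min_gb" ]; then
        echo "only ${free_gb} GB free under $dir, need ${min_gb} GB" >&2
        return 1
    fi
    echo "ok: $dir is writable with ${free_gb} GB free"
}
```

For the run above, this could be called as `check_genome_dir GENOME_data/star 100` right after the `mkdir` step.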

### Take a look at the STAR indices generated

`ls -la ~/LSLNGS2015/GENOME_data/star`

```
-rw-rw-r-- 1 ycl6 ycl6        1200 Oct 25 10:32 chrLength.txt
-rw-rw-r-- 1 ycl6 ycl6        3123 Oct 25 10:32 chrNameLength.txt
-rw-rw-r-- 1 ycl6 ycl6        1923 Oct 25 10:32 chrName.txt
-rw-rw-r-- 1 ycl6 ycl6        2129 Oct 25 10:32 chrStart.txt
-rw-rw-r-- 1 ycl6 ycl6    41854837 Oct 25 11:22 exonGeTrInfo.tab
-rw-rw-r-- 1 ycl6 ycl6    16985258 Oct 25 11:22 exonInfo.tab
-rw-rw-r-- 1 ycl6 ycl6      928822 Oct 25 11:22 geneInfo.tab
-rw-rw-r-- 1 ycl6 ycl6  3208868819 Oct 25 11:27 Genome
-rw-rw-r-- 1 ycl6 ycl6         645 Oct 25 10:32 genomeParameters.txt
-rw-rw-r-- 1 ycl6 ycl6 24881828956 Oct 25 11:29 SA
-rw-rw-r-- 1 ycl6 ycl6  1565873619 Oct 25 11:29 SAindex
-rw-rw-r-- 1 ycl6 ycl6    10235563 Oct 25 11:22 sjdbInfo.txt
-rw-rw-r-- 1 ycl6 ycl6     8020752 Oct 25 11:22 sjdbList.fromGTF.out.tab
-rw-rw-r-- 1 ycl6 ycl6     8019182 Oct 25 11:22 sjdbList.out.tab
-rw-rw-r-- 1 ycl6 ycl6    11688566 Oct 25 11:22 transcriptInfo.tab
```
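
Instead of inspecting the listing by eye, the presence of the main index files can be tested in a script. This is a sketch under the assumption that the file names match the listing above; `check_star_index` is a hypothetical helper, and the set of files checked is a choice made for illustration, not an official list.

```shell
# Hypothetical helper: confirm that a few core STAR index files from the
# listing above exist and are non-empty. Returns non-zero if any is missing.
# Usage: check_star_index DIR
check_star_index() {
    dir=$1
    rc=0
    for f in Genome SA SAindex genomeParameters.txt chrName.txt chrLength.txt; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            rc=1
        fi
    done
    return $rc
}
```

For example, `check_star_index ~/LSLNGS2015/GENOME_data/star && echo "index looks complete"`.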

## RSEM

### Usage

`rsem-prepare-reference [options] reference_fasta_file(s) reference_name`

### Execute

```
cd ~/LSLNGS2015
mkdir GENOME_data/rsem

rsem-prepare-reference --gtf GENOME_data/Homo_sapiens.GRCh38.86.gtf \
GENOME_data/Homo_sapiens.GRCh38.dna.primary_assembly.fa GENOME_data/rsem/GRCh38
```

### Resource usage

| ALPS Queue Name | CPU Time    | Max Memory | Duration                 |
| --------------- | ----------- | ---------- | ------------------------ |
| 48G             | 140.03 sec. | 2 GB       | 2 minutes and 26 seconds |

### Options

`--gtf` specifies the path to the gene annotations (in GTF format). With this option, RSEM assumes the FASTA file contains the sequence of a genome and extracts the transcript sequences using the annotations. If this option is off, RSEM will assume the FASTA file contains the reference transcripts, and the name of each sequence in the Multi-FASTA files is taken as its transcript\_id.
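
Because the GTF coordinates refer to sequences in the genome FASTA, the sequence names in the two files need to agree (for example `1` versus `chr1`). The sketch below lists GTF sequence names that have no matching FASTA record. `gtf_names_missing_in_fasta` is a hypothetical helper, and it assumes the FASTA name is the header text up to the first whitespace.

```shell
# Hypothetical helper: list sequence names used in a GTF that have no
# matching record in a FASTA file. Prints nothing when all names match.
# Usage: gtf_names_missing_in_fasta GENOME.fa ANNOTATION.gtf
gtf_names_missing_in_fasta() {
    fa_names=$(mktemp)
    gtf_names=$(mktemp)
    # FASTA name = header text after '>' up to the first whitespace.
    grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | sort -u > "$fa_names"
    # GTF sequence name = first tab-separated column of non-comment lines.
    grep -v '^#' "$2" | cut -f1 | sort -u > "$gtf_names"
    # comm -13 keeps only the lines unique to the second (GTF) list.
    comm -13 "$fa_names" "$gtf_names"
    rm -f "$fa_names" "$gtf_names"
}
```

Empty output from `gtf_names_missing_in_fasta GENOME_data/Homo_sapiens.GRCh38.dna.primary_assembly.fa GENOME_data/Homo_sapiens.GRCh38.86.gtf` would suggest the names are consistent.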

### Take a look at the RSEM indices generated

`ls -la ~/LSLNGS2015/GENOME_data/rsem`

```
-rw-rw-r-- 1 ycl6 ycl6       696 Oct 25 08:12 GRCh38.chrlist
-rw-rw-r-- 1 ycl6 ycl6    393402 Oct 25 08:12 GRCh38.grp
-rw-rw-r-- 1 ycl6 ycl6 297970512 Oct 25 08:12 GRCh38.idx.fa
-rw-rw-r-- 1 ycl6 ycl6 297970512 Oct 25 08:12 GRCh38.n2g.idx.fa
-rw-rw-r-- 1 ycl6 ycl6 318104358 Oct 25 08:12 GRCh38.seq
-rw-rw-r-- 1 ycl6 ycl6 135696331 Oct 25 08:12 GRCh38.ti
-rw-rw-r-- 1 ycl6 ycl6 297970512 Oct 25 08:12 GRCh38.transcripts.fa
```
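
A quick sanity check on the RSEM reference is to count the transcript sequences it extracted and compare that with the number of distinct `transcript_id` values in the GTF. This is a sketch only: both function names are hypothetical, and a mismatch is a prompt to look closer, not proof that the build failed.

```shell
# Hypothetical helpers for comparing the RSEM reference with its annotation.

# Count FASTA records (header lines starting with '>').
count_fasta_records() {
    grep -c '^>' "$1"
}

# Count distinct transcript_id attribute values in a GTF file.
count_gtf_transcripts() {
    grep -v '^#' "$1" | grep -o 'transcript_id "[^"]*"' | sort -u | wc -l | tr -d ' '
}
```

For example, compare `count_fasta_records GENOME_data/rsem/GRCh38.transcripts.fa` with `count_gtf_transcripts GENOME_data/Homo_sapiens.GRCh38.86.gtf`.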

