# Mapping with STAR

## Execute

```
cd ~/LSLNGS2015/
mkdir RNASEQ_data/star_GM12878_rep1 RNASEQ_data/star_GM12878_rep2 

STAR --genomeDir GENOME_data/star --readFilesCommand zcat \
--readFilesIn RNASEQ_data/GM12878.rep1.R1.fastq.gz RNASEQ_data/GM12878.rep1.R2.fastq.gz \
--outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within \
--twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM \
--runThreadN 20 --outFileNamePrefix "RNASEQ_data/star_GM12878_rep1/"

STAR --genomeDir GENOME_data/star --readFilesCommand zcat \
--readFilesIn RNASEQ_data/GM12878.rep2.R1.fastq.gz RNASEQ_data/GM12878.rep2.R2.fastq.gz  \
--outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within \
--twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM \
--runThreadN 20 --outFileNamePrefix "RNASEQ_data/star_GM12878_rep2/"

mkdir RNASEQ_data/star_K562_rep1 RNASEQ_data/star_K562_rep2

STAR --genomeDir GENOME_data/star --readFilesCommand zcat \
--readFilesIn RNASEQ_data/K562.rep1.R1.fastq.gz RNASEQ_data/K562.rep1.R2.fastq.gz \
--outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within \
--twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM \
--runThreadN 20 --outFileNamePrefix "RNASEQ_data/star_K562_rep1/"

STAR --genomeDir GENOME_data/star --readFilesCommand zcat \
--readFilesIn RNASEQ_data/K562.rep2.R1.fastq.gz RNASEQ_data/K562.rep2.R2.fastq.gz \
--outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within \
--twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM \
--runThreadN 20 --outFileNamePrefix "RNASEQ_data/star_K562_rep2/"
```

## Resource usage

| Sample  | ALPS Queue Name | CPU Time       | Max Memory | Duration                           |
| ------- | --------------- | -------------- | ---------- | ---------------------------------- |
| GM12878 | 128G            | 123358.00 sec. | 40 GB      | 3 hours 6 minutes 59 seconds       |
| K562    | 128G            | 110262.00 sec. | 40 Gb      | 3 hours, 49 minutes and 52 seconds |

## Options

`--genomeDir` path to the directory where genome files are stored.

`--sjdbGTFfile` skip this if provided during database creation step.

`--readFilesIn` paths to files that contain input read1 (and read2 if PE sequencing).

`--readFilesCommand` command line to execute for each of the input file. For example: `zcat` to uncompress .gz files.

`--outSAMtype` type of output, i.e. SAM or BAM.

`--outFilterMultimapNmax` read alignments will be output only if the read maps fewer than this value, otherwise no alignments will be output. Default is 10.

`--outSAMunmapped` output of unmapped reads in the SAM format, None or Within SAM file.

`--quantMode` types of quantification requested, i.e. GeneCounts and/or TranscriptomeSAM.

`--twopassMode` 2-pass mapping mode. In the first pass, the novel junctions are detected and inserted into the genome indices. In the second pass, all reads will be re-mapped using annotated (from the GTF file) and novel (detected in the first pass) junctions. While this doubles the run time, it significantly increases sensitivity to novel splice junctions.

`--runThreadN` number of threads to run STAR.

`--outFileNamePrefix` output files name prefix.

## Take a look at the STAR alignment files generated

`ls -la ~/LSLNGS2015/RNASEQ_data/star_GM12878_rep1/`

```
-rw-rw-r--  1 ycl6 ycl6 15995801460 Oct 25 12:57 Aligned.sortedByCoord.out.bam
-rw-rw-r--  1 ycl6 ycl6 20352373974 Oct 25 12:36 Aligned.toTranscriptome.out.bam
-rw-rw-r--  1 ycl6 ycl6        1866 Oct 25 12:57 Log.final.out
-rw-rw-r--  1 ycl6 ycl6       27784 Oct 25 12:57 Log.out
-rw-rw-r--  1 ycl6 ycl6        6111 Oct 25 12:57 Log.progress.out
-rw-rw-r--  1 ycl6 ycl6     8360503 Oct 25 12:57 SJ.out.tab
drwx------  2 ycl6 ycl6        4096 Oct 25 11:42 _STARgenome
drwx------  2 ycl6 ycl6        4096 Oct 25 12:00 _STARpass1
```

## Simple statistics with samtools flagstat

`samtools flagstat star_GM12878_rep1/Aligned.sortedByCoord.out.bam`

```
210178300 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
190658902 + 0 mapped (90.71% : N/A)
210178300 + 0 paired in sequencing
105089150 + 0 read1
105089150 + 0 read2
190658902 + 0 properly paired (90.71% : N/A)
190658902 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
```

## Alignment report

The `Log.final.out` shows the mapping statistics, it is very useful for quality control. The statistics are calculated for each read (single- or paired-end) and then summed or averaged over all reads. STAR counts a paired-end read as one read. Most of the information is collected about the UNIQUE mappers. Each splicing is counted in the numbers of splices, and will correspond to summing the counts in `SJ.out.tab`. The mismatch/indel error rates are calculated on a per base basis, i.e. as total number of mismatches/indels in all unique mappers divided by the total number of mapped bases.

`cat ~/LSLNGS2015/RNASEQ_data/star_GM12878_rep1/Log.final.out`

```
                                 Started job on |       Oct 25 11:38:22
                             Started mapping on |       Oct 25 12:03:39
                                    Finished on |       Oct 25 12:57:53
       Mapping speed, Million of reads per hour |       116.26

                          Number of input reads |       105089150
                      Average input read length |       202
                                    UNIQUE READS:
                   Uniquely mapped reads number |       95329451
                        Uniquely mapped reads % |       90.71%
                          Average mapped length |       200.93
                       Number of splices: Total |       42610775
            Number of splices: Annotated (sjdb) |       42438693
                       Number of splices: GT/AG |       42035836
                       Number of splices: GC/AG |       373883
                       Number of splices: AT/AC |       42947
               Number of splices: Non-canonical |       158109
                      Mismatch rate per base, % |       0.29%
                         Deletion rate per base |       0.02%
                        Deletion average length |       1.57
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.49
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       0
             % of reads mapped to multiple loci |       0.00%
        Number of reads mapped to too many loci |       7822366
             % of reads mapped to too many loci |       7.44%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       1.82%
                     % of reads unmapped: other |       0.02%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%
```

## Splice junctions

`SJ.out.tab` contains high confidence collapsed splice junctions in tab-delimited format. STAR defines the junction start/end as intronic bases, other software may define them as exonic bases. The columns have the following meaning:

| Column | Description                                                                                 |
| ------ | ------------------------------------------------------------------------------------------- |
| 1      | chromosome                                                                                  |
| 2      | first base of the intron (1-based)                                                          |
| 3      | last base of the intron (1-based)                                                           |
| 4      | strand (0: undefined, 1: +, 2: -)                                                           |
| 5      | intron motif (0: non-canonical, 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT) |
| 6      | 0: unannotated, 1: annotated (only if splice junctions database is used)                    |
| 7      | number of uniquely mapping reads crossing the junction                                      |
| 8      | number of multi-mapping reads crossing the junction                                         |
| 9      | maximum spliced alignment overhang                                                          |

`head ~/LSLNGS2015/RNASEQ_data/star_GM12878_rep1/SJ.out.tab`

```
1       15039   15795   2       2       1       10      0       35
1       15247   185761  0       0       1       12      0       37
1       16607   186518  2       4       1       4       0       26
1       16766   16857   2       2       1       27      0       42
1       17056   17232   2       2       1       15      0       25
1       17056   17525   2       2       1       2       0       42
1       17056   17914   2       2       1       4       0       19
1       17369   17605   2       2       1       62      0       50
1       17530   187839  0       0       1       4       0       31
1       17743   17914   2       2       1       61      0       50
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://ycl6.gitbook.io/rna-seq-data-analysis/rna-seq_analysis_workflow/mapping_with_star.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
