Mapping with STAR
Execute
cd ~/LSLNGS2015/
mkdir RNASEQ_data/star_GM12878_rep1 RNASEQ_data/star_GM12878_rep2
STAR --genomeDir GENOME_data/star --readFilesCommand zcat \
--readFilesIn RNASEQ_data/GM12878.rep1.R1.fastq.gz RNASEQ_data/GM12878.rep1.R2.fastq.gz \
--outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within \
--twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM \
--runThreadN 20 --outFileNamePrefix "RNASEQ_data/star_GM12878_rep1/"
STAR --genomeDir GENOME_data/star --readFilesCommand zcat \
--readFilesIn RNASEQ_data/GM12878.rep2.R1.fastq.gz RNASEQ_data/GM12878.rep2.R2.fastq.gz \
--outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within \
--twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM \
--runThreadN 20 --outFileNamePrefix "RNASEQ_data/star_GM12878_rep2/"
mkdir RNASEQ_data/star_K562_rep1 RNASEQ_data/star_K562_rep2
STAR --genomeDir GENOME_data/star --readFilesCommand zcat \
--readFilesIn RNASEQ_data/K562.rep1.R1.fastq.gz RNASEQ_data/K562.rep1.R2.fastq.gz \
--outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within \
--twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM \
--runThreadN 20 --outFileNamePrefix "RNASEQ_data/star_K562_rep1/"
STAR --genomeDir GENOME_data/star --readFilesCommand zcat \
--readFilesIn RNASEQ_data/K562.rep2.R1.fastq.gz RNASEQ_data/K562.rep2.R2.fastq.gz \
--outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within \
--twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM \
--runThreadN 20 --outFileNamePrefix "RNASEQ_data/star_K562_rep2/"
Resource usage
| Sample | ALPS Queue Name | CPU Time | Max Memory | Duration |
| --- | --- | --- | --- | --- |
| GM12878 | 128G | 123358.00 sec | 40 GB | 3 hours 6 minutes 59 seconds |
| K562 | 128G | 110262.00 sec | 40 GB | 3 hours 49 minutes 52 seconds |
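The figures above can be turned into a rough measure of how many cores were busy on average, by dividing the CPU time by the wall-clock duration. The sketch below assumes that CPU Time and Duration in each row describe the same job; `avg_cores` is a hypothetical helper name, not part of any tool used here.

```shell
# Hypothetical helper: average number of busy cores = CPU seconds / wall-clock seconds.
# $1 = CPU time in seconds; $2 $3 $4 = duration in hours, minutes, seconds
avg_cores() {
    awk -v cpu="$1" -v h="$2" -v m="$3" -v s="$4" \
        'BEGIN { printf "%.1f\n", cpu / (h * 3600 + m * 60 + s) }'
}

avg_cores 123358 3 6 59     # GM12878
avg_cores 110262 3 49 52    # K562
```

Under that assumption, both values come out well below the 20 threads requested with --runThreadN, which suggests that part of the run time is spent in steps that do not use all threads.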
Options
--genomeDir path to the directory where genome files are stored.
--sjdbGTFfile path to the GTF annotation file. Omit it if the annotation was already provided during the genome index generation step.
--readFilesIn paths to the files that contain the input read 1 (and read 2 for paired-end sequencing).
--readFilesCommand command line to execute for each input file. For example: zcat to uncompress .gz files.
--outSAMtype type of output, e.g. SAM, BAM Unsorted or BAM SortedByCoordinate.
--outFilterMultimapNmax maximum number of loci a read is allowed to map to. Alignments will be output only if the read maps to no more loci than this value, otherwise no alignments will be output. Default is 10; the value 1 used above keeps only uniquely mapping reads.
--outSAMunmapped whether unmapped reads are written to the main SAM/BAM output: None or Within.
--quantMode types of quantification requested, i.e. GeneCounts and/or TranscriptomeSAM.
--twopassMode 2-pass mapping mode. In the first pass, the novel junctions are detected and inserted into the genome indices. In the second pass, all reads will be re-mapped using annotated (from the GTF file) and novel (detected in the first pass) junctions. While this doubles the run time, it significantly increases sensitivity to novel splice junctions.
--runThreadN number of threads to run STAR.
--outFileNamePrefix output files name prefix.
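Since the four STAR runs differ only in the sample name, the options above can also be wrapped in a small loop. This is a minimal sketch, not the way the course ran the jobs: `star_cmd` is a hypothetical helper, and it assumes the same file naming and directory layout as the commands in the Execute section. It only prints the commands (a dry run), so they can be checked before anything is executed.

```shell
# Hypothetical helper (not part of STAR): print the commands for one sample.
# $1 = cell line (e.g. GM12878), $2 = replicate (e.g. rep1)
star_cmd() {
    echo mkdir -p "RNASEQ_data/star_${1}_${2}"
    echo STAR --genomeDir GENOME_data/star --readFilesCommand zcat \
        --readFilesIn "RNASEQ_data/${1}.${2}.R1.fastq.gz" "RNASEQ_data/${1}.${2}.R2.fastq.gz" \
        --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 \
        --outSAMunmapped Within --twopassMode Basic --outFilterMultimapNmax 1 \
        --quantMode TranscriptomeSAM --runThreadN 20 \
        --outFileNamePrefix "RNASEQ_data/star_${1}_${2}/"
}

# Dry run: print one mkdir line and one STAR line per sample.
for cell in GM12878 K562; do
    for rep in rep1 rep2; do
        star_cmd "$cell" "$rep"
    done
done
```

To actually run the printed commands, save the block as a script and pipe its output to bash from within ~/LSLNGS2015/.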
Take a look at the STAR alignment files generated
ls -la ~/LSLNGS2015/RNASEQ_data/star_GM12878_rep1/
Simple statistics with samtools flagstat
samtools flagstat ~/LSLNGS2015/RNASEQ_data/star_GM12878_rep1/Aligned.sortedByCoord.out.bam
Alignment report
Log.final.out contains the mapping statistics and is very useful for quality control. The statistics are calculated for each read (single- or paired-end) and then summed or averaged over all reads. STAR counts a paired-end read as one read. Most of the information is collected for the UNIQUE mappers. Each splice is counted in the number of splices, so the total corresponds to the sum of the counts in SJ.out.tab. The mismatch/indel error rates are calculated per base, i.e. as the total number of mismatches/indels in all unique mappers divided by the total number of mapped bases.
cat ~/LSLNGS2015/RNASEQ_data/star_GM12878_rep1/Log.final.out
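For quality control across several samples it is often handier to pull out single values than to read the whole report. The sketch below assumes that each statistic in Log.final.out is written as a metric name, a `|` character and the value; `star_metric` is a hypothetical helper name, and the metric names passed to it must match the names shown by the cat command above.

```shell
# Hypothetical helper: print the value of one metric from a Log.final.out file.
# $1 = metric name exactly as it appears in the file, $2 = path to Log.final.out
star_metric() {
    awk -F'|' -v name="$1" '
        {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim whitespace around the name
            gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim whitespace around the value
            if (key == name) print val
        }' "$2"
}

log=~/LSLNGS2015/RNASEQ_data/star_GM12878_rep1/Log.final.out
if [ -f "$log" ]; then
    star_metric "Uniquely mapped reads %" "$log"
    star_metric "Number of splices: Total" "$log"
fi
```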
Splice junctions
SJ.out.tab contains high-confidence collapsed splice junctions in tab-delimited format. STAR defines the junction start/end as intronic bases, whereas other software may define them as exonic bases. The columns have the following meaning:
| Column | Description |
| --- | --- |
| 1 | chromosome |
| 2 | first base of the intron (1-based) |
| 3 | last base of the intron (1-based) |
| 4 | strand (0: undefined, 1: +, 2: -) |
| 5 | intron motif (0: non-canonical, 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT) |
| 6 | 0: unannotated, 1: annotated (only if splice junctions database is used) |
| 7 | number of uniquely mapping reads crossing the junction |
| 8 | number of multi-mapping reads crossing the junction |
| 9 | maximum spliced alignment overhang |
head ~/LSLNGS2015/RNASEQ_data/star_GM12878_rep1/SJ.out.tab
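Because SJ.out.tab is a plain tab-delimited file, the column definitions above translate directly into awk filters. The following is a sketch with two hypothetical helpers: `novel_junctions` keeps unannotated junctions (column 6 is 0) with a listed intron motif (column 5 is not 0) and a minimum number of uniquely mapping reads (column 7), and `exonic_flanks` shifts STAR's intronic coordinates to the neighbouring exonic bases. The default threshold of 5 reads is an arbitrary choice for illustration.

```shell
# Hypothetical helper: unannotated junctions with a listed motif and enough unique reads.
# $1 = path to SJ.out.tab, $2 = minimum uniquely mapping reads (default 5, arbitrary)
novel_junctions() {
    awk -F'\t' -v min="${2:-5}" '$6 == 0 && $5 > 0 && $7 >= min' "$1"
}

# Hypothetical helper: chromosome, last exonic base before the intron,
# first exonic base after the intron (all 1-based).
exonic_flanks() {
    awk -F'\t' -v OFS='\t' '{ print $1, $2 - 1, $3 + 1 }' "$1"
}

sj=~/LSLNGS2015/RNASEQ_data/star_GM12878_rep1/SJ.out.tab
if [ -f "$sj" ]; then
    novel_junctions "$sj" 5 | head
    exonic_flanks "$sj" | head
fi
```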