Alignment-based method

STAR alignment

  • Per-sample 2-pass mapping is enabled with --twopassMode Basic and the --sjdbOverhang option is set to 150 (the same value used to generate genome index here)

  • Alignment is run with 6 threads --runThreadN 6

  • The --quantMode TranscriptomeSAM option generates alignments translated into transcript coordinates in the Aligned.toTranscriptome.out.bam file necessary for downstream quantification using the Salmon/RSEM workflow

  • The --quantMode GeneCounts option allows STAR to count the number reads per gene while mapping and outputs the ReadsPerGene.out.tab count table

  • With --quantMode TranscriptomeSAM GeneCounts, STAR produces both the Aligned.toTranscriptome.out.bam and ReadsPerGene.out.tab outputs

  • The --quantTranscriptomeBan IndelSoftclipSingleend option (default) satisfies RSEM requirements, i.e. soft-clipping or indels are not allowed. However, it can be changed to--quantTranscriptomeBan Singleend when using other quantification software such as Salmon and eXpress.

  • Detailed descriptions of all the parameters and options are available in the STARmanual.pdf

$ cd /home/USER/SSAPs
$ mkdir star

$ declare -a runname=("ERR2675454" "ERR2675455" "ERR2675458" "ERR2675459" "ERR2675460" "ERR2675461" "ERR2675464" "ERR2675465" "ERR2675468" "ERR2675469" "ERR2675472" "ERR2675473" "ERR2675476" "ERR2675477" "ERR2675478" "ERR2675479" "ERR2675480" "ERR2675481" "ERR2675484" "ERR2675485")

# Modified from ENCODE-DCC/rna-seq-pipeline; https://github.com/ENCODE-DCC/rna-seq-pipeline
for id in ${runname[@]}; do
        trim1=trimmed/${id}_1.fastq.gz
        trim2=trimmed/${id}_2.fastq.gz
        out=star/${id}_

        STAR --genomeDir /home/USER/db/refanno/gencode.v33_star-2.7.3a \
        --readFilesIn $trim1 $trim2 \
        --outFileNamePrefix $out \
        --readFilesCommand zcat \
        --runThreadN 6 \
        --genomeLoad NoSharedMemory \
        --twopassMode Basic \
        --sjdbGTFfile /home/USER/db/refanno/gencode.v33.annotation.gtf \
        --sjdbScore 2 \
        --sjdbOverhang 150 \
        --limitSjdbInsertNsj 1000000 \
        --outFilterMultimapNmax 20 \
        --alignSJoverhangMin 8 \
        --alignSJDBoverhangMin 1 \
        --outFilterMismatchNmax 999 \
        --outFilterMismatchNoverReadLmax 0.04 \
        --alignIntronMin 20 \
        --alignIntronMax 1000000 \
        --alignMatesGapMax 1000000 \
        --outSAMunmapped Within \
        --outFilterType BySJout \
        --outSAMattributes NH HI AS NM MD \
        --outSAMtype BAM SortedByCoordinate \
        --quantMode TranscriptomeSAM GeneCounts \
        --quantTranscriptomeBan IndelSoftclipSingleend \
        --limitBAMsortRAM 32000000000
done
rm -R star/*_STARgenome star/*_STARpass1

The percentaged of uniquely mapped reads for ERR2675454 is around 89.5%, with multi-mappers contributing to about 8.6% of the alignment

star/ERR2675454_Log.final.out
                                 Started job on |       Apr 29 12:03:14
                             Started mapping on |       Apr 29 12:19:15
                                    Finished on |       Apr 29 12:44:10
       Mapping speed, Million of reads per hour |       71.90

                          Number of input reads |       29857721
                      Average input read length |       260
                                    UNIQUE READS:
                   Uniquely mapped reads number |       26729034
                        Uniquely mapped reads % |       89.52%
                          Average mapped length |       260.28
                       Number of splices: Total |       23442395
            Number of splices: Annotated (sjdb) |       23441965
                       Number of splices: GT/AG |       23204758
                       Number of splices: GC/AG |       196371
                       Number of splices: AT/AC |       21311
               Number of splices: Non-canonical |       19955
                      Mismatch rate per base, % |       0.23%
                         Deletion rate per base |       0.01%
                        Deletion average length |       1.74
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.58
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       2553618
             % of reads mapped to multiple loci |       8.55%
        Number of reads mapped to too many loci |       5120
             % of reads mapped to too many loci |       0.02%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |       0
       % of reads unmapped: too many mismatches |       0.00%
            Number of reads unmapped: too short |       517770
                 % of reads unmapped: too short |       1.73%
                Number of reads unmapped: other |       52179
                     % of reads unmapped: other |       0.17%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

With --quantMode GeneCounts, STAR outputs read counts per gene into ReadsPerGene.out.tab file with 4 columns which correspond to different strandedness options:

  • column 1: gene ID

  • column 2: counts for unstranded RNA-seq (htseq-count option --stranded no)

  • column 3: counts for the 1st read strand aligned with RNA (htseq-count option --stranded yes)

  • column 4: counts for the 2nd read strand aligned with RNA (htseq-count option --stranded reverse)

star/ERR2675454_ReadsPerGene.out.tab
N_unmapped	639576	639576	639576
N_multimapping	2553618	2553618	2553618
N_noFeature	1190394	17568512	9176084
N_ambiguous	2406991	452908	854751
ENSG00000223972.5	0	1	0
ENSG00000227232.5	38	7	32
ENSG00000278267.1	0	0	0
ENSG00000243485.5	0	0	0
ENSG00000284332.1	0	0	0
ENSG00000237613.2	0	0	0

STAR-Salmon Quantification (alignment-based mode)

Alignment with STAR to the target genome, followed by quantification using Salmon with 6 threads

$ cd /home/USER/SSAPs

$ declare -a runname=("ERR2675454" "ERR2675455" "ERR2675458" "ERR2675459" "ERR2675460" "ERR2675461" "ERR2675464" "ERR2675465" "ERR2675468" "ERR2675469" "ERR2675472" "ERR2675473" "ERR2675476" "ERR2675477" "ERR2675478" "ERR2675479" "ERR2675480" "ERR2675481" "ERR2675484" "ERR2675485")

for id in ${runname[@]}; do
        bam=star/${id}_Aligned.toTranscriptome.out.bam
        
        salmon quant --threads 6 \
        --targets /home/USER/db/refanno/gencode.v33.transcripts.fa \
        --gencode \
        --libType ISR \
        --output salmon_star/$id \
        --alignments $bam
done
salmon_star/ERR2675454/quant.sf
Name	Length	EffectiveLength	TPM	NumReads
ENST00000456328.2	1657	1492.594	0.000000	0.000
ENST00000450305.2	632	467.767	0.000000	0.000
ENST00000488147.1	1351	1186.594	4.628079	111.707
ENST00000619216.1	68	9.146	0.000000	0.000
ENST00000473358.1	712	547.720	0.000000	0.000
ENST00000469289.1	535	371.066	0.000000	0.000
ENST00000607096.1	138	26.122	0.000000	0.000
ENST00000417324.1	1187	1022.594	0.000000	0.000
ENST00000461467.1	590	425.837	0.000000	0.000

STAR-RSEM Quantification

Alignment with STAR to the target genome, followed by quantification using RSEM with 6 threads. The --strandedness reverse was used to allow RSEM to quantify with the correct strandedness setting

$ cd /home/USER/SSAPs
$ mkdir rsem_star

$ declare -a runname=("ERR2675454" "ERR2675455" "ERR2675458" "ERR2675459" "ERR2675460" "ERR2675461" "ERR2675464" "ERR2675465" "ERR2675468" "ERR2675469" "ERR2675472" "ERR2675473" "ERR2675476" "ERR2675477" "ERR2675478" "ERR2675479" "ERR2675480" "ERR2675481" "ERR2675484" "ERR2675485")

for id in ${runname[@]}; do
        bam=star/${id}_Aligned.toTranscriptome.out.bam
        out=rsem_star/${id}
        log=rsem_star/${id}.log

        rsem-calculate-expression -p 6 --paired-end --alignments \
        --estimate-rspd \
        --calc-ci \
        --ci-memory 32000 \
        --seed 123456 \
        --no-bam-output \
        --strandedness reverse \
        $bam /home/USER/db/refanno/gencode.v33_rsem-1.3.3 $out > $log
done
# RSEM parameters
--paired-end      - input reads are paired-end reads
--alignments      - input file contains alignments in SAM/BAM/CRAM format
--estimate-rspd   - estimate the read start position distribution from data
--calc-ci         - calculate 95% credibility intervals (CI) and posterior mean estimates (PME)
--ci-memory 32000 - maximum memory (MB) of the auxiliary buffer used for computing CI
--seed 123456     - set the seed for the random number generators used in calculating PME and CI
--no-bam-output   - do not output any BAM file
--strandedness reverse - defines the strandedness of the RNA-Seq reads
cut -f1-8 rsem_star/ERR2675454.genes.results | head
gene_id	transcript_id(s)	length	effective_length	expected_count	TPM	FPKM	posterior_mean_count
ENSG00000000003.15	ENST00000373020.9,ENST00000494424.1,ENST00000496771.5,ENST00000612152.4,ENST00000614008.4	2632.85	2461.28	991.00	29.20	23.82	991.00
ENSG00000000005.6	ENST00000373031.5,ENST00000485971.1	995.15	823.61	34.00	2.99	2.44	34.00
ENSG00000000419.12	ENST00000371582.8,ENST00000371584.8,ENST00000371588.9,ENST00000413082.1,ENST00000466152.5,ENST00000494752.1	1078.11	906.55	687.00	54.95	44.82	687.00
ENSG00000000457.14	ENST00000367770.5,ENST00000367771.11,ENST00000367772.8,ENST00000423670.1,ENST00000470238.1	4468.29	4296.72	334.00	5.64	4.60	334.00
ENSG00000000460.17	ENST00000286031.10,ENST00000359326.9,ENST00000413811.3,ENST00000459772.5,ENST00000466580.6,ENST00000472795.5,ENST00000481744.5,ENST00000496973.5,ENST00000498289.5	3103.44	2931.88	165.00	4.08	3.33	165.00
ENSG00000000938.13	ENST00000374003.7,ENST00000374004.5,ENST00000374005.8,ENST00000399173.5,ENST00000457296.5,ENST00000468038.1,ENST00000475472.5	2033.32	1861.76	36.00	1.40	1.14	36.00
ENSG00000000971.16	ENST00000359637.2,ENST00000367429.9,ENST00000466229.5,ENST00000470918.1,ENST00000496761.1,ENST00000630130.2	3484.05	3312.48	621.84	13.61	11.10	621.57
ENSG00000001036.14	ENST00000002165.11,ENST00000367585.1,ENST00000451668.1	2356.60	2185.03	1508.00	50.05	40.82	1508.00
ENSG00000001084.13	ENST00000504353.1,ENST00000504525.1,ENST00000505197.1,ENST00000505294.5,ENST00000509541.5,ENST00000510837.5,ENST00000513939.6,ENST00000514004.5,ENST00000514373.3,ENST00000514933.2,ENST00000515580.1,ENST00000616923.5,ENST00000643939.1,ENST00000650454.1	2835.97	2664.41	887.00	24.14	19.69	887.00
cut -f1-9 rsem_star/ERR2675454.isoforms.results | head
transcript_id	gene_id	length	effective_length	expected_count	TPM	FPKM	IsoPct	posterior_mean_count
ENST00000373020.9	ENSG00000000003.15	3768	3596.44	874.43	17.63	14.38	60.39	872.33
ENST00000494424.1	ENSG00000000003.15	820	648.44	0.00	0.00	0.00	0.00	1.26
ENST00000496771.5	ENSG00000000003.15	1025	853.44	2.66	0.23	0.18	0.77	4.23
ENST00000612152.4	ENSG00000000003.15	3796	3624.44	0.00	0.00	0.00	0.00	2.21
ENST00000614008.4	ENSG00000000003.15	900	728.44	113.92	11.34	9.25	38.84	110.97
ENST00000373031.5	ENSG00000000005.6	1205	1033.44	29.16	2.05	1.67	68.35	28.75
ENST00000485971.1	ENSG00000000005.6	542	370.52	4.84	0.95	0.77	31.65	5.25
ENST00000371582.8	ENSG00000000419.12	1161	989.44	26.45	1.94	1.58	3.53	25.05
ENST00000371584.8	ENSG00000000419.12	1073	901.44	18.00	1.45	1.18	2.64	20.08

Last updated