Alignment-based method
STAR alignment
Per-sample 2-pass mapping is enabled with
--twopassMode Basic
and the--sjdbOverhang
option is set to 150 (the same value used to generate genome index here)Alignment is run with 6 threads
--runThreadN 6
The
--quantMode TranscriptomeSAM
option generates alignments translated into transcript coordinates in theAligned.toTranscriptome.out.bam
file necessary for downstream quantification using the Salmon/RSEM workflowThe
--quantMode GeneCounts
option allows STAR to count the number reads per gene while mapping and outputs theReadsPerGene.out.tab
count tableWith
--quantMode TranscriptomeSAM GeneCounts
, STAR produces both theAligned.toTranscriptome.out.bam
andReadsPerGene.out.tab
outputsThe
--quantTranscriptomeBan IndelSoftclipSingleend
option (default) satisfies RSEM requirements, i.e. soft-clipping or indels are not allowed. However, it can be changed to--quantTranscriptomeBan Singleend
when using other quantification software such as Salmon and eXpress.Detailed descriptions of all the parameters and options are available in the STARmanual.pdf
$ cd /home/USER/SSAPs
$ mkdir star
$ declare -a runname=("ERR2675454" "ERR2675455" "ERR2675458" "ERR2675459" "ERR2675460" "ERR2675461" "ERR2675464" "ERR2675465" "ERR2675468" "ERR2675469" "ERR2675472" "ERR2675473" "ERR2675476" "ERR2675477" "ERR2675478" "ERR2675479" "ERR2675480" "ERR2675481" "ERR2675484" "ERR2675485")
# Modified from ENCODE-DCC/rna-seq-pipeline; https://github.com/ENCODE-DCC/rna-seq-pipeline
for id in ${runname[@]}; do
trim1=trimmed/${id}_1.fastq.gz
trim2=trimmed/${id}_2.fastq.gz
out=star/${id}_
STAR --genomeDir /home/USER/db/refanno/gencode.v33_star-2.7.3a \
--readFilesIn $trim1 $trim2 \
--outFileNamePrefix $out \
--readFilesCommand zcat \
--runThreadN 6 \
--genomeLoad NoSharedMemory \
--twopassMode Basic \
--sjdbGTFfile /home/USER/db/refanno/gencode.v33.annotation.gtf \
--sjdbScore 2 \
--sjdbOverhang 150 \
--limitSjdbInsertNsj 1000000 \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--outSAMunmapped Within \
--outFilterType BySJout \
--outSAMattributes NH HI AS NM MD \
--outSAMtype BAM SortedByCoordinate \
--quantMode TranscriptomeSAM GeneCounts \
--quantTranscriptomeBan IndelSoftclipSingleend \
--limitBAMsortRAM 32000000000
done
rm -R star/*_STARgenome star/*_STARpass1
The percentaged of uniquely mapped reads for ERR2675454 is around 89.5%, with multi-mappers contributing to about 8.6% of the alignment
Started job on | Apr 29 12:03:14
Started mapping on | Apr 29 12:19:15
Finished on | Apr 29 12:44:10
Mapping speed, Million of reads per hour | 71.90
Number of input reads | 29857721
Average input read length | 260
UNIQUE READS:
Uniquely mapped reads number | 26729034
Uniquely mapped reads % | 89.52%
Average mapped length | 260.28
Number of splices: Total | 23442395
Number of splices: Annotated (sjdb) | 23441965
Number of splices: GT/AG | 23204758
Number of splices: GC/AG | 196371
Number of splices: AT/AC | 21311
Number of splices: Non-canonical | 19955
Mismatch rate per base, % | 0.23%
Deletion rate per base | 0.01%
Deletion average length | 1.74
Insertion rate per base | 0.01%
Insertion average length | 1.58
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 2553618
% of reads mapped to multiple loci | 8.55%
Number of reads mapped to too many loci | 5120
% of reads mapped to too many loci | 0.02%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 517770
% of reads unmapped: too short | 1.73%
Number of reads unmapped: other | 52179
% of reads unmapped: other | 0.17%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
With --quantMode GeneCounts
, STAR outputs read counts per gene into ReadsPerGene.out.tab
file with 4 columns which correspond to different strandedness options:
column 1: gene ID
column 2: counts for unstranded RNA-seq (htseq-count option
--stranded no
)column 3: counts for the 1st read strand aligned with RNA (htseq-count option
--stranded yes
)column 4: counts for the 2nd read strand aligned with RNA (htseq-count option
--stranded reverse
)
N_unmapped 639576 639576 639576
N_multimapping 2553618 2553618 2553618
N_noFeature 1190394 17568512 9176084
N_ambiguous 2406991 452908 854751
ENSG00000223972.5 0 1 0
ENSG00000227232.5 38 7 32
ENSG00000278267.1 0 0 0
ENSG00000243485.5 0 0 0
ENSG00000284332.1 0 0 0
ENSG00000237613.2 0 0 0
STAR-Salmon Quantification (alignment-based mode)
Alignment with STAR to the target genome, followed by quantification using Salmon with 6 threads
$ cd /home/USER/SSAPs
$ declare -a runname=("ERR2675454" "ERR2675455" "ERR2675458" "ERR2675459" "ERR2675460" "ERR2675461" "ERR2675464" "ERR2675465" "ERR2675468" "ERR2675469" "ERR2675472" "ERR2675473" "ERR2675476" "ERR2675477" "ERR2675478" "ERR2675479" "ERR2675480" "ERR2675481" "ERR2675484" "ERR2675485")
for id in ${runname[@]}; do
bam=star/${id}_Aligned.toTranscriptome.out.bam
salmon quant --threads 6 \
--targets /home/USER/db/refanno/gencode.v33.transcripts.fa \
--gencode \
--libType ISR \
--output salmon_star/$id \
--alignments $bam
done
Name Length EffectiveLength TPM NumReads
ENST00000456328.2 1657 1492.594 0.000000 0.000
ENST00000450305.2 632 467.767 0.000000 0.000
ENST00000488147.1 1351 1186.594 4.628079 111.707
ENST00000619216.1 68 9.146 0.000000 0.000
ENST00000473358.1 712 547.720 0.000000 0.000
ENST00000469289.1 535 371.066 0.000000 0.000
ENST00000607096.1 138 26.122 0.000000 0.000
ENST00000417324.1 1187 1022.594 0.000000 0.000
ENST00000461467.1 590 425.837 0.000000 0.000
STAR-RSEM Quantification
Alignment with STAR to the target genome, followed by quantification using RSEM with 6 threads. The --strandedness reverse
was used to allow RSEM to quantify with the correct strandedness setting
$ cd /home/USER/SSAPs
$ mkdir rsem_star
$ declare -a runname=("ERR2675454" "ERR2675455" "ERR2675458" "ERR2675459" "ERR2675460" "ERR2675461" "ERR2675464" "ERR2675465" "ERR2675468" "ERR2675469" "ERR2675472" "ERR2675473" "ERR2675476" "ERR2675477" "ERR2675478" "ERR2675479" "ERR2675480" "ERR2675481" "ERR2675484" "ERR2675485")
for id in ${runname[@]}; do
bam=star/${id}_Aligned.toTranscriptome.out.bam
out=rsem_star/${id}
log=rsem_star/${id}.log
rsem-calculate-expression -p 6 --paired-end --alignments \
--estimate-rspd \
--calc-ci \
--ci-memory 32000 \
--seed 123456 \
--no-bam-output \
--strandedness reverse \
$bam /home/USER/db/refanno/gencode.v33_rsem-1.3.3 $out > $log
done
# RSEM parameters
--paired-end - input reads are paired-end reads
--alignments - input file contains alignments in SAM/BAM/CRAM format
--estimate-rspd - estimate the read start position distribution from data
--calc-ci - calculate 95% credibility intervals (CI) and posterior mean estimates (PME)
--ci-memory 32000 - maximum memory (MB) of the auxiliary buffer used for computing CI
--seed 123456 - set the seed for the random number generators used in calculating PME and CI
--no-bam-output - do not output any BAM file
--strandedness reverse - defines the strandedness of the RNA-Seq reads
gene_id transcript_id(s) length effective_length expected_count TPM FPKM posterior_mean_count
ENSG00000000003.15 ENST00000373020.9,ENST00000494424.1,ENST00000496771.5,ENST00000612152.4,ENST00000614008.4 2632.85 2461.28 991.00 29.20 23.82 991.00
ENSG00000000005.6 ENST00000373031.5,ENST00000485971.1 995.15 823.61 34.00 2.99 2.44 34.00
ENSG00000000419.12 ENST00000371582.8,ENST00000371584.8,ENST00000371588.9,ENST00000413082.1,ENST00000466152.5,ENST00000494752.1 1078.11 906.55 687.00 54.95 44.82 687.00
ENSG00000000457.14 ENST00000367770.5,ENST00000367771.11,ENST00000367772.8,ENST00000423670.1,ENST00000470238.1 4468.29 4296.72 334.00 5.64 4.60 334.00
ENSG00000000460.17 ENST00000286031.10,ENST00000359326.9,ENST00000413811.3,ENST00000459772.5,ENST00000466580.6,ENST00000472795.5,ENST00000481744.5,ENST00000496973.5,ENST00000498289.5 3103.44 2931.88 165.00 4.08 3.33 165.00
ENSG00000000938.13 ENST00000374003.7,ENST00000374004.5,ENST00000374005.8,ENST00000399173.5,ENST00000457296.5,ENST00000468038.1,ENST00000475472.5 2033.32 1861.76 36.00 1.40 1.14 36.00
ENSG00000000971.16 ENST00000359637.2,ENST00000367429.9,ENST00000466229.5,ENST00000470918.1,ENST00000496761.1,ENST00000630130.2 3484.05 3312.48 621.84 13.61 11.10 621.57
ENSG00000001036.14 ENST00000002165.11,ENST00000367585.1,ENST00000451668.1 2356.60 2185.03 1508.00 50.05 40.82 1508.00
ENSG00000001084.13 ENST00000504353.1,ENST00000504525.1,ENST00000505197.1,ENST00000505294.5,ENST00000509541.5,ENST00000510837.5,ENST00000513939.6,ENST00000514004.5,ENST00000514373.3,ENST00000514933.2,ENST00000515580.1,ENST00000616923.5,ENST00000643939.1,ENST00000650454.1 2835.97 2664.41 887.00 24.14 19.69 887.00
transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct posterior_mean_count
ENST00000373020.9 ENSG00000000003.15 3768 3596.44 874.43 17.63 14.38 60.39 872.33
ENST00000494424.1 ENSG00000000003.15 820 648.44 0.00 0.00 0.00 0.00 1.26
ENST00000496771.5 ENSG00000000003.15 1025 853.44 2.66 0.23 0.18 0.77 4.23
ENST00000612152.4 ENSG00000000003.15 3796 3624.44 0.00 0.00 0.00 0.00 2.21
ENST00000614008.4 ENSG00000000003.15 900 728.44 113.92 11.34 9.25 38.84 110.97
ENST00000373031.5 ENSG00000000005.6 1205 1033.44 29.16 2.05 1.67 68.35 28.75
ENST00000485971.1 ENSG00000000005.6 542 370.52 4.84 0.95 0.77 31.65 5.25
ENST00000371582.8 ENSG00000000419.12 1161 989.44 26.45 1.94 1.58 3.53 25.05
ENST00000371584.8 ENSG00000000419.12 1073 901.44 18.00 1.45 1.18 2.64 20.08
Last updated
Was this helpful?