De novo assembly of RNA-Seq reads

De novo assembly of reads

First, we combine the raw reads from different conditions into a single FASTQ file (for each end) and use Trinity to generate a reference assembly.
* This assumes the fq.gz files have already been placed in the ~/LSLNGS2015/Trinity/RNASEQ_data folder. If the folder does not exist yet, create it first:
mkdir -p ~/LSLNGS2015/Trinity/RNASEQ_data/
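Before combining the reads, it can help to confirm that every left-end file has a right-end mate with the same number of reads, since Trinity expects the two ends to correspond. The following is a minimal sketch, not part of the original workflow: count_reads and check_pairs are hypothetical helper names, and the sketch assumes standard 4-line FASTQ records and the *.left.fq.gz / *.right.fq.gz naming used by the commands on this page.

```shell
# Hypothetical helper: number of reads in a gzipped FASTQ file
# (assumes 4 lines per read; gzip -dc is equivalent to zcat).
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Hypothetical helper: check that each *.left.fq.gz in folder $1 has a
# *.right.fq.gz mate with the same read count. Returns 0 if all pairs match.
check_pairs() {
    dir=$1
    status=0
    for left in "$dir"/*.left.fq.gz; do
        # An unmatched glob stays literal, so test that the file exists.
        if [ ! -e "$left" ]; then
            echo "no *.left.fq.gz files in $dir" >&2
            return 1
        fi
        right="${left%.left.fq.gz}.right.fq.gz"
        if [ ! -f "$right" ]; then
            echo "missing mate for $left" >&2
            status=1
        elif [ "$(count_reads "$left")" != "$(count_reads "$right")" ]; then
            echo "read counts differ: $left vs $right" >&2
            status=1
        fi
    done
    return $status
}
```

Calling check_pairs ~/LSLNGS2015/Trinity/RNASEQ_data prints nothing and returns 0 when every pair matches. Any problem is reported on standard error.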

Execute

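cd ~/LSLNGS2015/Trinity

# Combine the per-condition reads into one file per end. The output goes to a
# temporary name first, because sp.left.fq.gz itself matches *.left.fq.gz and
# would otherwise be read as input (for example when re-running this step).
rm -f RNASEQ_data/sp.left.fq.gz RNASEQ_data/sp.right.fq.gz
zcat RNASEQ_data/*.left.fq.gz | gzip > RNASEQ_data/sp.left.tmp && mv RNASEQ_data/sp.left.tmp RNASEQ_data/sp.left.fq.gz
zcat RNASEQ_data/*.right.fq.gz | gzip > RNASEQ_data/sp.right.tmp && mv RNASEQ_data/sp.right.tmp RNASEQ_data/sp.right.fq.gz

# Submit the Trinity assembly as a job named Trinity on the 16G queue.
bsub -q 16G -o ./trinity_reference.std -e ./trinity_reference.err -J Trinity \
"Trinity --seqType fq --SS_lib_type RF \
--left RNASEQ_data/sp.left.fq.gz --right RNASEQ_data/sp.right.fq.gz \
--CPU 10 --max_memory 16G --output trinity_reference >& trinity_reference.log"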
You will see a message like "Job <xxxxxx> is submitted to queue <16G>.", which confirms that the submission succeeded.
You can use bjobs to check the status of your job:
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
993732 s00ycm0 RUN 16G alps1 8*alps1-40 Trinity Nov 18 22:38
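Rather than re-running bjobs by hand, the status check can be wrapped in a small function. This is only a sketch: job_is_active is a hypothetical helper name, and it assumes that STAT is the third column of the bjobs listing (as in the output above) and that RUN and PEND are the states of interest.

```shell
# Hypothetical helper: succeed while bjobs lists a job named $1 whose
# STAT column (3rd field) is RUN or PEND. Messages that bjobs writes to
# standard error when no job matches are discarded.
job_is_active() {
    bjobs -J "$1" 2>/dev/null |
        awk 'NR > 1 && ($3 == "RUN" || $3 == "PEND") { found = 1 }
             END { exit !found }'
}

# Example use (commented out so that nothing is polled on sourcing):
#   while job_is_active Trinity; do sleep 60; done
#   tail -n 5 trinity_reference.log
```

The polling interval of 60 seconds is arbitrary. The log file name is the one passed to the Trinity command in the bsub call above.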

Resource usage

| ALPS Queue Name | CPU Time | Max Memory | Duration |
|---|---|---|---|
| 16G | 2097.24 sec. | 2 GB | 6 minutes 37 seconds |
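The CPU time is larger than the duration because Trinity runs several threads at once. As a rough worked example, and assuming that Duration is wall-clock time and CPU Time is summed over all threads, the average number of busy cores is CPU time divided by duration. The function name avg_cores is hypothetical.

```shell
# Hypothetical helper: average busy cores = CPU seconds / wall-clock seconds.
# $1: CPU time in seconds, $2 and $3: minutes and seconds of the duration.
avg_cores() {
    awk -v cpu="$1" -v min="$2" -v sec="$3" \
        'BEGIN { printf "%.2f\n", cpu / (min * 60 + sec) }'
}

# Values from the resource usage reported for this run.
avg_cores 2097.24 6 37   # prints 5.28
```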
The assembled transcripts can be found at trinity_reference/Trinity.fasta.
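A quick way to confirm that the assembly produced output is to count the FASTA records. The sketch below also counts distinct 'genes' under an assumption that this page does not confirm: that each Trinity transcript ID ends in _i followed by an isoform number, and that removing this suffix yields the gene ID. The function names count_transcripts and count_genes are hypothetical.

```shell
# Hypothetical helper: number of FASTA records (one '>' header per transcript).
count_transcripts() {
    grep -c '^>' "$1"
}

# Hypothetical helper: number of distinct gene IDs, ASSUMING transcript IDs
# end in _i<N> and that stripping this suffix gives the gene ID.
count_genes() {
    grep '^>' "$1" |
        awk '{ print $1 }' |        # keep only the ID, drop the description
        sed 's/_i[0-9]*$//' |       # remove the isoform suffix
        sort -u |
        awk 'END { print NR }'
}
```

For example, count_transcripts trinity_reference/Trinity.fasta should agree with the transcript total reported by TrinityStats.pl for the same file.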

Examine assembly stats

The TrinityStats.pl script reports basic statistics about the assembly produced by Trinity. Your numbers may differ slightly from those shown here, because the assembly results are not deterministic.

Execute

Locate util/TrinityStats.pl in the trinityrnaseq-2.2.0 distribution and run it on the assembly, replacing PATH_TO_TRINITY with the directory where Trinity is installed:
PATH_TO_TRINITY/util/TrinityStats.pl trinity_reference/Trinity.fasta

Output

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 377
Total trinity transcripts: 385
Percent GC: 38.65

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 3373
Contig N20: 2605
Contig N30: 2219
Contig N40: 1936
Contig N50: 1703

Median contig length: 772
Average contig: 1046.43
Total assembled bases: 402875


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

Contig N10: 3373
Contig N20: 2605
Contig N30: 2219
Contig N40: 1936
Contig N50: 1682

Median contig length: 765
Average contig: 1038.27
Total assembled bases: 391428
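To make the Contig N50 figure concrete: under the usual definition, contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of all assembled bases. The sketch below applies that definition to a FASTA file. The function name fasta_n50 is hypothetical, and TrinityStats.pl may handle edge cases differently.

```shell
# Hypothetical helper: N50 of the sequences in FASTA file $1, using the
# common definition (not necessarily identical to TrinityStats.pl).
fasta_n50() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }  # new record: emit previous length
             { len += length($0) }                      # sequence line: add its length
        END  { if (len > 0) print len }                 # emit the last record
    ' "$1" |
    sort -nr |                                          # longest contig first
    awk '
        { lens[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                cum += lens[i]
                # First contig at which the running sum reaches half the total.
                if (cum >= total / 2) { print lens[i]; exit }
            }
        }
    '
}
```

For contigs of length 10, 8, 6, 4 and 2 the total is 30 bases. The running sum reaches 15 at the second contig, so the N50 is 8.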