# De novo assembly of RNA-Seq reads

## *De novo* assembly of reads

First, we combine the raw reads from the different conditions into a single FASTQ file per read end (left and right), then use Trinity to generate a **reference** assembly.

\* This assumes the `fq.gz` files have already been placed in the `Trinity/RNASEQ_data` folder. If the folder does not exist yet, create it and copy the files in first:

```
mkdir -p ~/LSLNGS2015/Trinity/RNASEQ_data/
```

### Execute

```
cd ~/LSLNGS2015/Trinity

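# Merge the per-condition reads into one file per end.
# The merged file names also match the input globs, so remove any earlier
# copies first to avoid reading them back in on a re-run.
rm -f RNASEQ_data/sp.left.fq.gz RNASEQ_data/sp.right.fq.gz
zcat RNASEQ_data/*.left.fq.gz | gzip > RNASEQ_data/sp.left.fq.gz
zcat RNASEQ_data/*.right.fq.gz | gzip > RNASEQ_data/sp.right.fq.gz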

bsub -q 16G -o ./trinity_reference.std -e ./trinity_reference.err -J Trinity \
"Trinity --seqType fq --SS_lib_type RF \
--left RNASEQ_data/sp.left.fq.gz --right RNASEQ_data/sp.right.fq.gz \
--CPU 10 --max_memory 16G --output trinity_reference >& trinity_reference.log"
```
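Because Trinity is given paired-end input, the merged left and right files should contain the same number of reads. The following is a minimal sketch of such a sanity check, assuming standard four-line FASTQ records; the helper names `count_reads` and `check_pair` are illustrative and are not part of Trinity.

```shell
# Count reads in a gzipped FASTQ file (assumes 4 lines per record).
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Print both read counts; the exit status is non-zero when they differ.
check_pair() {
    local left right
    left=$(count_reads "$1")
    right=$(count_reads "$2")
    echo "left=${left} right=${right}"
    [ "$left" -eq "$right" ]
}

# Example with the merged files from this tutorial:
# check_pair RNASEQ_data/sp.left.fq.gz RNASEQ_data/sp.right.fq.gz
```

A mismatch usually points to a truncated file or to a left/right file that was missed by the merge.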

If the submission succeeds, `bsub` prints a message like `Job <xxxxxx> is submitted to queue <16G>.`

Use `bjobs` to check the status of your job:

```
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
993732  s00ycm0 RUN   16G        alps1       8*alps1-40  Trinity    Nov 18 22:38
```
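When the status is needed in a script rather than on screen, it can be pulled out of that listing. This is a minimal sketch, assuming the default `bjobs` column layout shown above (status in the third column) and a job submitted with `-J Trinity` as in this tutorial; `job_stat` is a hypothetical helper name.

```shell
# Print the STAT column (3rd field) of the first job row
# from bjobs output supplied on standard input.
job_stat() {
    awk 'NR == 2 { print $3 }'
}

# Example on the cluster (requires LSF):
# bjobs -J Trinity | job_stat
```

Reading from standard input keeps the helper independent of the scheduler, so it can be tried on saved `bjobs` output as well.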

### Resource usage

| ALPS Queue Name | CPU Time     | Max Memory | Duration             |
| --------------- | ------------ | ---------- | -------------------- |
| 16G             | 2097.24 sec. | 2 GB       | 6 minutes 37 seconds |

The assembled transcripts can be found at `trinity_reference/Trinity.fasta`.
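A quick way to confirm the file was written and to see how many transcripts it holds is to count the FASTA header lines. This is a small sketch; `count_transcripts` is an illustrative helper name, not a Trinity utility.

```shell
# Count FASTA records: every record starts with a '>' header line.
count_transcripts() {
    grep -c '^>' "$1"
}

# Example with the assembly from this tutorial:
# count_transcripts trinity_reference/Trinity.fasta
```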

## Examine assembly stats

The `TrinityStats.pl` script reports basic statistics about the assembly produced by Trinity. Your numbers may differ slightly from those shown here because Trinity assembly results are not deterministic.

### Execute

Locate `util/TrinityStats.pl` in the trinityrnaseq-2.2.0 distribution and run it on the assembly, replacing `PATH_TO_TRINITY` with the directory of your Trinity installation:

`PATH_TO_TRINITY/util/TrinityStats.pl trinity_reference/Trinity.fasta`

### Output

```
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  377
Total trinity transcripts:      385
Percent GC: 38.65

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 3373
        Contig N20: 2605
        Contig N30: 2219
        Contig N40: 1936
        Contig N50: 1703

        Median contig length: 772
        Average contig: 1046.43
        Total assembled bases: 402875


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 3373
        Contig N20: 2605
        Contig N30: 2219
        Contig N40: 1936
        Contig N50: 1682

        Median contig length: 765
        Average contig: 1038.27
        Total assembled bases: 391428
```
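The N50 value in the report is the contig length at which contigs of that length or longer contain at least half of all assembled bases. The sketch below recomputes it from a FASTA file using that standard definition; it is not the code used by `TrinityStats.pl`, and `n50` is a hypothetical helper name.

```shell
# Compute the contig N50 of a FASTA file.
n50() {
    # Step 1: print one sequence length per record (handles multi-line sequences).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Step 2: sort lengths from longest to shortest.
    sort -rn |
    # Step 3: walk down the list until half of the total bases are covered.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) { print lens[i]; exit }
             }
         }'
}

# Example with the assembly from this tutorial:
# n50 trinity_reference/Trinity.fasta
```

For contigs of length 8, 6, 4 and 2 (20 bases in total), the running sum first reaches 10 at the second contig, so the N50 is 6.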
