
# De novo assembly using Trinity

Trinity is one of the most popular software packages for efficient and robust *de novo* reconstruction of transcriptomes from RNA-Seq data. It consists of three software modules, Inchworm, Chrysalis and Butterfly, which run sequentially to process the sequencing reads.

> Quote from [Trinity](https://github.com/trinityrnaseq/trinityrnaseq) GitHub:
>
> * **Inchworm** assembles the RNA-seq data into the unique sequences of transcripts, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.
> * **Chrysalis** clusters the Inchworm contigs into clusters and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or sets of genes that share sequences in common). Chrysalis then partitions the full read set among these disjoint graphs.
> * **Butterfly** then processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that corresponds to paralogous genes.

## Materials

The Trinity developers have provided [training materials](https://github.com/trinityrnaseq/RNASeq_Trinity_Tuxedo_Workshop/wiki), and the raw data and the software required are built into a VirtualBox image (Trinity2015.ova). I have saved a copy on **ALPS1**. The RNA-Seq data are 76 bp strand-specific Illumina RNA-Seq paired-end reads derived from *Schizosaccharomyces pombe* (fission yeast) grown under 4 conditions:

1. logarithmic growth (Sp\_log)
2. plateau phase (Sp\_plat)
3. heat shock (Sp\_hs)
4. diauxic shift (Sp\_ds)

\* Because of GitBook's space limits, the `fq.gz` files are not provided here. Please obtain them from the VirtualBox image \[[Link](https://data.broadinstitute.org/Trinity/RNASEQ_WORKSHOP/Trinity2015.ova)]

```
-rw-rw-r-- 1 ycl6 ycl6  5790168 Oct 27 11:35 RNASEQ_data/Sp_ds.left.fq.gz
-rw-rw-r-- 1 ycl6 ycl6  5590326 Oct 27 11:35 RNASEQ_data/Sp_ds.right.fq.gz
-rw-rw-r-- 1 ycl6 ycl6  5815390 Oct 27 11:35 RNASEQ_data/Sp_hs.left.fq.gz
-rw-rw-r-- 1 ycl6 ycl6  5751383 Oct 27 11:36 RNASEQ_data/Sp_hs.right.fq.gz
-rw-rw-r-- 1 ycl6 ycl6  2154125 Oct 27 11:36 RNASEQ_data/Sp_log.left.fq.gz
-rw-rw-r-- 1 ycl6 ycl6  2097534 Oct 27 11:36 RNASEQ_data/Sp_log.right.fq.gz
-rw-rw-r-- 1 ycl6 ycl6  5488286 Oct 27 11:36 RNASEQ_data/Sp_plat.left.fq.gz
-rw-rw-r-- 1 ycl6 ycl6  5238362 Oct 27 11:36 RNASEQ_data/Sp_plat.right.fq.gz
```
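
Each of the four conditions contributes one left and one right read file, and Trinity accepts several files per side as comma-separated lists. The sketch below builds those lists, assuming the files sit in `RNASEQ_data/` as listed above. The `--SS_lib_type RF`, `--CPU` and `--max_memory` values are assumptions for illustration and are not taken from this page; confirm the strand orientation for your own library. The command is only printed, not executed.

```shell
# Sample prefixes, taken from the file listing above
samples="Sp_ds Sp_hs Sp_log Sp_plat"
data_dir="RNASEQ_data"

left=""
right=""
for s in $samples; do
    # ${var:+$var,} adds a comma only when the list is already non-empty
    left="${left:+$left,}$data_dir/$s.left.fq.gz"
    right="${right:+$right,}$data_dir/$s.right.fq.gz"
done

# Print (do not run) a candidate Trinity command.
# RF, 2 CPUs and 1G of memory are assumed values; adjust them to your data and machine.
echo "Trinity --seqType fq --SS_lib_type RF --left $left --right $right --CPU 2 --max_memory 1G"
```

Keeping the sample prefixes in one variable means the left and right lists stay in the same order, which matters because Trinity pairs the files by position.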

## Software

### [Trinity](https://github.com/trinityrnaseq/trinityrnaseq)

* **v2.2.0** \[17 Mar 2016] - Latest version available at the time of writing and used in this exercise
* v2.0.6 \[13 Mar 2015] - Latest version available on **ALPS1**

### [Bowtie](http://bowtie-bio.sourceforge.net/index.shtml)

* **v1.1.2** \[23 Jun 2015] - Latest version available at the time of writing and used in this exercise
* v1.0.1 \[14 Mar 2014] - Latest version available on **ALPS1**

### [GMAP](http://research-pub.gene.com/gmap) (Genomic Mapping and Alignment Program)

* **v2016-09-23** - Latest version available at the time of writing and used in this exercise

### [STAR](https://github.com/alexdobin/STAR) (Spliced Transcripts Alignment to a Reference)

* **v2.5.2b** \[20 Aug 2016] - Latest version available at the time of writing and used in this exercise
* v2.3.0e \[14 Feb 2013] - Latest version available on **ALPS1**

### [SAMtools](http://www.htslib.org/)

* **v1.3.1** \[22 Apr 2016] - Latest version available at the time of writing and used in this exercise
* v1.2 \[02 Feb 2015] - Latest version available on **ALPS1**

### [RSEM](https://github.com/deweylab/RSEM) (RNA-Seq by Expectation-Maximization)

* v1.3.0 \[02 Oct 2016] - Latest version available at the time of writing
* **v1.2.31** \[04 Jun 2016] - Version used in this exercise
* v1.2.19 \[05 Nov 2014] - Latest version available on **ALPS1**

## Set JAVA\_HOME and PATH

Bowtie 1 (*NOT* Bowtie 2) is required by the **Chrysalis** module.

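\* Below is an example showing how to set up the paths. Please remember to change the paths to match where these binaries are installed on your system. Note that the versions in these example paths do not all match the versions listed above.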

```
cd ~/

# JAVA_HOME should point to the JDK installation directory, not the java binary
export JAVA_HOME=/pkg/java/jdk1.7.0_51

export PATH=/pkg/java/jdk1.7.0_51/bin:/pkg/biology/Bowtie/bowtie-1.0.1:\
/work3/LSLNGS2015/Tools/RSEM-1.2.23:/pkg/biology/R/R-3.1.2/bin:\
/work3/LSLNGS2015/Tools/gmap-2015-09-29/bin:/pkg/biology/samtools/samtools-1.2:\
/work3/LSLNGS2015/Tools/STAR-STAR_2.4.2a/bin/Linux_x86_64_static:\
/pkg/biology/trinity/trinityrnaseq-2.0.6:$PATH
```

You can use `echo $PATH` to check the new PATH variable.
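
Beyond inspecting `$PATH` by eye, it can help to confirm that each tool actually resolves. The sketch below uses `command -v` for that. The helper `missing_tools` is a hypothetical name introduced here, and the executable names are the usual ones for each package; adjust them if your installation differs.

```shell
# Print the name of every given executable that cannot be found on PATH
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Executables assumed for the software listed above
missing=$(missing_tools java Trinity bowtie gmap STAR samtools rsem-calculate-expression R)

if [ -n "$missing" ]; then
    echo "Not found on PATH:"
    echo "$missing"
else
    echo "All required tools were found on PATH."
fi
```

If a tool is reported as missing, check that its directory was added to `PATH` in the `export` line above and that the path exists on your system.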

