# QC & trimming

## QC

FastQC is run with 2 threads (`-t 2`) and took about **1.3** hours to complete. Results are placed in the `fastqc` folder.

```bash
$ cd /home/USER/SSAPs
$ mkdir fastqc

$ declare -a runname=("ERR2675454" "ERR2675455" "ERR2675458" "ERR2675459" "ERR2675460" "ERR2675461" "ERR2675464" "ERR2675465" "ERR2675468" "ERR2675469" "ERR2675472" "ERR2675473" "ERR2675476" "ERR2675477" "ERR2675478" "ERR2675479" "ERR2675480" "ERR2675481" "ERR2675484" "ERR2675485")

# Run FastQC on both mates of each paired-end run
for id in "${runname[@]}"; do
        fq1=fastqs/${id}_1.fastq.gz
        fq2=fastqs/${id}_2.fastq.gz

        fastqc -t 2 --extract -o fastqc "$fq1" "$fq2"
done
```

Results can be viewed by opening the `*.html` files in a web browser, or by reading `summary.txt` and `fastqc_data.txt` in the output folders.

{% code title="fastqc/ERR2675454\_1\_fastqc/summary.txt" %}

```bash
PASS    Basic Statistics        ERR2675454_1.fastq.gz
PASS    Per base sequence quality       ERR2675454_1.fastq.gz
PASS    Per tile sequence quality       ERR2675454_1.fastq.gz
PASS    Per sequence quality scores     ERR2675454_1.fastq.gz
WARN    Per base sequence content       ERR2675454_1.fastq.gz
PASS    Per sequence GC content ERR2675454_1.fastq.gz
PASS    Per base N content      ERR2675454_1.fastq.gz
PASS    Sequence Length Distribution    ERR2675454_1.fastq.gz
FAIL    Sequence Duplication Levels     ERR2675454_1.fastq.gz
PASS    Overrepresented sequences       ERR2675454_1.fastq.gz
FAIL    Adapter Content ERR2675454_1.fastq.gz
```

{% endcode %}
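With 40 FASTQ files, opening each `summary.txt` by hand gets tedious. The sketch below lists only the modules that did not pass. It assumes `summary.txt` is tab-separated (status, module, file name) and that the `--extract` output folders end in `_fastqc`. The function name `summarize_fastqc` is a hypothetical helper, not part of FastQC.

```shell
# Hypothetical helper: list every WARN/FAIL module found in the
# summary.txt files under a FastQC output directory (default: fastqc).
summarize_fastqc() {
    local dir="${1:-fastqc}"
    # Assumed columns (tab-separated): status, module name, file name
    awk -F'\t' '$1 != "PASS" { print $3 "\t" $1 "\t" $2 }' \
        "$dir"/*_fastqc/summary.txt
}
```

Calling `summarize_fastqc fastqc` from `/home/USER/SSAPs` prints one line per flagged module, which makes it easier to see whether a warning affects a single run or all of them.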

{% code title="fastqc/ERR2675454\_1\_fastqc/fastqc\_data.txt" %}

```bash
##FastQC        0.11.9
>>Basic Statistics      pass
#Measure        Value
Filename        ERR2675454_1.fastq.gz
File type       Conventional base calls
Encoding        Sanger / Illumina 1.9
Total Sequences 30273560
Sequences flagged as poor quality       0
Sequence length 151
%GC     47
>>END_MODULE
```

{% endcode %}

Per base sequence quality of `ERR2675454_1.fastq.gz`

![](/files/-M5XeZ4OmhrGLWAQLD3u)

## Adapter removal and trimming

The trimming process is run with 6 threads (`threads=6`) and took about **1.6** hours to complete.

```bash
$ mkdir trimmed

# Trim adapters and low-quality bases from each paired-end run
for id in "${runname[@]}"; do
        adapters=/home/USER/tools/bbmap/resources/adapters.fa
        fq1=fastqs/${id}_1.fastq.gz
        fq2=fastqs/${id}_2.fastq.gz
        trim1=trimmed/${id}_1.fastq.gz
        trim2=trimmed/${id}_2.fastq.gz
        log=trimmed/${id}.log

        # BBDuk writes its statistics to stderr, so redirect stderr to the log
        bbduk.sh threads=6 in1="$fq1" in2="$fq2" out1="$trim1" out2="$trim2" \
        ref="$adapters" tbo tpe ktrim=r k=21 mink=9 hdist=1 \
        qtrim=rl trimq=15 minlength=36 maxns=1 2> "$log"
done
```

```bash
# BBDuk parameters
tbo          - trim adapters based on pair overlap detection using BBMerge
tpe          - trim both reads to the same length
ktrim=r      - once a reference kmer is matched in a read, trim that kmer and all bases to its right, keeping only the bases to its left
k=21         - kmer length used for finding contaminants
mink=9       - look for shorter kmers at read tips, down to length 9
hdist=1      - maximum Hamming distance for ref kmers
qtrim=rl trimq=15 - quality-trim both ends to Q15 using the Phred algorithm
minlength=36 - discard reads shorter than 36 bp after trimming
maxns=1      - discard reads with more than 1 N after trimming
```

```bash
$ cd trimmed
$ ls
ERR2675454_1.fastq.gz  ERR2675461_1.fastq.gz  ERR2675472_1.fastq.gz  ERR2675479_1.fastq.gz
ERR2675454_2.fastq.gz  ERR2675461_2.fastq.gz  ERR2675472_2.fastq.gz  ERR2675479_2.fastq.gz
ERR2675454.log         ERR2675461.log         ERR2675472.log         ERR2675479.log
ERR2675455_1.fastq.gz  ERR2675464_1.fastq.gz  ERR2675473_1.fastq.gz  ERR2675480_1.fastq.gz
ERR2675455_2.fastq.gz  ERR2675464_2.fastq.gz  ERR2675473_2.fastq.gz  ERR2675480_2.fastq.gz
ERR2675455.log         ERR2675464.log         ERR2675473.log         ERR2675480.log
ERR2675458_1.fastq.gz  ERR2675465_1.fastq.gz  ERR2675476_1.fastq.gz  ERR2675481_1.fastq.gz
ERR2675458_2.fastq.gz  ERR2675465_2.fastq.gz  ERR2675476_2.fastq.gz  ERR2675481_2.fastq.gz
ERR2675458.log         ERR2675465.log         ERR2675476.log         ERR2675481.log
ERR2675459_1.fastq.gz  ERR2675468_1.fastq.gz  ERR2675477_1.fastq.gz  ERR2675484_1.fastq.gz
ERR2675459_2.fastq.gz  ERR2675468_2.fastq.gz  ERR2675477_2.fastq.gz  ERR2675484_2.fastq.gz
ERR2675459.log         ERR2675468.log         ERR2675477.log         ERR2675484.log
ERR2675460_1.fastq.gz  ERR2675469_1.fastq.gz  ERR2675478_1.fastq.gz  ERR2675485_1.fastq.gz
ERR2675460_2.fastq.gz  ERR2675469_2.fastq.gz  ERR2675478_2.fastq.gz  ERR2675485_2.fastq.gz
ERR2675460.log         ERR2675469.log         ERR2675478.log         ERR2675485.log
```
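Rather than checking the listing by eye, the expected outputs can be verified per run. This is a minimal sketch: `check_trimmed` is a hypothetical helper that takes the output folder followed by the run IDs and reports any of the three expected files that is missing or empty.

```shell
# Hypothetical helper: confirm each run has two trimmed FASTQ files and a log.
# Usage: check_trimmed <dir> <run_id>...
check_trimmed() {
    local dir="$1"
    shift
    local id f missing=0
    for id in "$@"; do
        for f in "${id}_1.fastq.gz" "${id}_2.fastq.gz" "${id}.log"; do
            # -s is true only if the file exists and is not empty
            if [ ! -s "$dir/$f" ]; then
                echo "missing or empty: $dir/$f" >&2
                missing=1
            fi
        done
    done
    return "$missing"
}
```

From `/home/USER/SSAPs`, `check_trimmed trimmed "${runname[@]}"` returns a non-zero status if any run is incomplete, so it can gate the next pipeline step.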

The generated log files report the number of reads and bases that were removed and the number that passed trimming.

{% code title="trimmed/ERR2675454.log" %}

```bash
Input:                  60547120 reads             9142615120 bases.
QTrimmed:               19033902 reads (31.44%)    366761527 bases (4.01%)
KTrimmed:               26320518 reads (43.47%)    982152960 bases (10.74%)
Trimmed by overlap:     3581100 reads (5.91%)      18700662 bases (0.20%)
Low quality discards:   16230 reads (0.03%)        2046404 bases (0.02%)
Total Removed:          831678 reads (1.37%)       1369661553 bases (14.98%)
Result:                 59715442 reads (98.63%)    7772953567 bases (85.02%)

Time:                           300.984 seconds.
Reads Processed:      60547k    201.16k reads/sec
Bases Processed:       9142m    30.38m bases/sec
```

{% endcode %}
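To compare retention across all runs, the `Result:` line of each log can be collected into one table. The sketch below assumes every log contains a `Result:` line laid out as in the example above (read count, read percentage, base count, base percentage, separated by whitespace). `summarize_bbduk` is a hypothetical helper name.

```shell
# Hypothetical helper: print run ID, surviving reads, read %, surviving
# bases and base % from each BBDuk log in a directory (default: trimmed).
summarize_bbduk() {
    local dir="${1:-trimmed}"
    local log
    for log in "$dir"/*.log; do
        # Fields on the Result: line, split on whitespace:
        # $2 = reads, $4 = (reads %), $5 = bases, $7 = (bases %)
        awk -v id="$(basename "$log" .log)" \
            '$1 == "Result:" { print id "\t" $2 "\t" $4 "\t" $5 "\t" $7 }' "$log"
    done
}
```

Running `summarize_bbduk trimmed` from `/home/USER/SSAPs` gives one tab-separated row per run, which can be redirected to a file or pasted into a spreadsheet.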

