QC & trimming

QC

The QC process (FastQC) is run with 2 threads (-t 2) and took about 1.3 hours to complete. Results are placed in the fastqc folder.

$ cd /home/USER/SSAPs
$ mkdir fastqc

$ declare -a runname=("ERR2675454" "ERR2675455" "ERR2675458" "ERR2675459" "ERR2675460" "ERR2675461" "ERR2675464" "ERR2675465" "ERR2675468" "ERR2675469" "ERR2675472" "ERR2675473" "ERR2675476" "ERR2675477" "ERR2675478" "ERR2675479" "ERR2675480" "ERR2675481" "ERR2675484" "ERR2675485")

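for id in "${runname[@]}"; do
        fq1=fastqs/${id}_1.fastq.gz
        fq2=fastqs/${id}_2.fastq.gz

        # -t 2 lets FastQC process both files of the pair at once;
        # --extract unpacks each zipped report into its own folder
        fastqc -t 2 --extract -o fastqc "$fq1" "$fq2"
done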

Results can be viewed by opening the *.html files in a web browser, or by reading summary.txt and fastqc_data.txt in the output folders.

fastqc/ERR2675454_1_fastqc/summary.txt
PASS    Basic Statistics        ERR2675454_1.fastq.gz
PASS    Per base sequence quality       ERR2675454_1.fastq.gz
PASS    Per tile sequence quality       ERR2675454_1.fastq.gz
PASS    Per sequence quality scores     ERR2675454_1.fastq.gz
WARN    Per base sequence content       ERR2675454_1.fastq.gz
PASS    Per sequence GC content ERR2675454_1.fastq.gz
PASS    Per base N content      ERR2675454_1.fastq.gz
PASS    Sequence Length Distribution    ERR2675454_1.fastq.gz
FAIL    Sequence Duplication Levels     ERR2675454_1.fastq.gz
PASS    Overrepresented sequences       ERR2675454_1.fastq.gz
FAIL    Adapter Content ERR2675454_1.fastq.gz
fastqc/ERR2675454_1_fastqc/fastqc_data.txt
##FastQC        0.11.9
>>Basic Statistics      pass
#Measure        Value
Filename        ERR2675454_1.fastq.gz
File type       Conventional base calls
Encoding        Sanger / Illumina 1.9
Total Sequences 30273560
Sequences flagged as poor quality       0
Sequence length 151
%GC     47
>>END_MODULE

Per base sequence quality of ERR2675454_1.fastq.gz
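
With 40 reports (20 runs, two files each), checking every summary.txt by hand gets tedious. The following is a minimal sketch, not part of the original workflow, that lists every module flagged WARN or FAIL across all reports. It assumes the tab-separated summary.txt layout shown above and the fastqc output folder created earlier; the function name fastqc_flags is hypothetical.

```shell
# Hypothetical helper: print file name, status and module name for every
# FastQC module that did not PASS.
# $1: FastQC output folder (defaults to "fastqc")
fastqc_flags() {
    for f in "${1:-fastqc}"/*_fastqc/summary.txt; do
        # Skip the unexpanded glob pattern when no report exists
        [ -e "$f" ] || continue
        # summary.txt columns: status <TAB> module <TAB> file name
        awk -F'\t' '$1 != "PASS" { print $3 "\t" $1 "\t" $2 }' "$f"
    done
}

# Usage: fastqc_flags fastqc
```

Each output line names one flagged module. For ERR2675454_1.fastq.gz the list would include the FAIL entries for Sequence Duplication Levels and Adapter Content.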

Adapter removal and trimming

The trimming process is run with 6 threads (threads=6) and took about 1.6 hours to complete.

$ mkdir trimmed

for id in "${runname[@]}"; do
        # Adapter sequences shipped with BBMap
        adapters=/home/USER/tools/bbmap/resources/adapters.fa
        fq1=fastqs/${id}_1.fastq.gz
        fq2=fastqs/${id}_2.fastq.gz
        trim1=trimmed/${id}_1.fastq.gz
        trim2=trimmed/${id}_2.fastq.gz
        log=trimmed/${id}.log

        # BBDuk prints its statistics to stderr, so redirect that to the log
        bbduk.sh threads=6 in1="$fq1" in2="$fq2" out1="$trim1" out2="$trim2" \
        ref="$adapters" tbo tpe ktrim=r k=21 mink=9 hdist=1 \
        qtrim=rl trimq=15 minlength=36 maxns=1 2> "$log"
done
# BBDuk parameters
tbo          - trim adapters based on pair overlap detection using BBMerge
tpe          - trim both reads of a pair to the same length
ktrim=r      - once a reference kmer is matched in a read, that kmer and all the bases to its right are trimmed, leaving only the bases to the left
k=21         - kmer length used for finding contaminants
mink=9       - look for shorter kmers at read tips, down to length 9
hdist=1      - maximum Hamming distance allowed when matching ref kmers
qtrim=rl trimq=15 - quality-trim both ends (right and left) to Q15 using the Phred algorithm
minlength=36 - discard reads shorter than 36 bp after trimming
maxns=1      - discard reads with more than 1 N after trimming
$ cd trimmed
$ ls
ERR2675454_1.fastq.gz  ERR2675461_1.fastq.gz  ERR2675472_1.fastq.gz  ERR2675479_1.fastq.gz
ERR2675454_2.fastq.gz  ERR2675461_2.fastq.gz  ERR2675472_2.fastq.gz  ERR2675479_2.fastq.gz
ERR2675454.log         ERR2675461.log         ERR2675472.log         ERR2675479.log
ERR2675455_1.fastq.gz  ERR2675464_1.fastq.gz  ERR2675473_1.fastq.gz  ERR2675480_1.fastq.gz
ERR2675455_2.fastq.gz  ERR2675464_2.fastq.gz  ERR2675473_2.fastq.gz  ERR2675480_2.fastq.gz
ERR2675455.log         ERR2675464.log         ERR2675473.log         ERR2675480.log
ERR2675458_1.fastq.gz  ERR2675465_1.fastq.gz  ERR2675476_1.fastq.gz  ERR2675481_1.fastq.gz
ERR2675458_2.fastq.gz  ERR2675465_2.fastq.gz  ERR2675476_2.fastq.gz  ERR2675481_2.fastq.gz
ERR2675458.log         ERR2675465.log         ERR2675476.log         ERR2675481.log
ERR2675459_1.fastq.gz  ERR2675468_1.fastq.gz  ERR2675477_1.fastq.gz  ERR2675484_1.fastq.gz
ERR2675459_2.fastq.gz  ERR2675468_2.fastq.gz  ERR2675477_2.fastq.gz  ERR2675484_2.fastq.gz
ERR2675459.log         ERR2675468.log         ERR2675477.log         ERR2675484.log
ERR2675460_1.fastq.gz  ERR2675469_1.fastq.gz  ERR2675478_1.fastq.gz  ERR2675485_1.fastq.gz
ERR2675460_2.fastq.gz  ERR2675469_2.fastq.gz  ERR2675478_2.fastq.gz  ERR2675485_2.fastq.gz
ERR2675460.log         ERR2675469.log         ERR2675478.log         ERR2675485.log
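
Rather than reading the listing by eye, the presence of all three outputs per run can be checked in a loop. This is a minimal sketch under the naming scheme used above (ID_1.fastq.gz, ID_2.fastq.gz, ID.log); the function name check_trimmed is hypothetical and not part of the original workflow.

```shell
# Hypothetical helper: report any expected trimming output that is
# missing or empty.
# $1: output folder; remaining arguments: run IDs
check_trimmed() {
    dir=$1
    shift
    missing=0
    for id in "$@"; do
        for suffix in _1.fastq.gz _2.fastq.gz .log; do
            # -s is true only for files that exist and are not empty
            if [ ! -s "$dir/$id$suffix" ]; then
                echo "missing or empty: $dir/$id$suffix"
                missing=1
            fi
        done
    done
    # Non-zero exit status when at least one file is missing
    return "$missing"
}

# Usage, run from /home/USER/SSAPs: check_trimmed trimmed "${runname[@]}"
```

The function prints nothing and returns 0 when every run has both trimmed FASTQ files and a log.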

The generated log files report how many reads and bases were removed and how many passed trimming.

trimmed/ERR2675454.log
Input:                  60547120 reads             9142615120 bases.
QTrimmed:               19033902 reads (31.44%)    366761527 bases (4.01%)
KTrimmed:               26320518 reads (43.47%)    982152960 bases (10.74%)
Trimmed by overlap:     3581100 reads (5.91%)      18700662 bases (0.20%)
Low quality discards:   16230 reads (0.03%)        2046404 bases (0.02%)
Total Removed:          831678 reads (1.37%)       1369661553 bases (14.98%)
Result:                 59715442 reads (98.63%)    7772953567 bases (85.02%)

Time:                           300.984 seconds.
Reads Processed:      60547k    201.16k reads/sec
Bases Processed:       9142m    30.38m bases/sec
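
To compare all 20 samples at a glance, the Result line of each log can be collected into one table. The following is a minimal sketch, assuming the log layout shown above (whitespace-separated fields on a line starting with Result:) and the trimmed folder used earlier; the function name bbduk_results and the column names are hypothetical.

```shell
# Hypothetical helper: one tab-separated row per BBDuk log with the reads
# and bases that survived trimming.
# $1: folder holding the *.log files (defaults to "trimmed")
bbduk_results() {
    printf 'sample\treads_kept\treads_pct\tbases_kept\tbases_pct\n'
    for log in "${1:-trimmed}"/*.log; do
        # Skip the unexpanded glob pattern when no log exists
        [ -e "$log" ] || continue
        sample=$(basename "$log" .log)
        # Fields: Result: <reads> reads (<pct>) <bases> bases (<pct>)
        awk -v s="$sample" '$1 == "Result:" {
            gsub(/[()]/, "", $4)   # strip parentheses around the read percentage
            gsub(/[()]/, "", $7)   # strip parentheses around the base percentage
            print s "\t" $2 "\t" $4 "\t" $5 "\t" $7
        }' "$log"
    done
}

# Usage, run from /home/USER/SSAPs: bbduk_results trimmed > trimming_summary.tsv
```

For the log shown above, the row would hold ERR2675454, 59715442, 98.63%, 7772953567 and 85.02%. The output file name trimming_summary.tsv is only an example.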
