I am working with NGS data (RNA-Seq): paired-end reads, 100 bp long, from the Illumina HiSeq platform.
I am new to this and am doing it for my thesis. I am stuck at the QC trimming step: I cannot decide which part of the read length has a quality score low enough to cut off, so I cannot move on to the next step, trimming with fastx_trimmer.
I followed these steps:
Step 1: Download the SRA files and install the related software
Step 2: Convert SRA format to FASTQ format
Step 3: Filter the lower-quality data
3.1 Calculated the per-base quality statistics:
[root@BIO-DT-415 Excersise]# /usr/local/bin/fastx_quality_stats -i SRR1604991_1.fastq -o SRR1604991_1.txt -Q33 | /usr/local/bin/fastx_quality_stats -i SRR1604991_2.fastq -o SRR1604991_2.txt -Q33
3.2 Box plotting
[root@BIO-DT-415 Excersise]# fastq_quality_boxplot_graph.sh -i SRR1604991_1.txt -o SRR1604991_1.png -t "RNA_SEQ_GRA_PLOT " -Q33 | fastq_quality_boxplot_graph.sh -i SRR1604991_2.txt -o SRR1604991_2.png -t "RNA_SEQ_GRA_PLOT " -Q33
3.3 Filtered the reads with fastq_quality_filter:
[root@BIO-DT-415 Excersise]# fastq_quality_filter -q 20 -p 80 -i SRR1604991_1.fastq -o SRR1604991_1_quality_filter.fastq -Q33 | fastq_quality_filter -q 20 -p 80 -i SRR1604991_2.fastq -o SRR1604991_2_quality_filter.fastq -Q33
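For reference, the rule behind -q 20 -p 80 can be sketched in awk as follows: a read is kept only if at least 80% of its bases have a Phred score of 20 or more. This is only an illustration of the rule as described in the tool's help text, not the tool's own code. The function name quality_filter_sketch is made up, Phred+33 encoding is assumed (matching -Q33), and the real tool may treat rounding or edge cases differently.

```shell
# Hypothetical helper: reads FASTQ on stdin, writes the kept reads to stdout.
# $1 = minimum Phred score per base, $2 = minimum percent of bases reaching it.
quality_filter_sketch() {
  local min_q=$1 min_pct=$2
  awk -v q="$min_q" -v p="$min_pct" '
    # Build a character -> ASCII code table for the printable range.
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    # Remember the four lines of the current FASTQ record.
    { rec[NR % 4] = $0 }
    # Every fourth line is the quality string: decide whether to keep the read.
    NR % 4 == 0 {
      n = length($0); good = 0
      for (i = 1; i <= n; i++)
        if (ord[substr($0, i, 1)] - 33 >= q) good++   # Phred+33 assumed
      if (n > 0 && good * 100 >= p * n)
        printf "%s\n%s\n%s\n%s\n", rec[1], rec[2], rec[3], rec[0]
    }
  '
}

# Example call (file name taken from the commands above):
# quality_filter_sketch 20 80 < SRR1604991_1.fastq > kept_sketch.fastq
```

Under this rule a 100 bp read survives as long as no more than 20 of its bases fall below Q20, wherever in the read those bases are.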
3.4 After this, I ran fastx_quality_stats again on the filtered files to get the per-base quality information as .txt:
[root@BIO-DT-415 Excersise]# /usr/local/bin/fastx_quality_stats -i SRR1604991_1_quality_filter.fastq -o SRR1604991_1_quality_filter.txt -Q33 | /usr/local/bin/fastx_quality_stats -i SRR1604991_2_quality_filter.fastq -o SRR1604991_2_quality_filter.txt -Q33
3.5 I plotted the quality scores of the filtered reads with fastq_quality_boxplot_graph.sh:
[root@BIO-DT-415 Excersise]# fastq_quality_boxplot_graph.sh -i SRR1604991_1_quality_filter.txt -o SRR1604991_1_quality_filter.png -t "RNA_SEQ_GRA_PLOT " -Q33 | fastq_quality_boxplot_graph.sh -i SRR1604991_2_quality_filter.txt -o SRR1604991_2__quality_filter.png -t "RNA_SEQ_GRA_PLOT " -Q33
Please have a look at the graph I have shared:
Now the real problem:
I tried to understand the output graph, but I was not able to determine which part of the reads should be trimmed.
In SRR1604991_1_quality_filter.png, the sequencing quality drops sharply from position 59 onward. Should I trim the low-quality part of the read length before running any command that filters out low-quality reads?
I am using fastx_trimmer to do that. How do I determine the low-quality part of the read length from the box plot?
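One possible way to read a cutoff from the numbers behind the box plot, rather than by eye, is sketched below. It scans a fastx_quality_stats table and prints the first position whose lower quartile falls below a threshold. This is a sketch under assumptions, not an established rule: the function name first_low_cycle is made up, the table is assumed to be tab-separated with a header line containing a column named Q1 and the position in the first column, and both the threshold of 20 and the choice of Q1 over the median are arbitrary.

```shell
# Hypothetical helper: $1 = stats table from fastx_quality_stats,
# $2 = quality threshold (defaults to 20).
# Prints the first position whose Q1 is below the threshold, or nothing.
first_low_cycle() {
  local stats_file=$1 threshold=${2:-20}
  awk -F'\t' -v t="$threshold" '
    # Header line: find which column is named Q1 (assumed to exist).
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Q1") q1 = i; next }
    # Data lines: column 1 is assumed to hold the position in the read.
    q1 && $q1 + 0 < t { print $1; exit }
  ' "$stats_file"
}

# Example call (file name taken from the commands above):
# first_low_cycle SRR1604991_1_quality_filter.txt 20
```

If this printed 59, keeping bases 1 to 58 with fastx_trimmer would look like `fastx_trimmer -f 1 -l 58 -i SRR1604991_1_quality_filter.fastq -o SRR1604991_1_trimmed.fastq -Q33`, where the output file name is only an illustrative choice.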
I know it's a very lengthy question, but believe me, I did everything I could on my end before posting it here.
Please help me out and let me know how to interpret the data in the graph so that I can proceed further.
Thanks a ton,
Have a great day ahead!