Question: Pre-processing And Quality Control of the Raw NGS Data -- > trimming
0
4.9 years ago by
David_emir360
India
David_emir360 wrote:

Hello All,

I am working on NGS data (RNA-Seq): paired-end reads, 100 bp long, from an Illumina HiSeq platform.

I am new to this, and I am working on it for my thesis. I am stuck at the QC trimming step. I cannot decide which part of the read length has a low quality score and should be trimmed, so that I can continue with the next step of trimming using fastx_trimmer.

I followed these steps:

Step 2: Convert SRA Format to FASTQ Format
Step 3: Filter the Lower Quality Data

3.1 Calculated the corresponding quality score values:

`[root@BIO-DT-415 Excersise]# /usr/local/bin/fastx_quality_stats -i SRR1604991_1.fastq -o SRR1604991_1.txt -Q33 && /usr/local/bin/fastx_quality_stats -i SRR1604991_2.fastq -o SRR1604991_2.txt -Q33`
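
For readers new to these files: the statistics computed in this step are per-position summaries of the Phred scores across all reads, and the box plot in the next step draws them. A rough Python sketch of the idea, assuming Phred+33 encoding (what `-Q33` indicates); the function names and the toy quality strings are made up for illustration and this is not the FASTX code itself:

```python
from statistics import median


def phred33_scores(quality_line):
    """Decode a FASTQ quality string (Phred+33) into integer scores."""
    return [ord(ch) - 33 for ch in quality_line]


def per_position_summary(quality_lines):
    """Return one (position, min, median, max) tuple per read position (1-based)."""
    columns = {}
    for line in quality_lines:
        for pos, score in enumerate(phred33_scores(line), start=1):
            columns.setdefault(pos, []).append(score)
    return [(pos, min(s), median(s), max(s)) for pos, s in sorted(columns.items())]


# Three made-up 4-base reads, used purely for illustration.
toy_quality_lines = ["IIII", "II5#", "I?5#"]
summary = per_position_summary(toy_quality_lines)
for row in summary:
    print(row)
```

In this toy input the median drops from 40 at position 1 to 2 at position 4, which is the kind of downward trend a box plot makes visible.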

3.2 Box plotting

`[root@BIO-DT-415 Excersise]# fastq_quality_boxplot_graph.sh -i SRR1604991_1.txt -o SRR1604991_1.png -t "RNA_SEQ_GRA_PLOT" -Q33 && fastq_quality_boxplot_graph.sh -i SRR1604991_2.txt -o SRR1604991_2.png -t "RNA_SEQ_GRA_PLOT" -Q33`

3.3 Using FASTQ Quality Filter:

`[root@BIO-DT-415 Excersise]# fastq_quality_filter -q 20 -p 80 -i SRR1604991_1.fastq -o SRR1604991_1_quality_filter.fastq -Q33 && fastq_quality_filter -q 20 -p 80 -i SRR1604991_2.fastq -o SRR1604991_2_quality_filter.fastq -Q33`
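
For reference, the `-q 20 -p 80` setting keeps a read only if at least 80% of its bases have a quality score of 20 or higher. A minimal Python sketch of that rule, assuming Phred+33 encoding (`-Q33`); the function name is hypothetical and the real tool's rounding may differ:

```python
def passes_quality_filter(quality_line, min_quality=20, min_percent=80, offset=33):
    """Return True if at least min_percent of the bases score >= min_quality."""
    scores = [ord(ch) - offset for ch in quality_line]
    if not scores:
        return False  # assumption: an empty read is never kept
    good = sum(1 for s in scores if s >= min_quality)
    # Integer comparison avoids floating-point issues at the exact threshold.
    return good * 100 >= min_percent * len(scores)


# Made-up quality strings: 'I' is Q40, '#' is Q2.
print(passes_quality_filter("IIII#"))  # 4 of 5 bases are good (80%)
print(passes_quality_filter("III##"))  # 3 of 5 bases are good (60%)
```

Note that this rule keeps or drops whole reads and is applied to each file separately, so a read can be dropped from one file while its mate survives in the other. Paired-end aligners typically expect the two files to stay in sync.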

3.4 After this, I converted the filtered files to .txt format to get the per-nucleotide quality information using fastx_quality_stats:

`[root@BIO-DT-415 Excersise]# /usr/local/bin/fastx_quality_stats -i SRR1604991_1_quality_filter.fastq -o SRR1604991_1_quality_filter.txt -Q33 && /usr/local/bin/fastx_quality_stats -i SRR1604991_2_quality_filter.fastq -o SRR1604991_2_quality_filter.txt -Q33`

3.5 I then box-plotted the quality scores using fastq_quality_boxplot_graph.sh:

`[root@BIO-DT-415 Excersise]# fastq_quality_boxplot_graph.sh -i SRR1604991_1_quality_filter.txt -o SRR1604991_1_quality_filter.png -t "RNA_SEQ_GRA_PLOT" -Q33 && fastq_quality_boxplot_graph.sh -i SRR1604991_2_quality_filter.txt -o SRR1604991_2__quality_filter.png -t "RNA_SEQ_GRA_PLOT" -Q33`

Please have a look at the graph I have shared:

Now the real problem:

I tried to understand the output graph, but I was not able to determine which reads should be trimmed.

In SRR1604991_1_quality_filter.png, the sequencing quality drops sharply from position 59 bp onward. Should I trim the low-quality part of the read length before running any command that filters low-quality reads? I am using `fastx_trimmer` to do that. How do I determine the low-quality part of the read length from the box plot?

I know it's a very lengthy question, but believe me, I did everything possible from my end before putting this question to you guys.

Please help me out. Let me know how to interpret the data from the graph so that I can proceed further.

Thanks a Ton,

Have a great Day ahead !!!

-Ateeq Khaliq

3
4.9 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

I absolutely agree that fastx tools should be avoided. But in my testing, BBDuk greatly outperforms Trimmomatic (and Cutadapt, Skewer, and Trimgalore, which is just a wrapper) in speed, sensitivity, and specificity, for both quality- and adapter-trimming. For quality-trimming it's provably superior (as it uses the optimal Phred algorithm, which is not used by any other current program, to my knowledge, other than seqtk), though for adapter-trimming the evidence is only empirical. And (in my opinion) it's much easier to use than Trimmomatic, but that's subjective. Also note that I am biased since I wrote BBDuk.
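
For anyone curious what a Phred-style quality trim looks like, here is a rough Python sketch of the commonly described formulation (often called the modified Mott algorithm): each base scores the cutoff error rate minus its own error probability, and the contiguous stretch with the highest total score is kept. This is only an illustration under that assumption; it is not the source code of BBDuk or seqtk, and the function name and the default cutoff are made up:

```python
def phred_trim(scores, min_quality=10):
    """Return (start, end) of the region to keep, with end exclusive.

    Each base contributes (cutoff error - base error probability); the
    contiguous region with the highest total is kept.
    """
    cutoff = 10 ** (-min_quality / 10)
    best_sum, best_start, best_end = 0.0, 0, 0
    run_sum, run_start = 0.0, 0
    for i, q in enumerate(scores):
        run_sum += cutoff - 10 ** (-q / 10)
        if run_sum <= 0:
            # The current stretch is no longer worth keeping; start over.
            run_sum, run_start = 0.0, i + 1
        elif run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_start, best_end


# Made-up Phred scores: poor at both ends, good in the middle.
toy_scores = [2, 2, 30, 35, 40, 38, 3, 2]
start, end = phred_trim(toy_scores, min_quality=10)
print(start, end, toy_scores[start:end])
```

On the toy scores this keeps the four good bases in the middle and drops the poor bases at both ends. A read with no base above the cutoff yields an empty region.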

Also, BBDuk can replicate a lot of the statistics you get from FastQC, but not all of them, and not as conveniently (it outputs text, while FastQC outputs images). So I also highly recommend FastQC as another great tool.

1

Thanks a lot Brian, I will try using BBDuk as well !!!

Have a great day ahead !!!

I will, thanks =)

1

I always forget to mention BBDuk...I really need to break that habit!

2
4.9 years ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:
1. Just use FastQC and make your life easier. It can do all of the plotting and everything in a single step
2. Don't use fastx tools

From the plots you showed, your reads won't need much of any trimming. Just put them through trimmomatic/trim_galore/skewer/etc. to adapter and quality trim them (don't aggressively quality trim...just remove bases from the ends below Q5 or so).
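
To make "remove bases from the ends below Q5" concrete, here is a minimal Python sketch of that kind of gentle end trimming, assuming Phred+33 quality strings. The function name is hypothetical, and real trimmers use their own (often windowed) algorithms, so treat this as an illustration rather than what any of those tools actually do:

```python
def trim_low_quality_ends(sequence, quality_line, min_quality=5, offset=33):
    """Strip bases scoring below min_quality from both ends of a read."""
    scores = [ord(ch) - offset for ch in quality_line]
    start, end = 0, len(scores)
    while start < end and scores[start] < min_quality:
        start += 1
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    # Sequence and quality string must always be cut identically.
    return sequence[start:end], quality_line[start:end]


# Made-up read: '#' is Q2, '%' is Q4, '$' is Q3, 'I' is Q40.
print(trim_low_quality_ends("ACGTACGT", "#%IIII$#"))
```

Only the ends are touched: a low-quality base in the middle of the read is left in place.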

Thanks a lot Devon, this means a lot to me. But how do I determine the quality of the reads from the plot? It would be great if you could let me know how to interpret the plot. When should I trim a read (with respect to the plots)?

Thanks a lot !!!

1

You'll trim the entire file and not bother with reads individually (there could be tens of millions of them...that'd take a while). Plots like this won't determine whether you trim or not, as you should always adapter and quality trim. However, if a plot like this dips to 0 (or near it) in the middle, then you know that a bubble went through the flow cell and you'll have slightly fewer alignments. Similarly, if the quality really plummets (going to 30 isn't plummeting, that's just a slight decrease) near the end, then you know the trimmer is going to remove more and you might lose some reads. Anyway, just give the trimmer (not one from fastx tools!) both files and the parameters for trimming and let it do its thing.

1

As Devon said, the FastQC plots will show the quality of your reads more clearly. Try it :). There is also well-written documentation, with example FastQC output plots, here: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Thanks Airan !!!