Question: Pre-processing and Quality Control of Raw NGS Data: Trimming
David_emir (200), India, wrote 3.0 years ago:

Hello All,

I am working with paired-end RNA-Seq data (100 bp reads) from the Illumina HiSeq platform.

I am new to this and am doing it for my thesis. I am stuck at the QC trimming step: I cannot decide which part of the read length has low quality scores, so I cannot move on to trimming with fastx_trimmer.

I followed the following steps:

    Step 1: Download the SRA files And Install Related Software
    Step 2: Convert SRA Format to FASTQ Format
    Step 3: Filter the Lower Quality Data

3.1 Calculated the per-position quality statistics:

[root@BIO-DT-415 Excersise]# /usr/local/bin/fastx_quality_stats -i SRR1604991_1.fastq -o SRR1604991_1.txt -Q33
[root@BIO-DT-415 Excersise]# /usr/local/bin/fastx_quality_stats -i SRR1604991_2.fastq -o SRR1604991_2.txt -Q33

3.2 Box plotting

[root@BIO-DT-415 Excersise]# fastq_quality_boxplot_graph.sh -i SRR1604991_1.txt -o SRR1604991_1.png -t "RNA_SEQ_GRA_PLOT" -Q33
[root@BIO-DT-415 Excersise]# fastq_quality_boxplot_graph.sh -i SRR1604991_2.txt -o SRR1604991_2.png -t "RNA_SEQ_GRA_PLOT" -Q33

3.3 Filtering with fastq_quality_filter:

[root@BIO-DT-415 Excersise]# fastq_quality_filter -q 20 -p 80 -i SRR1604991_1.fastq -o SRR1604991_1_quality_filter.fastq -Q33
[root@BIO-DT-415 Excersise]# fastq_quality_filter -q 20 -p 80 -i SRR1604991_2.fastq -o SRR1604991_2_quality_filter.fastq -Q33

3.4 After this, I ran fastx_quality_stats again on the filtered files to get the per-position quality information as text:

[root@BIO-DT-415 Excersise]# /usr/local/bin/fastx_quality_stats -i SRR1604991_1_quality_filter.fastq -o SRR1604991_1_quality_filter.txt -Q33
[root@BIO-DT-415 Excersise]# /usr/local/bin/fastx_quality_stats -i SRR1604991_2_quality_filter.fastq -o SRR1604991_2_quality_filter.txt -Q33

3.5 I drew box plots of the quality scores using fastq_quality_boxplot_graph.sh:

[root@BIO-DT-415 Excersise]# fastq_quality_boxplot_graph.sh -i SRR1604991_1_quality_filter.txt -o SRR1604991_1_quality_filter.png -t "RNA_SEQ_GRA_PLOT" -Q33
[root@BIO-DT-415 Excersise]# fastq_quality_boxplot_graph.sh -i SRR1604991_2_quality_filter.txt -o SRR1604991_2_quality_filter.png -t "RNA_SEQ_GRA_PLOT" -Q33

Please have a look at the graph I have shared:

https://plus.google.com/+AteeqKhaliq/posts/gWQ8tzb9ytp?pid=6083369332084093970&oid=107068338849842971135

Now the real problem:

I tried to understand the output graph, but I was not able to determine which reads should be trimmed.

In SRR1604991_1_quality_filter.png, the sequencing quality drops sharply from about position 59 onward. Should I trim the low-quality part of the reads before running any command that filters low-quality reads? I am using fastx_trimmer to do that. How do I determine the low-quality part of the read length from the box plot?

I know it's a very lengthy question, but believe me, I did everything I could before asking you.

Please help me out and let me know how to interpret the data from the graph so that I can proceed further.

Thanks a ton,

Have a great day ahead!

-Ateeq Khaliq

rna-seq • 3.3k views
Brian Bushnell (14k), Walnut Creek, USA, wrote 3.0 years ago:

I absolutely agree that fastx tools should be avoided.  But in my testing, BBDuk greatly outperforms Trimmomatic (and Cutadapt, Skewer, and Trimgalore, which is just a wrapper) in speed, sensitivity, and specificity, for both quality- and adapter-trimming.  For quality-trimming it's provably superior (as it uses the optimal Phred algorithm, which is not used by any other current program, to my knowledge, other than seqtk), though for adapter-trimming the evidence is only empirical.  And (in my opinion) it's much easier to use than Trimmomatic, but that's subjective.  Also note that I am biased since I wrote BBDuk.
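To make the quality-trimming idea more concrete, here is a minimal sketch of maximal-scoring-segment trimming, the general approach behind Phred-style trimmers: each base scores (cutoff minus its error probability), and the read is cut down to the stretch with the highest total score. This is an illustration only, not BBDuk's actual code. The function name `phred_trim_sketch`, the default cutoff of 0.05, and the assumption of 4-line Phred+33 FASTQ records are all choices made for this example.

```shell
#!/bin/sh
# Illustrative max-scoring-segment quality trimmer (NOT the BBDuk implementation).
# Assumptions: 4-line FASTQ records, Phred+33 qualities; the function name and
# the default error-probability cutoff of 0.05 are hypothetical.
phred_trim_sketch() {
    # $1 = FASTQ file, $2 = per-base error-probability cutoff (optional)
    awk -v cutoff="${2:-0.05}" '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 0 {
        qual = $0
        best = 0; sum = 0; start = 1; best_start = 1; best_len = 0
        for (i = 1; i <= length(qual); i++) {
            q = ord[substr(qual, i, 1)] - 33
            p = exp(-q / 10 * log(10))   # Phred score -> error probability
            sum += cutoff - p            # good bases add, bad bases subtract
            if (sum < 0) {               # restart after a bad stretch
                sum = 0; start = i + 1
            } else if (sum > best) {     # remember the best segment so far
                best = sum; best_start = start; best_len = i - start + 1
            }
        }
        # A read with no acceptable segment is printed with empty sequence lines.
        print header
        print substr(seq, best_start, best_len)
        print "+"
        print substr(qual, best_start, best_len)
    }' "$1"
}
```

For example, `phred_trim_sketch SRR1604991_1.fastq 0.05 > trimmed_sketch_1.fastq` would write the trimmed reads to a new file. Note that it treats each file independently, so for real paired-end data a pair-aware trimmer is still the right tool.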

Also - BBDuk can replicate a lot of the statistics you get from FastQC, but not all of them, and not as conveniently (it outputs text, while FastQC outputs images).  So I also highly recommend FastQC as another great tool.
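As a rough sketch of what a paired-end BBDuk run with adapter trimming, quality trimming, and text histograms might look like: the wrapper name `bbduk_trim_pair`, the output file names, the `adapters.fa` path, and the `trimq=10` / `minlen=36` values are assumptions for illustration, so check them against the BBDuk documentation and your own installation before use.

```shell
#!/bin/sh
# Sketch of a paired-end BBDuk run: adapter trimming, quality trimming,
# and text histograms. File names, the adapters.fa path, and the trimq/minlen
# values are assumptions; adjust them to your data and BBTools installation.
# Set DRY_RUN=1 to print the command instead of running it.
bbduk_trim_pair() {
    r1=$1; r2=$2; prefix=$3
    ${DRY_RUN:+echo} bbduk.sh \
        in1="$r1" in2="$r2" \
        out1="${prefix}_1.trimmed.fastq" out2="${prefix}_2.trimmed.fastq" \
        ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo \
        qtrim=rl trimq=10 minlen=36 \
        qhist="${prefix}.qhist.txt" bhist="${prefix}.bhist.txt" lhist="${prefix}.lhist.txt"
}
```

A call such as `bbduk_trim_pair SRR1604991_1.fastq SRR1604991_2.fastq SRR1604991` would then trim both files together and write the quality, base-composition, and length histograms as text files next to the trimmed reads.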


Thanks a lot, Brian. I will try BBDuk as well!

Have a great day ahead!

Reply written 3.0 years ago by David_emir (200)

I will, thanks =)

Reply written 3.0 years ago by Brian Bushnell (14k)

I always forget to mention BBDuk...I really need to break that habit!

Reply written 3.0 years ago by Devon Ryan (73k)

Devon Ryan (73k), Freiburg, Germany, wrote 3.0 years ago:
  1. Just use FastQC and make your life easier. It can do all of the plotting and everything in a single step.
  2. Don't use fastx tools.

From the plots you showed, your reads won't need much of any trimming. Just put them through trimmomatic/trim_galore/skewer/etc. to adapter and quality trim them (don't aggressively quality trim...just remove bases from the ends below Q5 or so).
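The two-step workflow described here (one FastQC run, then one adapter and quality trimming pass over both files) could be sketched as follows, using Trim Galore as the example trimmer. The wrapper name `qc_and_trim_pair` and the output directory names are placeholders, and the sketch assumes `fastqc` and `trim_galore` are on your PATH.

```shell
#!/bin/sh
# Sketch: QC report plus gentle adapter/quality trimming for one read pair.
# Assumes fastqc and trim_galore are on PATH; the function name and output
# directory names are placeholders.
# Set DRY_RUN=1 to print the commands instead of running them.
qc_and_trim_pair() {
    r1=$1; r2=$2
    ${DRY_RUN:+echo} mkdir -p fastqc_raw trimmed
    # One FastQC call produces the per-base quality plots for both files.
    ${DRY_RUN:+echo} fastqc -o fastqc_raw "$r1" "$r2"
    # --quality 5 trims low-quality read ends with a Phred cutoff of 5;
    # --fastqc re-runs FastQC on the trimmed output.
    ${DRY_RUN:+echo} trim_galore --paired --quality 5 --fastqc -o trimmed "$r1" "$r2"
}
```

For the files in the question, that would be `qc_and_trim_pair SRR1604991_1.fastq SRR1604991_2.fastq`. Giving the trimmer both files at once keeps the read pairs in sync.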


Thanks a lot, Devon. This means a lot to me. But how do I determine the quality of the reads from the plot? It would be great if you could explain how to interpret it. When should I trim a read, judging by the plots?

Thanks a lot!

Reply written 3.0 years ago by David_emir (200)

You'll trim the entire file and not bother with reads individually (there could be tens of millions of them...that'd take a while). Plots like this won't determine whether you trim or not, as you should always adapter and quality trim. However, if a plot like this dips to 0 (or near it) in the middle then you know that a bubble went through the flow cell and you'll have slightly fewer alignments. Similarly, if the quality really plummets (going to 30 isn't plummeting, that's just a slight decrease) near the end, then you know the trimmer is going to remove more and you might lose some reads. Anyway, just give the trimmer (not one from fastx tools!) both files and the parameters for trimming and let it do its thing.
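If it helps to see the numbers behind such a plot, here is a small sketch that prints the mean Phred quality at each read position directly from a FASTQ file. It assumes 4-line records and Phred+33 encoding, and the function name `per_cycle_mean_quality` is made up for this example. It prints means, whereas the box plot shows medians and quartiles, so treat it as a rough companion to the plot rather than a replacement.

```shell
#!/bin/sh
# Per-cycle mean Phred quality straight from a FASTQ file.
# Assumptions: 4-line records, Phred+33 encoding; function name is illustrative.
per_cycle_mean_quality() {
    awk '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 0 {                       # every 4th line is a quality string
        n = length($0)
        if (n > maxlen) maxlen = n
        for (i = 1; i <= n; i++) {
            sum[i] += ord[substr($0, i, 1)] - 33
            cnt[i]++
        }
    }
    END {
        # position <TAB> mean quality at that position
        for (i = 1; i <= maxlen; i++) printf "%d\t%.1f\n", i, sum[i] / cnt[i]
    }' "$1"
}
```

Something like `per_cycle_mean_quality SRR1604991_1.fastq | awk '$2 < 20'` would list only the positions whose mean falls below Q20. An isolated low position in the middle of the read and a steady decline toward the end are the two patterns described above.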

Reply written 3.0 years ago by Devon Ryan (73k)

As Devon said, the FastQC plots will show you the quality of the reads more clearly. Try it :). There is also very well written documentation, with example FastQC output plots, here: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Reply written 3.0 years ago by iraun (3.2k)

Thanks, Airan!

Reply written 3.0 years ago by David_emir (200)

Thanks a lot, Devon. I will try that!

Reply written 3.0 years ago by David_emir (200)