Question

Pre-processing and quality control of raw NGS data: trimming

0

9.4 years ago

David_emir ▴ 490

Hello All,

I am working with paired-end RNA-Seq data (100 bp reads) from an Illumina HiSeq platform.

I am new to this and am working on it for my thesis. I am stuck at the QC trimming step: I cannot decide which part of the read length has low quality scores, so I cannot move on to the next step, trimming with fastx_trimmer.

I followed these steps:

Step 1: Download the SRA files and install the related software

Step 2: Convert SRA format to FASTQ format

Step 3: Filter the lower-quality data

3.1 Calculated per-position quality score statistics:

[root@BIO-DT-415 Excersise]# /usr/local/bin/fastx_quality_stats \
  -i SRR1604991_1.fastq \
  -o SRR1604991_1.txt \
  -Q33 && \
  /usr/local/bin/fastx_quality_stats \
  -i SRR1604991_2.fastq \
  -o SRR1604991_2.txt \
  -Q33

3.2 Box-plotted the quality scores:

[root@BIO-DT-415 Excersise]# fastq_quality_boxplot_graph.sh \
  -i SRR1604991_1.txt \
  -o SRR1604991_1.png \
  -t "RNA_SEQ_GRA_PLOT " \
  -Q33 && \
  fastq_quality_boxplot_graph.sh \
  -i SRR1604991_2.txt \
  -o SRR1604991_2.png \
  -t "RNA_SEQ_GRA_PLOT " \
  -Q33

3.3 Filtered the reads with fastq_quality_filter:

[root@BIO-DT-415 Excersise]# fastq_quality_filter \
  -q 20 \
  -p 80 \
  -i SRR1604991_1.fastq \
  -o SRR1604991_1_quality_filter.fastq \
  -Q33 && \
  fastq_quality_filter \
  -q 20 \
  -p 80 \
  -i SRR1604991_2.fastq \
  -o SRR1604991_2_quality_filter.fastq \
  -Q33

3.4 After this, I ran fastx_quality_stats on the filtered files to get per-position quality information as .txt:

[root@BIO-DT-415 Excersise]# /usr/local/bin/fastx_quality_stats \
  -i SRR1604991_1_quality_filter.fastq \
  -o SRR1604991_1_quality_filter.txt \
  -Q33 && \
  /usr/local/bin/fastx_quality_stats \
  -i SRR1604991_2_quality_filter.fastq \
  -o SRR1604991_2_quality_filter.txt \
  -Q33

3.5 I box-plotted the quality scores of the filtered reads using fastq_quality_boxplot_graph.sh:

[root@BIO-DT-415 Excersise]# fastq_quality_boxplot_graph.sh \
  -i SRR1604991_1_quality_filter.txt \
  -o SRR1604991_1_quality_filter.png \
  -t "RNA_SEQ_GRA_PLOT " \
  -Q33 && \
  fastq_quality_boxplot_graph.sh \
  -i SRR1604991_2_quality_filter.txt \
  -o SRR1604991_2_quality_filter.png \
  -t "RNA_SEQ_GRA_PLOT " \
  -Q33

Please have a look at the graph I have shared:

https://plus.google.com/+AteeqKhaliq/posts/gWQ8tzb9ytp?pid=6083369332084093970&oid=107068338849842971135

Now the real problem:

I tried to understand the output graph, but I was not able to determine which reads should be trimmed.

In SRR1604991_1_quality_filter.png, the sequencing quality drops sharply from about position 59 onward. Should I trim the low-quality part of the reads before running any command that filters out low-quality reads? I am using fastx_trimmer to do that. How do I determine from the box plot which part of the read length has low quality scores?

I know it's a very lengthy question, but believe me, I did everything I could on my end before posting it here.

Please help me out. Let me know how to interpret the data in the graph so that I can proceed further.

Thanks a Ton,

Have a great Day ahead !!!

-Ateeq Khaliq

RNA-Seq • 5.9k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by David_emir ▴ 490

Ram · Accepted Answer · 2014-11-19

3

9.4 years ago

Brian Bushnell 20k

I absolutely agree that fastx tools should be avoided. But in my testing, BBDuk greatly outperforms Trimmomatic (and Cutadapt, Skewer, and Trimgalore, which is just a wrapper) in speed, sensitivity, and specificity, for both quality- and adapter-trimming. For quality-trimming it's provably superior (as it uses the optimal Phred algorithm, which is not used by any other current program, to my knowledge, other than seqtk), though for adapter-trimming the evidence is only empirical. And (in my opinion) it's much easier to use than Trimmomatic, but that's subjective. Also note that I am biased since I wrote BBDuk.

Also, BBDuk can replicate a lot of the statistics you get from FastQC, but not all of them, and not as conveniently (it outputs text, while FastQC outputs images). So I also highly recommend FastQC as another great tool.
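For readers who want to try this, a BBDuk run that adapter-trims and quality-trims a read pair in one pass might look like the sketch below. Nothing in it comes from this thread except the sample name: the parameter values follow commonly used BBDuk settings, the adapter file path and output names are placeholders, and `trimq=10` is an assumed threshold to tune. The helper function `bbduk_pair` is hypothetical and only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Sketch only: build a BBDuk command that adapter-trims (ktrim=r) and
# quality-trims both read ends (qtrim=rl) for one paired-end sample.
# Assumptions: bbduk.sh is on PATH; ADAPTERS points at an adapter FASTA
# (the default below is a placeholder path, not a verified location).
ADAPTERS=${ADAPTERS:-adapters.fa}

bbduk_pair() {
    # $1 = sample prefix, e.g. SRR1604991 (expects ${1}_1.fastq and ${1}_2.fastq)
    echo bbduk.sh \
        in1="${1}_1.fastq" in2="${1}_2.fastq" \
        out1="${1}_1.trimmed.fastq" out2="${1}_2.trimmed.fastq" \
        ref="$ADAPTERS" ktrim=r k=23 mink=11 hdist=1 tpe tbo \
        qtrim=rl trimq=10
}

# Print the command for review; pipe the output to sh to actually run it.
bbduk_pair SRR1604991
```

Passing both files in one call keeps the two mates of each pair in sync, which separate per-file filtering does not guarantee.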

ADD COMMENT • link updated 4.5 years ago by Ram 43k • written 9.4 years ago by Brian Bushnell 20k

1

Thanks a lot, Brian, I will try using BBDuk as well!!!

Have a great day ahead !!!

ADD REPLY • link 9.4 years ago by David_emir ▴ 490

0

I will, thanks =)

ADD REPLY • link 9.4 years ago by Brian Bushnell 20k

1

I always forget to mention BBDuk...I really need to break that habit!

ADD REPLY • link 9.4 years ago by Devon Ryan 104k

score 2 · Accepted Answer · 2014-11-19

2

9.4 years ago

Devon Ryan 104k

Just use FastQC and make your life easier. It can do all of the plotting and everything in a single step.
Don't use fastx tools.

From the plots you showed, your reads won't need much of any trimming. Just put them through trimmomatic/trim_galore/skewer/etc. to adapter and quality trim them (don't aggressively quality trim...just remove bases from the ends below Q5 or so).
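To make "remove bases from the ends below Q5" concrete, here is a minimal sketch of that one operation in awk. It is an illustration only, not how any of the trimmers named above are implemented: it assumes 4-line FASTQ records with Phred+33 encoding, ignores adapters and read pairing, and the function name `trim_ends` is made up for this example.

```shell
#!/bin/sh
# Illustration only: strip bases whose Phred quality is below a threshold
# from both ends of each read. Assumes 4-line FASTQ records and Phred+33.
trim_ends() {
    # $1 = minimum quality to keep (e.g. 5); reads FASTQ on stdin, writes stdout
    awk -v minq="$1" '
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 0 {
        qual = $0
        start = 1
        end = length(qual)
        # walk inwards from each end while the base is below the threshold
        while (start <= end && ord[substr(qual, start, 1)] - 33 < minq) start++
        while (end >= start && ord[substr(qual, end, 1)] - 33 < minq) end--
        n = end - start + 1
        if (n > 0) {
            # reads trimmed to nothing are dropped
            print header
            print substr(seq, start, n)
            print "+"
            print substr(qual, start, n)
        }
    }'
}

# Tiny made-up read: "#" is Q2 and "I" is Q40, so one base is cut from each end.
printf '@read1\nACGTAC\n+\n#IIII#\n' | trim_ends 5
```

On the made-up read, the sequence `ACGTAC` becomes `CGTA`. With a threshold as low as Q5, only bases that are almost certainly wrong are removed, which is the point of not trimming aggressively.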

ADD COMMENT • link 9.4 years ago by Devon Ryan 104k

0

Thanks a lot, Devon, this means a lot to me. But how do I determine the quality of the reads from the plot? It would be great if you could explain how to interpret it. When should I trim a read (with respect to the plots)?

Thanks a lot ! !!

ADD REPLY • link 9.4 years ago by David_emir ▴ 490

1

You'll trim the entire file and not bother with reads individually (there could be tens of millions of them...that'd take a while). Plots like this won't determine whether you trim or not, as you should always adapter and quality trim. However, if a plot like this dips to 0 (or near it) in the middle, then you know that a bubble went through the flow cell and you'll have slightly fewer alignments. Similarly, if the quality really plummets near the end (going to 30 isn't plummeting, that's just a slight decrease), then you know the trimmer is going to remove more and you might lose some reads. Anyway, just give the trimmer (not one from fastx tools!) both files and the parameters for trimming and let it do its thing.
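If it helps to see the numbers behind such a plot, the sketch below computes the mean Phred quality at each read position straight from a FASTQ file. It is only an illustration under stated assumptions (4-line FASTQ records, Phred+33 encoding); the function name `mean_quality_by_position` is made up here, and FastQC reports medians and quartiles as well, which this does not.

```shell
#!/bin/sh
# Illustration only: mean Phred quality at each read position, one of the
# values a per-position quality plot summarises.
# Assumes 4-line FASTQ records and Phred+33 encoding.
mean_quality_by_position() {
    awk '
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 0 {
        len = length($0)
        if (len > maxlen) maxlen = len
        for (p = 1; p <= len; p++) {
            sum[p] += ord[substr($0, p, 1)] - 33
            count[p]++
        }
    }
    END {
        # one line per position: position, tab, mean quality
        for (p = 1; p <= maxlen; p++)
            printf "%d\t%.1f\n", p, sum[p] / count[p]
    }'
}

# Two made-up reads; position 3 averages (40 + 2) / 2 = 21.
printf '@a\nACG\n+\nIII\n@b\nACG\n+\nII#\n' | mean_quality_by_position
```

On real data this would be run as `mean_quality_by_position < SRR1604991_1.fastq`. A value near 0 in the middle of the read, or a steep fall well below 30 toward the end, corresponds to the two situations described above.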

ADD REPLY • link 9.4 years ago by Devon Ryan 104k

1

As Devon said, the FastQC plots will show you the quality of the reads more clearly. Try it :). There is also very well-written documentation, with examples of FastQC output plots, here: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

ADD REPLY • link 9.4 years ago by iraun 6.2k

0

Thanks Airan !!!

ADD REPLY • link 9.4 years ago by David_emir ▴ 490

0

Thanks a Lot Devon, I will try that !!!

ADD REPLY • link 9.4 years ago by David_emir ▴ 490