Question: Quality trimming by Phred score: how far is too far?
25 days ago by
robert.murphy30 wrote:

When quality trimming raw reads to a Phred score with one of the many tools out there (I prefer bbduk), there is often conflicting advice about what Q score one should trim to.

I have read many of Brian Bushnell's (the mind behind BBTools) posts and replies, and from what I gather there really is no right answer, as it depends on what you want to do.

Say, for example, I have some Illumina short reads and some PacBio long reads that I want to use for the following:

  1. Hybrid assembly
  2. Illumina only de novo assembly
  3. Pacbio only de novo assembly

I have previously been told that Q30 is a typical threshold to use. However, I have also read that anything above Q27 is unnecessary and potentially damaging to generating good assemblies. On the bbduk help pages on SEQanswers, Q10 is used in every example.

So is Q10 a good general starting point for quality trimming to get decent assemblies, or is this also too high?

How would one work this out on their own without needing to ask here?

Secondly, should we trim PacBio reads in either a hybrid assembly or a PacBio-only de novo assembly?
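
For reference when weighing the thresholds mentioned in the question (Q10, Q27, Q30): a Phred score Q corresponds to an estimated per-base error probability of P = 10^(-Q/10). A minimal Python sketch of that conversion follows; the function name is only illustrative.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to an estimated per-base error probability."""
    return 10 ** (-q / 10)


# Print the error rates implied by the thresholds discussed in the question.
for q in (10, 27, 30):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:.4f} (about 1 error in {round(1 / p)} bases)")
```

So Q10 tolerates roughly 1 error in 10 bases, while Q30 tolerates roughly 1 in 1000. Which of those is acceptable depends on the downstream tool and goal.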

25 days ago by
genomax92k wrote:

"I gather there really is no right answer as it depends on what you want to do."

That is correct. Every dataset is going to have overall characteristics that are dependent on the quality of the source DNA and libraries (and to some extent sequencing). I am not sure if those can be completely captured in a QC report. We have accumulated enough expertise over the years that quality of sequencing should be more or less out of the equation, if the libraries made were of good quality.

If you are working on a de novo genome, then it is appropriate to trim at a higher threshold (Q25 or above). If you have a reference genome available to compare against, then relaxing that threshold may be fine.

"...should we trim pacbio reads in either a hybrid assembly or pacbio de novo assembly?"

Definitely trim out barcodes or any other extraneous sequence. Someone else will have to chime in on specifics.

written 25 days ago by genomax

@genomax Thank you very much for the response. It makes sense to me that for de novo assembly one should trim to a higher Phred score. Would 30 be an acceptable middle ground?

For PacBio reads, should you also trim based on quality?

written 25 days ago by robert.murphy

Q30 should be a good compromise for de novo work. If your data has sequencing anomalies (e.g. regions of low scores in the middle of reads, perhaps due to a bubble in the lane), then using a windowed Q score average or filterbytile.sh may be safer. Ideally you should discard such data but ...
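
To illustrate what a windowed Q score average does, here is a minimal Python sketch. It is not BBDuk's or any other tool's actual algorithm; the function name, the window size of 4 and the Phred+33 encoding are assumptions made for illustration. It scans from the start of the read and cuts at the first window whose mean quality drops below the threshold.

```python
def window_trim(seq, qual, min_q=30, window=4, offset=33):
    """Trim a read at the first window whose mean Phred score is below min_q.

    Illustrative only: assumes Phred+33 quality strings and a fixed window size.
    """
    scores = [ord(c) - offset for c in qual]
    if len(scores) < window:
        # Read is shorter than one window: judge it by its overall mean.
        if scores and sum(scores) / len(scores) >= min_q:
            return seq, qual
        return "", ""
    for start in range(len(scores) - window + 1):
        mean_q = sum(scores[start:start + window]) / window
        if mean_q < min_q:
            # Keep everything before the first failing window.
            return seq[:start], qual[:start]
    return seq, qual


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
print(window_trim("ACGTACGTAC", "IIIIII####"))
```

Averaging over a window means a single low-scoring base inside an otherwise good stretch does not necessarily end the read, whereas a sustained drop does.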

I have not worked with PacBio data recently, but for the latest Sequel II data the following is noted:

Please note that raw data quality scores are the same for all bases of the Sequel raw data (PHRED 0 — ASCII !). PacBio came to the conclusion that computing the quality scores for the raw data was a waste of time. Apparently the quality scores for the raw data cannot be reliably computed (and consequently these were also ignored for RSII data pipelines).

modified 25 days ago • written 25 days ago by genomax
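
The "PHRED 0, ASCII !" remark above follows from the Phred+33 encoding commonly used in FASTQ files, where each quality character's ASCII code minus 33 gives the score. A small Python sketch, assuming Phred+33 (the function name is illustrative):

```python
def decode_phred33(qual):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(c) - 33 for c in qual]


print(decode_phred33("!!!!"))  # [0, 0, 0, 0]
print(decode_phred33("I5+!"))  # [40, 20, 10, 0]
```

With every base reported as Q0, a quality-based trim at any positive threshold carries no real information for such raw reads.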

Thank you very much. I have just tried a Q30 filtering run and got back some quite weird results when looking at FastQC.

Pre-filtering: [FastQC plot not preserved]

Post-filtering: [FastQC plot not preserved]

modified 25 days ago • written 25 days ago by robert.murphy

How so? Before filtering all of your reads were 150 bp (FastQC just plots them like that). After filtering you have a range of sizes. You will want to implement some length filtering to eliminate short reads since they may not be very useful for the assembly (which is the next step?).

written 25 days ago by genomax
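
A minimal sketch of the length filtering suggested above, in plain Python. It assumes uncompressed four-line FASTQ records; the function name and the cutoff used in the example are hypothetical placeholders, not recommendations. Trimming tools generally offer their own minimum-length option, which would normally be used instead.

```python
def filter_by_length(lines, min_len=50):
    """Yield four-line FASTQ records whose sequence has at least min_len bases.

    Illustrative only: reads all lines into memory and assumes strict
    four-line records (header, sequence, '+', quality).
    """
    records = [line.rstrip("\n") for line in lines]
    for i in range(0, len(records) - 3, 4):
        header, seq, plus, qual = records[i:i + 4]
        if len(seq) >= min_len:
            yield header, seq, plus, qual


example = ["@r1", "ACGTACGT", "+", "IIIIIIII",
           "@r2", "ACG", "+", "III"]
kept = list(filter_by_length(example, min_len=5))
print([rec[0] for rec in kept])  # ['@r1']
```

For paired-end data, both mates would need to be filtered together so the two files stay in sync.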

Ahh okay, so that is normal; I was not aware that was the case. Thank you! I had just assumed it discarded the whole read if it had low-quality bases in it.

written 25 days ago by robert.murphy

Is it unusual to get a warning on per base sequence content? I have approximately a 20% difference between AT and CT.

written 25 days ago by robert.murphy
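
To see what the per base sequence content check is looking at, the base fractions at each read position can be computed directly. The Python sketch below is only an illustration of the idea, not FastQC's implementation; the function name is hypothetical and reads are assumed to be plain strings of A, C, G and T.

```python
from collections import Counter


def per_base_content(reads):
    """Return, per position, the fraction of A, C, G and T among reads covering it."""
    max_len = max((len(r) for r in reads), default=0)
    content = []
    for pos in range(max_len):
        # Only reads long enough to cover this position contribute.
        counts = Counter(r[pos] for r in reads if len(r) > pos)
        total = sum(counts[b] for b in "ACGT")
        content.append({b: (counts[b] / total if total else 0.0) for b in "ACGT"})
    return content


reads = ["AC", "AG", "AT", "TC"]
for pos, fractions in enumerate(per_base_content(reads)):
    print(pos, fractions)
```

Comparing the A and T fractions, and the G and C fractions, at each position shows where in the read any imbalance sits, for example only in the first few bases or along the whole read.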