Question: Determining inhouse cut offs
3.5 years ago, skbrimer (United States) wrote:

Hello group,

My boss wants a depth-of-coverage vs quality-of-data table (i.e. Q10 = 100x, Q20 = 50x, Q30 = 25x, etc.), and I'm not sure how to produce one, since for so much of what we do the answer is "it depends". I need someone to explain it like I'm 5 (also my favorite reddit channel).

From my understanding of Phred scores, Q10 means a 0.9 chance of any base in a read being correct. That should mean that in a 100 bp read with a mean Phred score of 10, I could expect about 10 bases to be incorrect. However, the odds of an error landing at the same position in multiple independent reads shrink exponentially. That would imply I could have just a few reads covering a region with a mean Phred score of 10 and still accurately call a SNP with as little as 3x coverage.

Using the following:

P = 0.9, the probability of any single base call being correct

Po = (1 - P)^n, the probability that all n observations are wrong, where n = the number of observations

so for 1 observation Po = 0.1, for 2 observations Po = 0.01, for 3 observations Po = 0.001, etc.
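That arithmetic can be sketched in a few lines of Python (P and n are the quantities just defined):

```python
# Probability that a single base call is correct at Q10
P = 0.9

# Po = (1 - P)^n: probability that ALL n independent observations
# of the same position are wrong
for n in range(1, 4):
    Po = (1 - P) ** n
    print(f"{n} observation(s): P(all wrong) = {Po:.4g}")
```

This reproduces the 0.1 / 0.01 / 0.001 series above, under the (strong) assumption that errors in overlapping reads are independent.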

This doesn't seem to jibe with current practice, and I'm not sure what I'm missing. Can someone point me to a good reference or explain where I've gone wrong? I would really appreciate it.

Thanks, Sean

Tags: coverage, bioinformatics

Your VCF should contain variant quality score as well as depth of coverage. If you plot both of those, you should see some correlation. You could also see at what coverage the quality scores become reasonable (which will depend on the caller).

A similar experiment would be to split the FASTQ into two halves and call variants in each half independently. Then compare the calls between the two halves as a function of depth of coverage: agreement should be poor at low coverage and high at high coverage. Determine where that border is.
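The splitting step could look like this minimal sketch (reads are held in memory here for illustration; a real run would stream input and output files, then push each half through the same variant-calling pipeline):

```python
# A FASTQ record is 4 lines: @name, sequence, '+', qualities.
# Split by alternating records into two halves. The reads below
# are made up for illustration.
fastq = """\
@read1
ACGTACGT
+
IIIIIIII
@read2
TTTTACGT
+
IIIIFFFF
""".splitlines(keepends=True)

half_a, half_b = [], []
for i in range(0, len(fastq), 4):
    record = "".join(fastq[i:i + 4])
    (half_a if (i // 4) % 2 == 0 else half_b).append(record)

print("half A:", len(half_a), "read(s); half B:", len(half_b), "read(s)")
```

Alternating records keeps the two halves at roughly equal coverage; for paired-end data, both mates of a pair would need to go to the same half.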

There are more complicated approaches if you want to turn this into a paper, but this might be enough to answer your actual question.

written 3.5 years ago by igor

Great idea! Thank you, I will try them and see how they shake out!

written 3.5 years ago by skbrimer

Not entirely what you are looking for, but the authors of this paper investigated required coverage vs the chance of detecting variants:

written 3.5 years ago by WouterDeCoster

Thank you for the link! I will have to read it a few times for it all to sink in but this is helpful!

written 3.5 years ago by skbrimer

It's an interesting article to have as an argument for requiring a certain coverage, instead of people just choosing a cut-off of e.g. 20x without any reason.

In addition, coverage will not always scale linearly with the likelihood of a correct variant call, for example in the presence of elements such as a short tandem repeat, a homopolymer, or a segmental duplication with a paralogous sequence variant.

The coverage of a position is only a rough proxy for the likelihood of correct variant identification; the variant quality score (among other parameters such as strand bias) already takes coverage into account.

It's important to make a clear distinction between base quality, variant quality, and mapping quality.

written 3.5 years ago by WouterDeCoster

These are excellent points, thank you for the insight!

written 3.5 years ago by skbrimer


Powered by Biostar version 2.3.0