What is the range or ceiling of metrics like DP4, MQP, etc. to filter variants?
6.1 years ago
Dayna ▴ 50

Hi

Do you know the range of the following metrics? When a description says bigger is better, I don't know the ceiling, so I can't judge a value: if the maximum possible value is 1 then 0.9 is big, but if the maximum is 10 or 50 then 0.9 is low. I am sorry, I am very much a beginner.

##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=ICB,Number=1,Type=Float,Description="Inbreeding Coefficient Binomial test (bigger is better)">
##INFO=<ID=HOB,Number=1,Type=Float,Description="Bias in the number of HOMs number (smaller is better)">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">

Thanks

variant calling • 1.6k views

Some values don't exist here because this is not a GATK pipeline; it is samtools and bcftools, which report fields like DP4, etc. Yes, I understand GATK is better, but as a start and for benchmarking, I need to start with samtools.

6.1 years ago

This is a somewhat open-ended question that could make for a philosophical debate regarding infinities, etc.

Generally, you could say the following:

Metrics that are based on depth of coverage or read depth:

  • min = 0
  • max = roughly the target depth of coverage of the sequence run (number of cycles)

Metrics that are based on probabilities (P values):

  • min = 0
  • max = 1

Regarding the first class of metric (i.e. depth of coverage or read depth), in order to streamline the analysis, a variant caller will generally only look at the first 500-1000 reads that it finds at a position (which is biased, as I'm sure you're imagining right now).
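Because raw depth has no fixed ceiling, one way to put a count-based field such as DP4 on a 0-to-1 scale is to turn it into a fraction. The following is only a rough sketch: the helper names `parse_dp4` and `alt_fraction` and the example INFO string are made up for illustration, and the field order follows the DP4 header description quoted in the question.

```python
def parse_dp4(info):
    """Extract the four DP4 counts from a VCF INFO string.

    Order follows the header description: ref-forward, ref-reverse,
    alt-forward, alt-reverse.
    """
    for field in info.split(";"):
        if field.startswith("DP4="):
            values = tuple(int(x) for x in field[len("DP4="):].split(","))
            if len(values) != 4:
                raise ValueError("DP4 must contain exactly four counts")
            return values
    raise ValueError("no DP4 field in INFO string")


def alt_fraction(dp4):
    """Fraction of high-quality bases supporting the ALT allele (0 to 1)."""
    total = sum(dp4)
    if total == 0:
        return 0.0  # no high-quality bases at all; treated as 0 here by assumption
    return (dp4[2] + dp4[3]) / total


# Hypothetical INFO string, invented for this example.
example_info = "DP=30;VDB=0.52;DP4=10,8,6,6;MQ=60"
counts = parse_dp4(example_info)
print("high-quality depth:", sum(counts))
print("ALT fraction:", alt_fraction(counts))
```

The total depth still depends on the sequencing run, but the ALT fraction always lies between 0 and 1, so it can be compared across sites.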

Regarding the metrics based on probabilities, these may be represented as -10 times the log base 10 of the P value, i.e., Phred scores, in which case larger numbers signify a greater chance that we can reject the null hypothesis. The QUAL scores in a VCF, for example, are Phred scores.
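As a small worked example, assuming the usual Phred definition Q = -10 * log10(P), the conversion in both directions can be sketched as below. The function names are illustrative only. Note that a Phred-scaled value has a minimum of 0 (P = 1) but no fixed maximum, since Q grows without bound as P approaches 0.

```python
import math


def phred_to_prob(q):
    """Convert a Phred-scaled score Q to a probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def prob_to_phred(p):
    """Convert a probability P (0 < P <= 1) to a Phred-scaled score."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Each extra 10 Phred units divides the probability by 10.
for q in (10, 20, 30):
    print(q, phred_to_prob(q))
```

So a QUAL of 20 corresponds to a probability of about 0.01, and a QUAL of 30 to about 0.001.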

Kevin


Thanks a lot, Kevin. But this seems like fuzzy logic to me as a beginner: when I look at a number, I can't judge whether to discard or keep the variant, as there is no rule.


If you are a beginner at this but you have used a 'trusted' analysis pipeline to process the data, then (most likely), any variants that have failed a particular metric will have a value other than PASS in the FILTER column of the VCF.
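To see what a given VCF actually contains in that column, a quick tally can help. This is a minimal sketch that assumes a plain-text, tab-separated VCF in which FILTER is the seventh column; the function name `filter_counts`, the example records and the filter name `LowQual` are invented for illustration. In the VCF format, a `.` in FILTER means that no filters have been applied, which is different from PASS.

```python
from collections import Counter


def filter_counts(vcf_lines):
    """Count how often each FILTER value occurs among the VCF records."""
    counts = Counter()
    for line in vcf_lines:
        # Skip header lines (starting with '#') and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        counts[columns[6]] += 1  # FILTER is the 7th column (index 6)
    return counts


# Hypothetical records, made up for this example.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=30",
    "chr1\t200\t.\tC\tT\t8\tLowQual\tDP=4",
    "chr1\t300\t.\tG\tA\t40\t.\tDP=25",
]
for value, n in filter_counts(example).items():
    print(value, n)
```

If every record shows `.`, the pipeline has not applied any filters yet, so the absence of a failing label says nothing about variant quality.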


That's really helpful, Kevin. Thanks a lot.
