Question: How To Calculate Genotype Concordance Between Indel Call Sets?
William (4.4k, Europe) wrote 5.7 years ago:

How do people calculate genotype concordance between INDEL call sets?

I have an NGS-based and a BAC-based INDEL call set (in VCF), but I get higher false positive and false negative rates than those reported in some papers that used similar data and variant calling tools. (Those papers, of course, don't mention exactly how they calculated the genotype concordance for INDELS.)

When I inspect the discordant calls in IGV, many are in or close to repeat regions, and some NGS and BAC based INDEL calls are very close to each other or even overlap.

So far I have used GATK GenotypeConcordance, which only counts an exact position and allele match as a true positive.

Should I consider INDELS that match in position but have a different allele (length) as a true positive match?

Should I exclude INDELS within repeat regions from the genotype concordance calculation? Is there a way to normalize INDEL calls within repetitive regions (i.e., put them all in the same spot in the repetitive region)?

Should I exclude INDELS within the flanks of repeat regions (1 bp? 5 bp? 10 bp?) from the genotype concordance calculation?

Should I exclude INDELS within INDEL clusters (2 indel calls in 10 bp?) from the genotype concordance calculation?

Should I consider INDELS that have some overlap (1 bp, 5 bp, 10 bp) as a true positive match?

Should I consider INDELS that are close to each other (1 bp, 5 bp, 10 bp) as a true positive match?

Should I only look at INDELS up to 10 bp?

Is there a tool other than GATK GenotypeConcordance that I can use to calculate the genotype concordance between two INDEL call sets (one that perhaps also takes the points above into account)?

William (4.4k, Europe) wrote 5.7 years ago:

One approach is to left-normalize the indels and then compare them within windows of x bp:

For indel calls, initial agreement between SOAPindel, SAMtools and GATK was very low at 3.0% (see Additional file 1, Figure S8). Indel coordinates were subsequently left-normalized and intervalized using a total range of 20 genomic coordinates (10 bp in each direction of their genomic coordinates)

From: Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing http://genomemedicine.com/content/5/3/28
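
As a rough illustration of the two steps in the quoted passage, here is a minimal Python sketch, not the paper's actual code. The function names `left_normalize` and `match_within_window`, the tuple layout of a call, and the toy reference sequence are all assumptions made for this example. The normalization loop follows the commonly described trim-right / extend-left procedure; for real VCF files a dedicated tool (for instance `bcftools norm -f ref.fa`) is likely the safer choice. The matching step only pairs sites; comparing the genotypes of paired sites would be a further step.

```python
def left_normalize(pos, ref, alt, chrom_seq):
    """Shift an indel as far left as possible (1-based pos, VCF-style alleles).

    chrom_seq is the reference sequence of the chromosome as a plain string
    (hypothetical input for this sketch). Assumes ref != alt.
    """
    changed = True
    while changed:
        changed = False
        # Drop the last base while both alleles end with the same base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, prepend the preceding reference base.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of the first base")
            pos -= 1
            base = chrom_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def match_within_window(calls_a, calls_b, window=10):
    """Greedily pair calls (chrom, pos, ref, alt) whose positions differ by
    at most `window` bp on the same chromosome. Each call in calls_b is used
    at most once. Returns a list of (call_a, call_b) pairs."""
    unused = list(calls_b)
    pairs = []
    for a in calls_a:
        for b in unused:
            if a[0] == b[0] and abs(a[1] - b[1]) <= window:
                pairs.append((a, b))
                unused.remove(b)
                break
    return pairs


# Toy example: the same CA deletion in a (CA)4 repeat, reported at two
# different positions, collapses to one representation after normalization.
chrom_seq = "GGGCACACACAGGG"
print(left_normalize(9, "ACA", "A", chrom_seq))  # (3, 'GCA', 'G')
print(left_normalize(5, "ACA", "A", chrom_seq))  # (3, 'GCA', 'G')
```

Requiring only positional proximity is the most permissive choice; a stricter variant of this sketch could additionally require the same indel length (`len(alt) - len(ref)`) before counting a pair as a match.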
