How To Calculate Genotype Concordance Between Indel Call Sets?
10.8 years ago
William ★ 5.3k

How do people calculate genotype concordance between INDEL call sets?

I have an NGS-based and a BAC-based INDEL call set (both in VCF), but I get higher false positive and false negative rates than reported in some papers that used similar data and variant calling tools. (Those papers, of course, don't say exactly how they calculated genotype concordance for INDELS.)

When I inspect the discordant calls in IGV, many are in or close to repeat regions, and some NGS-based and BAC-based INDEL calls are very close to each other or even overlap.

So far I have used GATK GenotypeConcordance, which only counts an exact position match and allele match as a true positive.

Should I consider INDELS that match in position but have a different allele (length) as a true positive match?

Should I exclude INDELS within repeat regions from the genotype concordance calculation? Is there a way to normalize INDEL calls within repetitive regions (i.e. put them all at the same position within the repeat)?
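To illustrate what such a normalization could look like, here is a minimal sketch of left-alignment for a biallelic indel. The function name `left_align` and the toy reference sequence are hypothetical; in practice an established tool such as `bcftools norm`, `vt normalize` or GATK LeftAlignAndTrimVariants would normally be used on both call sets before comparing them.

```python
def left_align(ref_seq, pos, ref, alt):
    """Shift a biallelic indel to its leftmost equivalent position.

    ref_seq: reference sequence of the chromosome as a string
    pos:     1-based VCF POS of the variant
    ref/alt: VCF REF and ALT alleles
    Returns (pos, ref, alt) of the left-aligned, trimmed variant.
    """
    ref, alt = ref.upper(), alt.upper()
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        # (At pos 1 no extension is possible; that edge case is not handled.)
        if (not ref or not alt) and pos > 1:
            pos -= 1
            base = ref_seq[pos - 1].upper()
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference with a CA repeat: positions 3-8 are "CACACA".
toy_ref = "GGCACACAT"
# The same 2 bp deletion reported at two different places in the repeat.
call_a = left_align(toy_ref, 6, "ACA", "A")
call_b = left_align(toy_ref, 4, "ACA", "A")
```

Both calls end up with the same position and alleles, so an exact-match comparison would then treat them as the same variant.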

Should I exclude INDELS within the flanks of repeat regions (1 bp? 5 bp? 10 bp?) from the genotype concordance calculation?

Should I exclude INDELS within INDEL clusters (e.g. 2 indel calls within 10 bp) from the genotype concordance calculation?
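As a rough sketch of what a cluster filter could look like, the snippet below flags indel positions on one chromosome that have another indel within a given distance. The function name `clustered_indels` and the 10 bp default are illustrative assumptions, not an established convention.

```python
def clustered_indels(positions, window=10):
    """Return the set of positions that have a neighbour within `window` bp.

    positions: indel start positions on a single chromosome.
    """
    pos = sorted(positions)
    flagged = set()
    # After sorting, only adjacent positions need to be compared.
    for left, right in zip(pos, pos[1:]):
        if right - left <= window:
            flagged.update((left, right))
    return flagged


example_positions = [100, 105, 200, 350, 358]
in_clusters = clustered_indels(example_positions)
```

The flagged positions could then be excluded from both call sets, or reported as a separate stratum, before computing concordance.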

Should I consider INDELS that have some overlap (1 bp, 5 bp, 10 bp) as a true positive match?

Should I consider INDELS that are close to each other (within 1 bp, 5 bp, 10 bp) as a true positive match?

Should I only look at INDELS up to 10 bp?

Is there a tool other than GATK GenotypeConcordance that I can use to calculate the genotype concordance between two INDEL call sets (one that ideally also takes the points above into account)?

indel genotype vcf • 4.4k views
10.8 years ago
William ★ 5.3k

One approach that has been used is to left-normalize the indels and then match them within a window of x bp:

For indel calls, initial agreement between SOAPindel, SAMtools and GATK was very low at 3.0% (see Additional file 1, Figure S8). Indel coordinates were subsequently left-normalized and intervalized using a total range of 20 genomic coordinates (10 bp in each direction of their genomic coordinates)

From: Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing http://genomemedicine.com/content/5/3/28
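A window-based comparison along these lines could be sketched as follows. This is only an illustration under stated assumptions: calls are reduced to `(chrom, pos)` tuples that have already been left-normalized, matching is greedy and one-to-one, and only site agreement is counted, not alleles or genotypes. The function name `window_concordance` is hypothetical and is not taken from the paper.

```python
from bisect import bisect_left
from collections import defaultdict


def window_concordance(test_calls, truth_calls, window=10):
    """Count (TP, FP, FN) when calls within `window` bp count as a match.

    test_calls / truth_calls: iterables of (chrom, pos) tuples.
    Each truth call can be matched by at most one test call.
    """
    truth_by_chrom = defaultdict(list)
    for chrom, pos in truth_calls:
        truth_by_chrom[chrom].append(pos)
    for positions in truth_by_chrom.values():
        positions.sort()

    used = set()  # (chrom, index) of truth calls that are already matched
    tp = fp = 0
    for chrom, pos in sorted(test_calls):
        positions = truth_by_chrom.get(chrom, [])
        # First truth call at or after the left edge of the window.
        i = bisect_left(positions, pos - window)
        match = None
        while i < len(positions) and positions[i] <= pos + window:
            if (chrom, i) not in used:
                match = i
                break
            i += 1
        if match is None:
            fp += 1
        else:
            used.add((chrom, match))
            tp += 1
    fn = sum(len(p) for p in truth_by_chrom.values()) - len(used)
    return tp, fp, fn


ngs_calls = [("1", 100), ("1", 500), ("2", 50)]
bac_calls = [("1", 108), ("1", 300), ("2", 61)]
counts = window_concordance(ngs_calls, bac_calls, window=10)
```

Here the call at 1:100 matches 1:108, while 2:50 and 2:61 are 11 bp apart and fall just outside the window. Varying `window` (for example 1, 5 and 10 bp) shows how sensitive the concordance figures are to the matching rule.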
