Question

Indel concordance from different VCF files

6.2 years ago

tejaswikoganti ▴ 70

Hello,

I am trying to compare SNPs and indels from two VCF files that come from the same sample but were run on different instruments.

SNPs are always at the same position, so tools like bedtools work well when you look for common variants in two VCF files. But indels might not always be called at the same start position. I was wondering if there are any standard ways to compare indels from two VCF files (and any tools that are helpful)?

Thanks! Teja

variantconcordance • 3.2k views

updated 6.2 years ago by Len Trigg ★ 1.6k • written 6.2 years ago by tejaswikoganti ▴ 70

score 1 · Answer 1 · 2018-02-06

Normalization, such as left-aligning indels, can help in simple cases, but it is not the state of the art for VCF comparison. This is a clear case where you should be using a haplotype-aware VCF comparison tool. As well as handling an indel that is placed at a different start position, these tools can deal with the more complex cases that arise (for example, when you have SNPs and indels in close proximity).
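To illustrate what "haplotype-aware" means here, the following is a toy Python sketch, not how vcfeval or hap.py are implemented. It replays each call set onto the reference and compares the resulting sequences, so two different representations of the same change compare equal. The function names `apply_variants` and `same_haplotype` and the tiny reference strings are made up for illustration, and the sketch assumes a single haploid sequence with non-overlapping variants (real tools also handle diploid genotypes and search over subsets of calls).

```python
def apply_variants(reference, variants):
    """Return the sequence obtained by applying non-overlapping variants.

    Each variant is a (pos, ref, alt) tuple with a 1-based pos, as in VCF.
    """
    pieces = []
    cursor = 0  # 0-based index of the next reference base still to be copied
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if start < cursor:
            raise ValueError("overlapping variants are not supported in this sketch")
        if reference[start:start + len(ref)] != ref:
            raise ValueError(f"REF allele does not match the reference at position {pos}")
        pieces.append(reference[cursor:start])  # unchanged bases before the variant
        pieces.append(alt)                      # the alternate allele itself
        cursor = start + len(ref)
    pieces.append(reference[cursor:])           # unchanged tail
    return "".join(pieces)


def same_haplotype(reference, calls_a, calls_b):
    """True if both call sets produce the same sequence when replayed."""
    return apply_variants(reference, calls_a) == apply_variants(reference, calls_b)


# Case 1: one "CA" repeat unit deleted, reported at two different positions.
repeat_ref = "GCACACACT"
print(same_haplotype(repeat_ref, [(1, "GCA", "G")], [(5, "ACA", "A")]))

# Case 2: a deletion next to a SNP versus a single complex substitution.
small_ref = "TCGA"
print(same_haplotype(small_ref, [(1, "TC", "T"), (3, "G", "T")], [(2, "CG", "T")]))
```

Both comparisons print `True`, even though a coordinate-based intersection of the records would report no shared variants in either case.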

I would recommend RTG Tools vcfeval or Illumina's hap.py, depending on what kind of results you are after. Using vcfeval directly is good if you want to do VCF intersection-type operations to find variants in common or only in one of the two call sets. hap.py is a good tool if you are more interested in performance metrics and benchmarking, stratified by region or variant type (and you can use vcfeval as the matching engine inside hap.py for slightly better comparisons than the built-in haplotype matching gives).

(disclaimer: I work for RTG)

score 0 · Answer 2 · 2018-02-06

You basically need to define the rules under which you'll call two indels concordant. Usually, you'll go with some sort of overlap criterion. You'll also want to be sure you left-normalize all of the input VCFs. From there, any tool that looks for intersections between genomic coordinates in different files, such as BEDTools, can be used to get your overlaps and measure concordance.
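As a rough Python sketch of such an overlap rule, assuming the records have already been left-normalized and parsed into `(chrom, pos, ref, alt)` tuples: the helper names (`span`, `overlaps`, `concordance`), the `slop` padding parameter, and the example records are all hypothetical. The span used here is simply the reference bases covered by the REF allele, which is only one of several reasonable choices.

```python
def span(variant, slop=0):
    """Closed 1-based reference interval covered by the REF allele, padded by slop."""
    chrom, pos, ref, _alt = variant
    return chrom, pos - slop, pos + len(ref) - 1 + slop


def overlaps(a, b):
    """True if two (chrom, start, end) closed intervals share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def concordance(calls_a, calls_b, slop=0):
    """Count how many calls in each set overlap at least one call in the other."""
    spans_a = [span(v, slop) for v in calls_a]
    spans_b = [span(v, slop) for v in calls_b]
    return {
        "a_matched": sum(any(overlaps(a, b) for b in spans_b) for a in spans_a),
        "a_total": len(spans_a),
        "b_matched": sum(any(overlaps(b, a) for a in spans_a) for b in spans_b),
        "b_total": len(spans_b),
    }


# Made-up indel calls from two instruments.
instrument_a = [("chr1", 100, "ATG", "A"), ("chr1", 500, "C", "CTT")]
instrument_b = [("chr1", 101, "TGC", "T"), ("chr1", 900, "GA", "G")]
print(concordance(instrument_a, instrument_b))
```

With these made-up records, one of the two calls in each set finds a partner. The all-pairs comparison is quadratic, so for real call sets an interval tool such as BEDTools is the practical choice.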

score 0 · Answer 3 · 2018-02-06

6.2 years ago

Chris Miller 22k

Try the GATK LeftAlignAndTrimVariants tool, which will at least normalize their positions and help improve concordance.
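For intuition about what left-aligning and trimming does, here is a minimal Python sketch of the general idea for a single biallelic record. It is not GATK's implementation; the function name `normalize` and the example reference are assumptions made for illustration, and multi-allelic records are not handled.

```python
def normalize(reference, pos, ref, alt):
    """Left-align and trim one biallelic variant; pos is 1-based as in VCF."""
    if ref == alt:
        raise ValueError("REF and ALT are identical, nothing to normalize")
    # Step 1: drop shared trailing bases; when an allele becomes empty,
    # extend both alleles one base to the left using the reference.
    while True:
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
        elif not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend past the start of the sequence")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
        else:
            break
    # Step 2: trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


reference = "GCACACACT"
# A deletion of one "CA" unit called in the middle of the repeat.
print(normalize(reference, 5, "ACA", "A"))
# The same deletion called at the start of the repeat with an extra padding base.
print(normalize(reference, 1, "GCAC", "GC"))
```

Both calls come back as `(1, 'GCA', 'G')`, so after normalization a position-based comparison would treat them as the same indel.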

score 0 · Answer 4 · 2018-02-06

6.2 years ago

tejaswikoganti ▴ 70

Thanks so much for your responses. After I left-normalize them, are there any standard rules that anyone has used before to decide when two indels overlap?

I think as long as your rules are reasonable it will be OK. A percent overlap works. You could also merge closely spaced indels that fall within a larger one. For instance, if program A calls a large indel of 100 bp and program B calls two different 30 bp indels close together, you might want to count that as one overlap rather than two.
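One way to sketch the merge-then-count idea in Python, assuming each indel has already been reduced to a closed `(start, end)` interval on one chromosome. The function names, the `max_gap` and `min_fraction` thresholds, and the coordinates are all made up for illustration; the overlap fraction is measured against the shorter interval, which is just one possible definition of "percent overlap".

```python
def merge_nearby(intervals, max_gap):
    """Merge closed (start, end) intervals separated by at most max_gap bases."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_fraction(a, b):
    """Shared bases divided by the length of the shorter interval."""
    shared = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if shared <= 0:
        return 0.0
    shorter = min(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return shared / shorter


def count_overlaps(calls_a, calls_b, min_fraction=0.5):
    """Number of intervals in calls_b that sufficiently overlap some call in calls_a."""
    return sum(
        any(overlap_fraction(a, b) >= min_fraction for a in calls_a)
        for b in calls_b
    )


program_a = [(1000, 1099)]                # one 100 bp indel
program_b = [(1010, 1039), (1045, 1074)]  # two 30 bp indels, 5 bases apart

print(count_overlaps(program_a, program_b))                    # unmerged
print(count_overlaps(program_a, merge_nearby(program_b, 10)))  # merged first
```

Without merging, both of program B's calls hit the large indel and the count is 2; merging calls within 10 bases first collapses them into one interval and the count is 1.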

6.2 years ago by DG 7.3k