Question: Indel concordance from different VCF files
tejaswikoganti (United States) wrote 3 months ago:

Hello,

I am trying to compare SNPs and indels from two VCF files that come from the same sample but were sequenced on different instruments.

SNPs are always reported at the same position, so tools like bedtools work well for finding variants common to two VCF files. Indels, however, might not always be called at the same start position. Are there any standard ways to compare indels between two VCF files, and any tools that help with this?

Thanks! Teja

Dan Gaston (Canada) wrote 3 months ago:

You basically need to define the rules you'll use to decide when two indels are concordant. Usually, you'll go with some sort of overlap criterion. You'll also want to be sure you left-normalize all of the input VCFs. From there, any tool that looks for intersections between genomic coordinates in different files, such as BEDTools, can be used to get your overlaps and measure concordance.

Chris Miller (Washington University in St. Louis, MO) wrote 3 months ago:

Try the GATK LeftAlignAndTrimVariants tool, which will at least normalize the indel positions and help improve concordance.
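
To make it concrete what left-aligning and trimming does, here is a minimal Python sketch of the usual normalization idea for a single biallelic variant. It is not GATK's implementation; the function name `normalize` and the in-memory reference string are assumptions made for illustration, and a real tool would read the reference from an indexed FASTA and handle multiallelic records.

```python
def normalize(ref_seq, pos, ref, alt):
    """Left-align and trim one biallelic variant (illustrative sketch).

    ref_seq: reference sequence as a plain string (hypothetical, in memory)
    pos:     1-based position of the first base of `ref`, as in VCF
    Returns the normalized (pos, ref, alt).
    Variants that shift all the way to position 1 are not handled specially.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("REF and ALT are identical; nothing to normalize")

    # Step 1: drop shared trailing bases; when an allele becomes empty,
    # extend both alleles one base to the left using the reference.
    while True:
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
        elif (not ref or not alt) and pos > 1:
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
        else:
            break

    # Step 2: drop shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    return pos, ref, alt


# The same 2 bp deletion in a CA repeat, placed differently by two callers:
reference = "TGCACACAG"
print(normalize(reference, 6, "ACA", "A"))  # (2, 'GCA', 'G')
print(normalize(reference, 4, "ACA", "A"))  # (2, 'GCA', 'G')
```

After normalization both calls have the same position and alleles, so a simple position-based comparison will treat them as the same variant.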

tejaswikoganti (United States) wrote 3 months ago:

Thanks so much for your responses. After I left-normalize them, are there any standard rules that people have used before to decide when two indels overlap?

I think as long as your rules are reasonable it will be OK. A percent-overlap threshold works. You could also merge closely spaced indels that fall within a larger one. For instance, if program A calls one large 100 bp indel and program B calls two separate 30 bp indels close together, you might want to count that as one overlap rather than two.
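
As a rough illustration of these two rules (a percent-overlap threshold, and merging nearby calls before comparing), here is a small Python sketch. The function names, the 50% threshold, and the 10 bp gap are all assumptions chosen for the example, not established standards. Intervals are half-open `(start, end)` tuples on one chromosome, so a pure insertion would need to be given a nominal 1 bp span to be matched at all.

```python
def overlap_fraction(a, b):
    """Shared length divided by the length of the shorter interval."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    if shared <= 0:
        return 0.0
    return shared / min(a[1] - a[0], b[1] - b[0])


def merge_nearby(intervals, max_gap):
    """Merge intervals separated by at most `max_gap` bases."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def count_concordant(calls_a, calls_b, min_fraction=0.5, max_gap=10):
    """Count calls in A that overlap some (merged) call in B well enough."""
    merged_b = merge_nearby(calls_b, max_gap)
    return sum(
        any(overlap_fraction(a, b) >= min_fraction for b in merged_b)
        for a in calls_a
    )


# Program A: one 100 bp indel. Program B: two 30 bp indels 10 bp apart.
program_a = [(1000, 1100)]
program_b = [(1010, 1040), (1050, 1080)]
print(count_concordant(program_a, program_b))  # 1
```

Dividing by the shorter interval is one choice among several; a reciprocal overlap (requiring the fraction to hold for both intervals) is stricter and would score this example lower.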

Reply written 3 months ago by Dan Gaston

Len Trigg (New Zealand) wrote 3 months ago:

Normalization such as left-aligning indels can help in simple cases, but it is not the state of the art for VCF comparison. This is a clear case where you should be using a haplotype-aware VCF comparison tool. As well as dealing with situations where an indel is placed at a different start position, these tools can also deal with the more complex cases that arise (for example, when you have SNPs and indels in close proximity).
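
The idea behind haplotype-aware comparison can be sketched in a few lines of Python: replay each call set onto the reference and compare the resulting sequences instead of the individual records. This is only a toy illustration of the principle, not how vcfeval or hap.py are implemented; the function names are hypothetical, and the sketch ignores genotypes, phasing, and the search over which subset of variants to include.

```python
def apply_variants(ref_seq, variants):
    """Build the haplotype produced by non-overlapping (pos, ref, alt) calls.

    pos is 1-based, as in VCF; ref_seq is a plain in-memory string.
    """
    pieces = []
    cursor = 0  # 0-based index of the next unconsumed reference base
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if start < cursor:
            raise ValueError("overlapping variants are not supported")
        if ref_seq[start:start + len(ref)] != ref:
            raise ValueError("REF allele does not match the reference")
        pieces.append(ref_seq[cursor:start])  # unchanged reference bases
        pieces.append(alt)                    # substituted allele
        cursor = start + len(ref)
    pieces.append(ref_seq[cursor:])
    return "".join(pieces)


def same_haplotype(ref_seq, calls_a, calls_b):
    """True when both call sets yield the same sequence."""
    return apply_variants(ref_seq, calls_a) == apply_variants(ref_seq, calls_b)


reference = "ACGTACGT"
caller_a = [(3, "G", "T"), (5, "AC", "A")]  # a SNP plus a nearby deletion
caller_b = [(3, "GTAC", "TTA")]             # one complex record
print(same_haplotype(reference, caller_a, caller_b))  # True
```

Here the two call sets share no record in common, and left-aligning or trimming the individual records does not make them match, yet both describe the same sequence (`ACTTAGT`).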

I would recommend RTG Tools vcfeval or Illumina's hap.py tool, depending on what kind of results you are after. Using vcfeval directly is good if you want to do VCF intersection-type operations to find variants that are in common or present in only one of the two call sets. hap.py is a good tool if you are more interested in performance metrics and benchmarking, stratified by region or variant type (and you can use vcfeval as the matching engine inside hap.py for slightly better comparisons than the built-in haplotype matching).

(disclaimer: I work for RTG)
