How To Represent Two Different Indels At The Same Position In A Multisample VCF?
9.0 years ago
Luca Beltrame ▴ 240

While working to get this issue fixed in VarScan, I'm attempting to generate (or rather correct from the original output) a VCF record for two samples, each with a different indel at the same position.

To make it simple, the situation is:

• First reference base: C
• Indel in sample 1: CAA -> C (loss of 2 bases)
• Indel in sample 2: CA -> C (loss of 1 base)

I know from the data that this is likely an artifact (a low-coverage region), but I still need to generate a valid record for it or my analysis pipeline will not work (GATK complains about an invalid record; see the last post in the link for more details).

How would I go about representing this in a VCF? In particular, how should I represent the REF and ALT fields? Should I split this into two records, or keep everything in one?

Thanks!

vcf variant-calling sequencing • 3.4k views

For now I'm assuming that the reference allele is the longest sequence (CAA), that sample 1 has C as its ALT allele, and that sample 2 has CA as its ALT allele (so ALT is C,CA). Am I going in the right direction?


That's how I would also read the VCF spec (namely, REF=CAA and ALT=CA,C).
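To make the reasoning above concrete, here is a minimal Python sketch of how the two per-sample indels could be rewritten against one shared REF. The function name `merge_alleles` and the (REF, ALT) tuple input are made up for illustration; this is not part of VarScan, GATK, or vcflib, and it assumes every record is anchored at the same position, so that each REF is a prefix of the longest one.

```python
def merge_alleles(variants):
    """Merge (ref, alt) pairs anchored at the same position onto one shared REF.

    Hypothetical helper: assumes every REF is a prefix of the longest REF.
    """
    # The shared REF is the longest REF allele seen at this position.
    shared_ref = max((ref for ref, _ in variants), key=len)
    alts = []
    for ref, alt in variants:
        if not shared_ref.startswith(ref):
            raise ValueError(f"REF {ref!r} is not a prefix of {shared_ref!r}")
        # Pad the ALT with the reference bases its own REF did not cover.
        padded = alt + shared_ref[len(ref):]
        if padded != shared_ref and padded not in alts:
            alts.append(padded)
    return shared_ref, alts


# Sample 1: CAA -> C, sample 2: CA -> C
ref, alts = merge_alleles([("CAA", "C"), ("CA", "C")])
print(ref, ",".join(alts))  # CAA C,CA
```

The order of the ALT alleles only follows the input order here; what matters is that the genotype indices in the sample columns agree with it.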


What about just using a comma to separate all the possible variants in the ALT column?


But in one case the reference would be CAA, and in the other CA. In both cases the deletion is represented as C, but it is the affected reference sequence that changes.


In principle there should be only one reference allele. What is the sequence of the reference genome at NCBI for that position?


The problem is how to make it "proper" inside the VCF. The first base in the reference is C, followed by a stretch of As. So (see my comment below) it is in fact the REF field that should be written differently.

9.0 years ago
Erik Garrison ★ 2.3k

You can combine these variants using the vcfcreatemulti tool in vcflib:

<broken.vcf vcfcreatemulti >ok.vcf


However, this won't really handle the sample genotypes. These need to be recreated relative to each other. Ideally, this reconstruction should respect the underlying sequence reads.
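As a rough illustration of what "recreated relative to each other" means at the level of allele indices, here is a small Python sketch that rewrites a GT string with an old-index to new-index mapping. The helper `remap_genotype` and the example genotype `0/1` are assumptions made for illustration. It only renumbers alleles and does not look at the reads, so it is no substitute for recalling the genotypes.

```python
import re


def remap_genotype(gt, allele_map):
    """Rewrite a GT string such as '0/1' using an old-index -> new-index map.

    Hypothetical helper: missing alleles ('.') and phasing are left untouched.
    """
    # Split on the separators while keeping them ('/' unphased, '|' phased).
    parts = re.split(r"([/|])", gt)
    out = []
    for part in parts:
        if part in ("/", "|", "."):
            out.append(part)
        else:
            out.append(str(allele_map[int(part)]))
    return "".join(out)


# Merged record from the question: REF=CAA, ALT=C,CA
merged_alts = ["C", "CA"]
# Sample 2's own record was CA -> C, which is 'CA' on the shared REF 'CAA'.
sample2_map = {0: 0, 1: merged_alts.index("CA") + 1}
print(remap_genotype("0/1", sample2_map))  # 0/2
```

Depth and likelihood fields that are listed per allele would need the same reordering, which is one more reason to prefer joint calling where the pipeline allows it.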

Another approach would be to use a variant detection method that calls the samples and overlapping alleles jointly. I don't know the details of your pipeline, so perhaps this is inapplicable.


I'll look into it, thanks. The issue here is a bug in VarScan, which has been reported. However, I'll see whether vcfcreatemulti can work as a stopgap solution.