How To Represent Two Different Indels At The Same Position In A Multisample Vcf?
10.5 years ago
Luca Beltrame ▴ 240

While working to get this issue fixed in VarScan, I'm attempting to generate (or rather correct from the original output) a VCF record for two samples, each with a different indel at the same position.

To make it simple, the situation is:

  • First reference base: C
  • Indel in sample 1: CAA -> C (loss of 2 bases)
  • Indel in sample 2: CA -> C (loss of 1 base)

I know from the data that this is likely an artifact (low-coverage region), but I still need to generate a proper record for it or my analysis pipeline will not work (GATK will complain about an invalid record; see the last post in the link for more details).

How would I go about representing this in a VCF? In particular, how should I represent the REF and ALT fields? Should I split this into two records, or keep everything in one?

Thanks!

vcf variant-calling sequencing

For now I'm assuming that the reference allele is the longest sequence (CAA), that sample 1 has C as its ALT allele, and that sample 2 has CA as its ALT allele (so ALT is C,CA). Am I going in the right direction?

That's also how I would read the VCF spec (namely, REF=CAA and ALT=CA,C).
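The reading above can be sketched in a few lines of Python. This is only an illustration of the idea (take the longest REF, then pad each shorter ALT with the reference bases its own REF did not cover); the function name `merge_same_pos_alleles` is hypothetical and not part of VarScan, GATK, or vcflib.

```python
def merge_same_pos_alleles(variants):
    """Merge (REF, ALT) pairs that start at the same position.

    Returns one REF and a list of ALTs expressed against that REF.
    Hypothetical helper for illustration only.
    """
    # The longest REF covers every shorter REF anchored at the same base.
    merged_ref = max((ref for ref, _ in variants), key=len)
    merged_alts = []
    for ref, alt in variants:
        if not merged_ref.startswith(ref):
            raise ValueError(f"REF {ref!r} is not a prefix of {merged_ref!r}")
        # Re-express the ALT against the longer REF by appending the
        # reference bases that the shorter REF did not include.
        new_alt = alt + merged_ref[len(ref):]
        if new_alt not in merged_alts:
            merged_alts.append(new_alt)
    return merged_ref, merged_alts


# Sample 1: CAA -> C (2-base deletion); sample 2: CA -> C (1-base deletion)
ref, alts = merge_same_pos_alleles([("CAA", "C"), ("CA", "C")])
print(ref, ",".join(alts))  # CAA C,CA
```

The 1-base deletion becomes CA rather than C because, against the longer REF, one of the two As is still present.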

What about just using a comma to separate all the possible variants in the ALT column?

But in one case the reference would be CAA, and in the other CA. In both cases the deletion is represented as C, but it is the affected reference sequence that changes.

In principle there should be only one reference allele. What is the sequence of the reference genome at NCBI for that position?

The problem is how to make it "proper" inside the VCF. The first base in the reference is C, followed by a stretch of As. So (see my comment below) it is in fact the REF field that should be written differently.

10.5 years ago
Erik Garrison ★ 2.4k

You can combine these variants using the vcfcreatemulti tool in vcflib:

<broken.vcf vcfcreatemulti >ok.vcf

However, this won't really handle the sample genotypes. These need to be recreated relative to each other. Ideally, this reconstruction should respect the underlying sequence reads.
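As a rough sketch of what "recreating the genotypes relative to each other" means at the index level, the snippet below renumbers GT allele indices from a per-sample record to a merged record. The function name `remap_genotype` and the `0/1` genotypes are hypothetical (the thread does not give the actual calls), and this naive renumbering does not consult the reads, so it is not a substitute for joint calling.

```python
def remap_genotype(gt, old_alts, new_alts):
    """Translate a GT string (e.g. '0/1') from the ALT numbering of a
    single-sample record to the ALT numbering of a merged record.

    old_alts must already be expressed against the merged REF.
    Hypothetical helper for illustration only.
    """
    sep = "|" if "|" in gt else "/"
    remapped = []
    for idx in gt.split(sep):
        if idx in (".", "0"):
            # Missing and reference calls keep their value.
            remapped.append(idx)
        else:
            allele = old_alts[int(idx) - 1]
            remapped.append(str(new_alts.index(allele) + 1))
    return sep.join(remapped)


merged_alts = ["C", "CA"]  # ALTs of the merged record with REF=CAA
# Assumed heterozygous calls; the real genotypes are not given in the thread.
sample1 = remap_genotype("0/1", ["C"], merged_alts)   # 2-base deletion
sample2 = remap_genotype("0/1", ["CA"], merged_alts)  # 1-base deletion
print(sample1, sample2)  # 0/1 0/2
```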

Another approach would be to use a variant detection method that calls the samples and overlapping alleles jointly. I don't know the details of your pipeline, so perhaps this is inapplicable.

I'll look into it, thanks. The issue here is a bug in VarScan, which has been reported. However, I'll see whether vcfcreatemulti can work as a stopgap solution.
