combine/merge closet variants into one from mutect2
Entering edit mode
4 months ago
J.F.Jiang ▴ 910


Using GATK Mutect2, we collected several variants in EGFR region.

However, one variants NM_005228.5(EGFR):c.2239_2248delinsC (p.Leu747_Ala750delinsPro) were called as two separated ones. The two variants can not combined even we set the -max-mnp-distance to 20.

7 55242465 . GGAATTAAGA G . .
AS_SB_TABLE=5567,5732|44,36;DP=11794;ECNT=2;MBQ=20,20;MFRL=188,168;MMQ=60,60;MPOS=39;NALOD=3.85;NLOD=1755.82;POPAF=4.52;TLOD=256.20 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB
0|1:2957,80:0.027:3037:1304,28:1306,51:0|1:55242465_GGAATTAAGA_G:55242465:1402,1555,44,36 0|0:8342,0:1.710e-04:8342:3598,0:4195,0:0|1:55242465_GGAATTAAGA_G:55242465:4165,4177,0,0

7 55242478 . G C . .
AS_SB_TABLE=5360,5529|44,36;DP=11030;ECNT=2;MBQ=20,20;MFRL=189,168;MMQ=60,60;MPOS=43;NALOD=3.93;NLOD=1686.18;POPAF=4.52;TLOD=256.12 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB
0|1:2960,80:0.027:3040:1450,28:1427,50:0|1:55242465_GGAATTAAGA_G:55242465:1403,1557,44,36 0|0:7929,0:1.846e-04:7929:3522,0:4086,0:0|1:55242465_GGAATTAAGA_G:55242465:3957,3972,0,0

It will be glad if anyone could help.

Best, Junfeng

variants mutect2 merge gatk • 594 views
Entering edit mode

See if you can normalize the variants when passing it through

bcftools norm

but note that the variant is quite complex (10 deletions followed by a single insertion) and it could have different valid representations

I'd be curious to know whether bcftools norm would be able to resolve the issue

Entering edit mode

Hi Istvan,

Thanks for help.

bcftools norm did not work. Any other suggestions?

Entering edit mode

this is a tricky task because it basically asks one to merge neighboring variants that are technically correct even when listed separately. The challenge is not just merging the variants but updating all the other fields.

A quick search on "how to join neighboring variants in a VCF file" gave me this code:

I've asked the ChatGPT, as well it gave me the following:

Merging neighboring mutations in a VCF (Variant Call Format) file can be challenging because this format is designed to present each variant (mutation) individually. However, there are some ways to handle this task using Bioinformatics tools and scripting languages such as Python or R.

Here's a simplified Python script using the pysam library which can be used to merge consecutive single nucleotide polymorphisms (SNPs) from a VCF file. This script simply merges SNPs that are directly next to each other, and it might not cover all possible scenarios or more complex mutations.

import pysam

def merge_consecutive_snps(vcf_file):
    vcf_in = pysam.VariantFile(vcf_file, 'r')
    vcf_out = pysam.VariantFile('-', mode='w', header=vcf_in.header)

    prev_rec = None
    for rec in vcf_in.fetch():
        if prev_rec and rec.pos == prev_rec.pos + len(prev_rec.ref):
            prev_rec.ref += rec.ref
            prev_rec.alts = tuple([alt + rec_alt for alt, rec_alt in zip(prev_rec.alts, rec.alts)])
            if prev_rec:
            prev_rec = rec

    if prev_rec:



This script reads a VCF file and writes the merged variants to stdout. You should redirect the output to a file when running it. Note that you might need to install the pysam package first. You can do this with pip install pysam.

Please note that merging mutations is a complex topic and can have various interpretations depending on the biological or clinical context. This script only provides a very simplistic approach to the problem. If you have complex cases, such as multiple alternative alleles, insertions, deletions, or complex rearrangements, you might want to use more advanced tools designed for this purpose, like bcftools.

Moreover, it's essential to note that merging variants arbitrarily can lead to misinformation or loss of important details, especially if you are dealing with multiallelic sites or complex events. It is always a good idea to discuss with a team or supervisor what kind of merging is biologically appropriate for your specific case.

Entering edit mode

Hi Istavan,

code from gpt is not correct, since it did not consider distance between the adjacent SNVs, and whether those sequences were matched with reference or not.

Another online code neglect INDELs based on some reason.

Anyway, it inspired a bit to rewrite the code that can support INDELS.

Best, Junfeng

Entering edit mode

yes,this is a tricky subject, writing a generic tool for this purpose might be a publishable effort


Login before adding your answer.

Traffic: 1865 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6