What is the difference between norm --multiallelics -any versus --atomize?
10 months ago
Hello, forgive my ignorance-

Suppose input.vcf contains a complex multiallelic site.

What is the difference between

bcftools norm --multiallelics -any -f hg38.fa input.vcf

and

bcftools norm --atomize -f hg38.fa input.vcf

I understand what --multiallelics -any does, but I am not sure what --atomize does. The documentation says "Decompose complex variants, e.g. split MNVs into consecutive SNVs." I do not understand what this means for a multiallelic site.

If someone has a good example that would help clarify, that would be great.

Thanks in advance.

bcftools
9 months ago
I don't think --atomize is directly comparable to --multiallelics -any when it comes to multiallelic sites. You can see an example of atomization on a multiallelic site in the documentation for the --atom-overlaps option:

    # Before atomization:
    100  CC  C,GG   1/2

    # After:
    #   bcftools norm -a .
    100  C   G      ./1
    100  CC  C      1/.
    101  C   G      ./1

Splitting with --multiallelics -any would just give you 2 records (I can't tell offhand what the GT field would be):

100 CC GG
100 CC C

With --multiallelics -any, only the ALT field is split; REF and POS are altered only in certain cases. MNVs are not split into SNVs, so CC>GG remains CC>GG. With --atomize, I think MNVs are split as well, so you get two C>G entries (at 100 and 101) instead of one CC>GG entry. Note that this split would happen even if the record were not multiallelic.
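
If it helps, here is a toy Python sketch of the two operations as I understand them. This is not bcftools code: the function names split_multiallelic and atomize are made up for illustration, and genotypes, INFO fields, left-alignment and trimming are all ignored. It only shows how POS/REF/ALT would change for the site from the example above.

```python
def split_multiallelic(pos, ref, alts):
    """Toy version of splitting a multiallelic site: one record per ALT.

    POS and REF are kept as they are; MNVs stay whole.
    """
    return [(pos, ref, alt) for alt in alts]


def atomize(pos, ref, alts):
    """Toy version of atomization: break MNVs into single-base SNVs.

    Only same-length substitutions are decomposed here; any other
    allele (e.g. a deletion) is passed through unchanged.
    """
    records = []
    for alt in alts:
        if len(ref) > 1 and len(alt) == len(ref):
            # MNV: emit one SNV per mismatching base
            for offset, (r, a) in enumerate(zip(ref, alt)):
                if r != a:
                    records.append((pos + offset, r, a))
        else:
            records.append((pos, ref, alt))
    return sorted(records)  # order by position, then REF, then ALT


# The site from the documentation example: 100  CC  C,GG
print(split_multiallelic(100, "CC", ["C", "GG"]))
# [(100, 'CC', 'C'), (100, 'CC', 'GG')]
print(atomize(100, "CC", ["C", "GG"]))
# [(100, 'C', 'G'), (100, 'CC', 'C'), (101, 'C', 'G')]

# A biallelic MNV is still decomposed by the toy atomize
print(atomize(100, "CC", ["GG"]))
# [(100, 'C', 'G'), (101, 'C', 'G')]
```

The last call illustrates the point that the MNV split does not depend on the site being multiallelic.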

Side note: I wonder if they meant bcftools norm -a --atom-overlaps . and not bcftools norm -a ., but that's not today's problem.


