Locating Multiple Alternate Alleles in gatk.vcf file
3.6 years ago
oars

#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878


When reading the output, is there a simple way to scan the data visually and see which sites have multiple alternate alleles? I'm also looking for a simple Linux tool to extract the sites with multiple alternate alleles from the VCF file. I found this one as a possibility (how to remove multiallelic from VCF)...

awk '/#/{print;next}{if($5 !~ /,/ && length($5)==1 && length($4)==1){print}}' file.vcf

But I cannot figure out the syntax error:

awk: cmd. line:1: /#/{print;next}{if($5 !~ /,/ && length($5)==1 && length($4)==1){print}}SRR1611183.gatk.vcf
awk: cmd. line:1:                                                                                  ^ syntax error

The link is dead. Can you repost it? I'd like to know how to do this using regular bash commands.

I have fixed it. Can you try again, James?

3.6 years ago

To just output multi-allelic sites, use:

bcftools view --min-alleles 3 --max-alleles 8 MyVariants.vcf


The maximum can be any value but, for multi-allelic sites, --min-alleles has to be at least 3, because the count includes the REF allele as well as the ALT alleles.
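To illustrate how alleles are counted, here is a minimal awk sketch that reports the number of alleles per site as REF plus each comma-separated ALT entry. It assumes an uncompressed, tab-delimited VCF; the file name demo.vcf, its second record, and the function name count_alleles are made up for illustration and are not part of bcftools.

```shell
# Build a tiny, hypothetical two-record VCF to try this on.
printf '#CHROM\tPOS\tID\tREF\tALT\n' > demo.vcf
printf '1\t1581713\trs76922129\tA\tC,G\n' >> demo.vcf
printf '1\t1600000\t.\tT\tC\n' >> demo.vcf

# Print CHROM, POS and the allele count for every non-header record.
# An ALT of "." (no alternate allele) contributes zero ALT alleles.
count_alleles() {
    awk -F'\t' '!/^#/ {
        n = ($5 == "." ? 0 : split($5, alts, ","))
        print $1, $2, n + 1
    }' "$1"
}

count_alleles demo.vcf
# prints:
# 1 1581713 3
# 1 1600000 2
```

Only the first record reaches 3 alleles, so it is the only one that a minimum of 3 would keep.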

---------------------

If you then want to split multi-allelic calls into separate records, use:

bcftools norm -m-any MyVariants.vcf


Kevin

Thanks Kevin! This worked. Now I'm trying to understand where in the file I can see, visually, which sites are multi-allelic.

Edit: I think I might understand this now - looking at column 5 (ALT), the records with comma-separated alleles are the sites with multiple alternate alleles?

Hello friend. Yes, I was not sure what you meant by 'visually'.

I've just pulled this example from a 'random' file on my own computer:

#CHROM  POS     ID          REF ALT
1       1581713 rs76922129  A   C,G


Indeed, column #5 is ALT and is where you'll see multi-alleles being reported. In this case, the total number of alleles at this position is 3, including the ref (A, C, G).
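For anyone who wants to do this check with regular shell tools, the comma test on column 5 can be sketched in awk as below. This is a rough illustration rather than a replacement for bcftools: it assumes an uncompressed, tab-delimited VCF, and the file name demo_sites.vcf, its extra records, and the function name multiallelic_sites are hypothetical.

```shell
# Hypothetical three-record VCF, for illustration only.
printf '##fileformat=VCFv4.2\n' > demo_sites.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\n' >> demo_sites.vcf
printf '1\t1581713\trs76922129\tA\tC,G\n' >> demo_sites.vcf
printf '1\t1600000\t.\tT\tC\n' >> demo_sites.vcf
printf '2\t500\t.\tG\tGA,GAA\n' >> demo_sites.vcf

# Keep header lines, plus any record whose ALT column (field 5) contains a comma.
multiallelic_sites() {
    awk -F'\t' '/^#/ { print; next } $5 ~ /,/' "$1"
}

# Show the header and the multi-allelic records.
multiallelic_sites demo_sites.vcf

# Count the multi-allelic records only (header lines excluded).
multiallelic_sites demo_sites.vcf | grep -vc '^#'
```

One caveat: a comma in ALT only means that more than one entry is listed. In GATK gVCF output, for example, ALT can carry the symbolic <NON_REF> allele next to a single real alternate allele, so such records would also match this test.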

With the bcftools norm command, this will become:

#CHROM  POS     ID          REF ALT
1       1581713 rs76922129  A   C
1       1581713 rs76922129  A   G


--------------------------

Obviously, in a single-sample VCF for germline DNA, one must question the validity of a multi-allelic call. These are very common, however, in multi-sample VCFs and also in cancer samples, where multiple tumour clones may have been sequenced together.

Many thanks Kevin - this is exactly the explanation I was seeking!