Locating Multiple Alternate Alleles in a GATK VCF File
3.6 years ago
oars ▴ 180

I have a VCF file with the standard header line:

#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878

When reading the output, is there a simple way to scan the data visually and see which sites have multiple alternate alleles? I'm also looking for a simple Linux tool to extract the multi-allelic sites from the VCF file. I found this one as a possibility (how to remove multiallelic from VCF):

awk '/#/{print;next}{if($5 !~ /,/ && length($5)==1 && length($4)==1){print}}' file.vcf

But I cannot figure out what causes this syntax error:

awk: cmd. line:1: /#/{print;next}{if($5 !~ /,/ && length($5)==1 && length($4)==1){print}}SRR1611183.gatk.vcf
awk: cmd. line:1:                                                                                  ^ syntax error
GATK VCF • 1.6k views

The link is dead. Can you repost it? I'd like to know how to do this using regular bash commands.


I have fixed it. Can you try again, James?

3.6 years ago

To just output multi-allelic sites, use:

bcftools view --min-alleles 3 --max-alleles 8 MyVariants.vcf

The --max-alleles value can be anything (or left out), but to select multi-allelic sites --min-alleles has to be at least 3, because the allele count includes REF as well as the ALT alleles.
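
For anyone who wants plain shell tools rather than bcftools, as asked in the comments above, roughly the same filter can be sketched with awk. This is only a minimal sketch, assuming an uncompressed, tab-delimited VCF; the function name multiallelic_sites is made up for illustration. It treats any comma in the ALT column as a multi-allelic site.

```shell
# Hypothetical helper: print the header lines plus every record whose
# ALT column (field 5) lists more than one allele, i.e. contains a comma.
# Assumes an uncompressed, tab-delimited VCF passed as the first argument.
multiallelic_sites() {
    awk -F'\t' '
        /^#/ { print; next }   # keep all header lines
        $5 ~ /,/               # print records with a comma in ALT
    ' "$1"
}

# Usage: multiallelic_sites MyVariants.vcf > multiallelic.vcf
```

Note that the one-liner quoted in the question does the opposite: it keeps only records with a single-base REF and a single-base ALT. The syntax error shown there has the file name fused onto the end of the program text, which most likely means there was no space between the closing quote and the file name.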

---------------------

If you then want to split multi-allelic calls into separate records, use:

bcftools norm -m-any MyVariants.vcf

Kevin


Thanks Kevin! This worked. Now I'm trying to understand where in the data I can actually see the multi-allelic sites.

Edit: I think I might understand this now. Looking at column 5 (ALT), are the records with alleles separated by a comma the sites with multiple alternate alleles?


Hello friend. Yes, I was not sure what you meant by 'visually'.

I've just pulled this example from a 'random' file on my own computer:

#CHROM  POS     ID          REF ALT
1       1581713 rs76922129  A   C,G

Indeed, column #5 is ALT and is where you'll see multiple alleles being reported, separated by commas. In this case, the total number of alleles at this position is 3, including the ref (A, C, G).

With the bcftools norm command, this will become:

#CHROM  POS     ID          REF ALT
1       1581713 rs76922129  A   C
1       1581713 rs76922129  A   G
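
To tabulate the allele count per site in the way it is counted above (REF plus each comma-separated ALT), something like the following awk sketch could be used. It assumes an uncompressed, tab-delimited VCF, and the function name allele_counts is hypothetical. A "." in ALT is treated as no alternate allele.

```shell
# Hypothetical helper: print CHROM, POS, REF, ALT and the total number
# of alleles at each site (REF plus every comma-separated ALT allele).
# Assumes an uncompressed, tab-delimited VCF passed as the first argument.
allele_counts() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }   # skip header lines
        {
            # "." in ALT means no alternate allele was reported
            n_alt = ($5 == ".") ? 0 : split($5, alts, ",")
            print $1, $2, $4, $5, n_alt + 1
        }
    ' "$1"
}

# Usage: allele_counts MyVariants.vcf
```

On the unsplit example record this would report 3 alleles; on each of the two split records it would report 2.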

--------------------------

Obviously, in a single-sample VCF for germline DNA, one must question the validity of a multi-allelic call. These are very common, however, in multi-sample VCFs and also in cancer samples, where multiple tumour clones may have been sequenced together.


Many thanks Kevin - this is exactly the explanation I was seeking!
