What does <*> mean in a vcf file?
4.1 years ago
Ketil 4.1k

Hi,

I'm running samtools (version 1.3.1, Ubuntu 17.04 default) to generate a VCF from a reference and some BAM files:

samtools mpileup --ff 0x800 -r my_contig -v -f my_genome.fa *.bam -o my.vcf

But in the VCF file, all lines have a format like:

my_contig    4       .       A       <*>     0       .       DP=1;I16=1,0,0,0,34,1156,0,0,0,0,0,0,0,0,0,0;QS=1,0;MQ0F=1      PL      0,0,0   0,3,4

In short, ALT (alternative allele) is set to be "<*>". From the VCF specification, * indicates a deletion, while brackets indicate some sort of ID string. To me, none of these make much sense, and here the depth is 1 - how can there be any variants here? (For actual polymorphic sites, ALT is something like G,<*>. As if that helps.)

I'm quite confused by this, and a subsequent 'bcftools' similarly fails:

Symbolic alleles other than <DEL> are currently not supported: <*> at my_contig:4

I can generate the consensus using a program I've written myself, but I would like to leverage whatever magic bcftools uses to QC polymorphisms, and also I think it is better to stick to more mainstream tools - providing they work, that is.


Was there ever a resolution to this problem? I'm also trying to get a consensus out of my vcf (same reason, I want the QC), and running into exactly the same error.


If you want a solution, i.e., to not have to deal with these ridiculous <*> calls, then piping bcftools mpileup into bcftools call should not result in these alleles being included in your final VCF. Check out my code here (Analysis Step 7): https://github.com/kevinblighe/ClinicalGradeDNAseq/blob/master/AnalysisMasterVersion1.sh


It's generated here: https://github.com/samtools/samtools/blob/master/bam2bcf.c#L741 I think there is no call/no ALT here. Did you run bcftools call with '--variants-only'?


I didn't use --variants-only (it's not an option to 'bcftools consensus', which I used). The output is from samtools and not bcftools, anyway. Thanks for the code pointer, but I can't really understand how this is supposed to work.


What's the VCF version? Check the first line of the VCF file (the ##fileformat header) to find its version.

4.1 years ago
d-cameron ★ 2.4k

In short, ALT (alternative allele) is set to be "<*>". From the VCF specification, * indicates a deletion, while brackets indicate some sort of ID string. To me, none of these make much sense, and here the depth is 1 - how can there be any variants here? (For actual polymorphic sites, ALT is something like G,<*>. As if that helps.)

The VCF specification includes multiple types of symbolic alleles, not all of which are listed in section 1.4.5 (this appears to be an oversight in the specification document). The relevant section for your question is section 5.5:

5.5 Representing unspecified alleles and REF-only blocks (gVCF)

In order to report sequencing data evidence for both variant and non-variant positions in the genome, the VCF specification allows to represent blocks of reference-only calls in a single record using the END INFO tag, an idea originally introduced by the gVCF file format. The convention adopted here is to represent reference evidence as likelihoods against an unknown alternate allele. Think of this as the likelihood for reference as compared to any other possible alternate allele (both SNP, indel, or otherwise). A symbolic alternate allele <*> is used to represent this unspecified alternate allele.

TL;DR: <*> is used to indicate homozygous reference sites.
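
To make this concrete, here is a minimal Python sketch that parses the record quoted in the question and checks whether its only ALT allele is the `<*>` placeholder. The helper names (`parse_record`, `is_reference_only`) are hypothetical and not part of samtools or bcftools, and the record is split on whitespace only because the copy pasted into this thread lost its tabs.

```python
# Record copied from the question. Real VCF files are tab-separated; this
# copy uses spaces, which is safe here because no field contains a space.
RECORD = ("my_contig 4 . A <*> 0 . "
          "DP=1;I16=1,0,0,0,34,1156,0,0,0,0,0,0,0,0,0,0;QS=1,0;MQ0F=1 "
          "PL 0,0,0 0,3,4")

UNSPECIFIED = "<*>"  # symbolic placeholder for "any other allele"


def parse_record(line):
    """Split one VCF data line into the fields discussed in this thread."""
    cols = line.split()
    # INFO is a ';'-separated list; keep only the key=value entries.
    info = dict(kv.split("=", 1) for kv in cols[7].split(";") if "=" in kv)
    return {
        "chrom": cols[0],
        "pos": int(cols[1]),
        "ref": cols[3],
        "alts": cols[4].split(","),
        "depth": int(info["DP"]),
        "format": cols[8].split(":"),
        "samples": cols[9:],
    }


def is_reference_only(rec):
    """True when the only ALT allele is the <*> placeholder."""
    return rec["alts"] == [UNSPECIFIED]


rec = parse_record(RECORD)
print(rec["ref"], rec["alts"], "DP =", rec["depth"])
print("reference-only site:", is_reference_only(rec))
```

Under this reading, the record reports evidence for the reference base A at depth 1 rather than a variant call.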


Sorry but can you please explain what you mean by homozygous reference sites in this context?


It simply means wild type, I think.


Upon a second reading I think it may be a wildcard placeholder for any other alternative allele?

So if I have an output like this:

A (REF) G,<*> (ALT) and the PL is: 0,57,255,78,255,255. So does this mean that the PLs correspond to: AA, AG, GG, A<*>, G<*>, <*><*>?
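
For what it's worth, the VCF specification orders diploid genotype likelihoods so that genotype j/k (with j <= k) sits at index k*(k+1)/2 + j. The sketch below applies that rule to the alleles and PL values from this example; it assumes a diploid sample and that the file follows the specification's ordering, and `genotype_order` is a hypothetical helper name.

```python
def genotype_order(alleles):
    """List diploid genotypes in VCF PL order.

    The genotype j/k (j <= k) has index k*(k+1)/2 + j, which is what
    iterating k in the outer loop and j in the inner loop produces.
    """
    order = []
    for k in range(len(alleles)):
        for j in range(k + 1):
            order.append((alleles[j], alleles[k]))
    return order


alleles = ["A", "G", "<*>"]          # REF first, then ALT alleles in order
pls = [0, 57, 255, 78, 255, 255]     # PL values from the example above

labels = ["/".join(pair) for pair in genotype_order(alleles)]
table = dict(zip(labels, pls))

for label, pl in table.items():
    print(f"{label:8s} PL={pl}")
```

With these assumptions the order is A/A, A/G, G/G, A/<*>, G/<*>, <*>/<*>, which matches the mapping proposed above.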


Unfortunately, the specs do not explicitly state whether the symbolic allele includes explicitly listed alt alleles or not.


The Illumina gVCF spec is equally vague; my best guess is that it denotes an unspecified alternative allele:

  • REF: Reference bases: A,C,G,T,N; there can be multiple bases. The value in the POS field refers to the position of the first base in the string. For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT strings include the base before the event. This modification is reflected in the POS field. The exception is when the event occurs at position 1 on the contig, in which case they include the base after the event. If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String "<id>"), the padding base is required. In that case, POS denotes the coordinate of the base preceding the polymorphism.

  • ALT: Comma-separated list of alternate non-reference alleles called on at least one of the samples. Options are:

    • Base strings made up of the bases A,C,G,T,N

    • Angle-bracketed ID String ("<id>")

    • Break-end replacement string as described in the section on break-ends.
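
The three ALT forms listed above can be told apart with a few string checks. This is only a rough Python sketch: `classify_alt` and its category names are made up for illustration, the `<*>` case is singled out because it is the allele under discussion, and the breakend test is a simple bracket check rather than a full parser.

```python
import re

BASES = re.compile(r"[ACGTN]+", re.IGNORECASE)   # plain base string
SYMBOLIC = re.compile(r"<[^<>\s]+>")             # angle-bracketed ID


def classify_alt(allele):
    """Return a rough category for a single ALT allele string."""
    if allele == "<*>":
        return "unspecified"      # the placeholder discussed in this thread
    if SYMBOLIC.fullmatch(allele):
        return "symbolic"         # e.g. <DEL>
    if "[" in allele or "]" in allele:
        return "breakend"         # break-end replacement string
    if BASES.fullmatch(allele):
        return "bases"
    return "other"


for alt_field in ["<*>", "G,<*>", "<DEL>"]:
    print(alt_field, [classify_alt(a) for a in alt_field.split(",")])
```

A custom filter could use a check like this to skip `<*>` explicitly instead of treating the asterisk as a wildcard.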


The <*> is not mentioned explicitly anywhere, but here are some more hints towards wild type:

GATK: What is a GVCF and how is it different from a 'regular' VCF?

SciLifeLab (AM Barrio): What is gVCF

4.1 years ago

Hi Ketil,

I encountered the same problem when I began using SAMtools to call variants again after years of not using it. I was somewhat surprised to learn that they chose (relatively recently) to introduce the asterisk into the VCF format specification to represent a special kind of 'deletion' (in quotation marks). I'm not sure that it's the best choice, as the asterisk is used as a wildcard character in every programming language I can think of. I would have thought that an underscore, lower-case 'd', or even a hyphen would have been a more suitable choice. It just means that one must take extra care when applying custom filtering methods to VCFs.

There is further information on the GATK website. Also, as anticipated, it has already caused problems. Finally, take a look at the official format specification release, dated May 2017 (see bottom of page 4).
