Question: What does <*> mean in a vcf file?
13 months ago by
Ketil3.9k
Germany
Ketil3.9k wrote:

Hi,

I'm running samtools (version 1.3.1, Ubuntu 17.04 default) to generate a VCF from a reference and some BAM files:

samtools mpileup --ff 0x800 -r my_contig -v -f my_genome.fa *.bam -o my.vcf

But in the VCF file, all lines have a format like:

my_contig    4       .       A       <*>     0       .       DP=1;I16=1,0,0,0,34,1156,0,0,0,0,0,0,0,0,0,0;QS=1,0;MQ0F=1      PL      0,0,0   0,3,4

In short, ALT (alternative allele) is set to "<*>". According to the VCF specification, * indicates a deletion, while angle brackets indicate some sort of ID string. To me, neither of these makes much sense, and here the depth is 1, so how can there be any variants? (For actual polymorphic sites, ALT is something like G,<*>. As if that helps.)

I'm quite confused by this, and a subsequent 'bcftools consensus' run fails as well:

Symbolic alleles other than <DEL> are currently not supported: <*> at my_contig:4

I can generate the consensus using a program I've written myself, but I would like to leverage whatever magic bcftools uses to QC polymorphisms, and I also think it is better to stick to more mainstream tools, provided they work, that is.

vcf bcf samtools bcftools • 1.7k views
modified 13 months ago by d-cameron1.9k • written 13 months ago by Ketil3.9k

It's generated here: https://github.com/samtools/samtools/blob/master/bam2bcf.c#L741 I think there is no call/no ALT here. Did you run bcftools call with '--variants-only'?

written 13 months ago by Pierre Lindenbaum115k

I didn't use --variants-only (it's not an option to 'bcftools consensus', which I used). The output is from samtools and not bcftools, anyway. Thanks for the code pointer, but I can't really understand how this is supposed to work.

written 13 months ago by Ketil3.9k

What's the VCF version? It is given on the first line of the VCF file (the ##fileformat header line).
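As a rough illustration of that check, here is a minimal Python sketch that pulls the version out of the first header line. The function name `vcf_version` and the sample header value are made up for the example; the actual value depends on the tool that wrote the file.

```python
def vcf_version(first_line):
    """Return the version from a '##fileformat=' header line, or None if absent."""
    prefix = "##fileformat="
    line = first_line.strip()
    if line.startswith(prefix):
        return line[len(prefix):]
    return None


# Hypothetical first line of a VCF file (not taken from the poster's output).
print(vcf_version("##fileformat=VCFv4.2\n"))
```

For a real file, passing the first line read from `my.vcf` to the function should be enough, assuming the file is uncompressed text.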

written 13 months ago by RamRS19k
13 months ago by
d-cameron1.9k
Australia
d-cameron1.9k wrote:

In short, ALT (alternative allele) is set to "<*>". According to the VCF specification, * indicates a deletion, while angle brackets indicate some sort of ID string. To me, neither of these makes much sense, and here the depth is 1, so how can there be any variants? (For actual polymorphic sites, ALT is something like G,<*>. As if that helps.)

The VCF specification includes multiple types of symbolic alleles, not all of which are listed in section 1.4.5 (this appears to be an oversight in the specification document). The relevant section for your question is section 5.5:

5.5 Representing unspecified alleles and REF-only blocks (gVCF)

In order to report sequencing data evidence for both variant and non-variant positions in the genome, the VCF specification allows to represent blocks of reference-only calls in a single record using the END INFO tag, an idea originally introduced by the gVCF file format. The convention adopted here is to represent reference evidence as likelihoods against an unknown alternate allele. Think of this as the likelihood for reference as compared to any other possible alternate allele (both SNP, indel, or otherwise). A symbolic alternate allele <*> is used to represent this unspecified alternate allele.

TL;DR: <*> is used to indicate homozygous reference sites.
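To make that concrete, here is a small Python sketch that splits the ALT column of the record from the question and checks whether the only alternate allele is the unspecified <*>. The helper names are invented for the illustration, and the columns are assumed to be tab-separated as in a regular VCF (the post shows them separated by spaces).

```python
def alt_alleles(record):
    """Return the list of ALT alleles of a single VCF data line."""
    fields = record.rstrip("\n").split("\t")
    alt = fields[4]  # columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
    return [] if alt == "." else alt.split(",")


def only_unspecified_alt(record):
    """True if the record carries no explicit ALT allele, only <*>."""
    return alt_alleles(record) == ["<*>"]


# The record from the question, with tabs assumed between columns.
record = (
    "my_contig\t4\t.\tA\t<*>\t0\t.\t"
    "DP=1;I16=1,0,0,0,34,1156,0,0,0,0,0,0,0,0,0,0;QS=1,0;MQ0F=1\t"
    "PL\t0,0,0\t0,3,4"
)
print(alt_alleles(record), only_unspecified_alt(record))
```

A record like this reports evidence for the reference allele against "anything else", so it is not a variant call in itself.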

modified 13 months ago • written 13 months ago by d-cameron1.9k

Sorry but can you please explain what you mean by homozygous reference sites in this context?

written 8 days ago by tbb210

It simply means wild type, I think.

written 8 days ago by RamRS19k

Upon a second reading I think it may be a wildcard placeholder for any other alternative allele?

So if I have an output like this:

A (REF) G,<*> (ALT) and the PL is: 0,57,255,78,255,255. So does this mean that the PLs correspond to: AA, AG, GG, A<*>, G<*>, <*><*>?
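For what it's worth, the ordering can be checked against the genotype ordering rule in the VCF specification: for a diploid genotype j/k with j <= k, the index into PL is k*(k+1)/2 + j, where allele 0 is REF and the ALT alleles follow in order. A minimal Python sketch (the function name is made up for the example):

```python
def diploid_genotype_order(alleles):
    """List diploid genotypes in PL order; genotype j/k (j <= k) sits at k*(k+1)/2 + j."""
    order = []
    for k in range(len(alleles)):
        for j in range(k + 1):
            order.append((alleles[j], alleles[k]))
    return order


alleles = ["A", "G", "<*>"]          # REF first, then ALT alleles in order
pl = [0, 57, 255, 78, 255, 255]      # PL values from the example above

for (a, b), value in zip(diploid_genotype_order(alleles), pl):
    print(f"{a}/{b}\t{value}")
```

This only shows which genotype each PL value belongs to; it does not settle whether <*> is meant to include the explicitly listed G allele.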

written 7 days ago by tbb210

Unfortunately, the specs do not explicitly state whether the symbolic allele includes explicitly listed alt alleles or not.

written 4 days ago by d-cameron1.9k
13 months ago by
Kevin Blighe33k
Republic of Ireland
Kevin Blighe33k wrote:

Hi Ketil,

I encountered the same problem when I started using SAMtools to call variants again after not having used it for years. I was somewhat surprised to learn that they chose to (relatively recently) introduce the asterisk into the VCF format specification to represent a special kind of 'deletion' (in quotation marks). I'm not sure it's the best choice, as the asterisk is used as a wildcard character in all of the programming languages that I can think of. I would have thought that an underscore, a lower-case 'd', or even a hyphen would have been more suitable choices. It just means that one must take extra care when applying custom filtering methods to VCFs.
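As an illustration of the kind of care meant here, the following Python sketch shows how an unescaped asterisk goes wrong in a regular expression. In the pattern `<*>` the asterisk is a quantifier ("zero or more `<`"), so the pattern really looks for a `>` optionally preceded by `<` characters. The list of ALT values is contrived for the example.

```python
import re

# Contrived ALT values; ">" is not a real allele, it is only here to expose the bug.
alts = ["G", "<*>", "*", ">"]

# Naive: "<*>" is read as a regex, so it also matches a bare ">".
naive = [a for a in alts if re.search("<*>", a)]

# Safer: compare strings directly, or escape the pattern before using it.
literal = [a for a in alts if a == "<*>"]
escaped = [a for a in alts if re.fullmatch(re.escape("<*>"), a)]

print(naive)    # ['<*>', '>']
print(literal)  # ['<*>']
print(escaped)  # ['<*>']
```

The same applies to shell globs and to grep patterns, where the asterisk usually needs quoting or escaping as well.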

There is further information on the GATK website. Also, as anticipated, it has already caused problems. Finally, take a look at the official format specification release, dated May 2017 (see bottom of page 4).

modified 8 days ago • written 13 months ago by Kevin Blighe33k