Question: What does <*> mean in a vcf file?
Ketil (Germany) wrote, 16 months ago:

Hi,

I'm running samtools (version 1.3.1, Ubuntu 17.04 default) to generate a VCF from a reference and some BAM files:

samtools mpileup --ff 0x800 -r my_contig -v -f my_genome.fa *.bam -o my.vcf

But in the VCF file, all lines have a format like:

my_contig    4       .       A       <*>     0       .       DP=1;I16=1,0,0,0,34,1156,0,0,0,0,0,0,0,0,0,0;QS=1,0;MQ0F=1      PL      0,0,0   0,3,4

In short, ALT (alternative allele) is set to be "<*>". From the VCF specification, * indicates a deletion, while angle brackets indicate some sort of ID string. To me, neither of these makes much sense, and here the depth is 1 - how can there be any variants here? (For actual polymorphic sites, ALT is something like G,<*>. As if that helps.)

I'm quite confused by this, and a subsequent 'bcftools' similarly fails:

Symbolic alleles other than <DEL> are currently not supported: <*> at my_contig:4

I can generate the consensus using a program I've written myself, but I would like to leverage whatever magic bcftools uses to QC polymorphisms, and also I think it is better to stick to more mainstream tools - providing they work, that is.

Tags: vcf, bcf, samtools, bcftools
modified 15 months ago by d-cameron • written 16 months ago by Ketil

It's generated here: https://github.com/samtools/samtools/blob/master/bam2bcf.c#L741 I think it means there is no call / no ALT allele here. Did you run bcftools call with '--variants-only'?

written 16 months ago by Pierre Lindenbaum

I didn't use --variants-only (it's not an option to 'bcftools consensus', which I used). The output is from samtools and not bcftools, anyway. Thanks for the code pointer, but I can't really understand how this is supposed to work.

written 16 months ago by Ketil

What's the VCF version? Check the first line of the VCF file (the ##fileformat header) to find its version.
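One way to do that check programmatically is sketched below. This is only an illustration: the helper names `parse_fileformat` and `read_vcf_version` are made up for this example, and it assumes the file starts with the usual `##fileformat=...` header line.

```python
import gzip


def parse_fileformat(first_line):
    """Return the version from a '##fileformat=VCFv4.x' header line, or None."""
    first_line = first_line.strip()
    if first_line.startswith("##fileformat="):
        return first_line.split("=", 1)[1]
    return None


def read_vcf_version(path):
    """Read only the first line of a plain or gzip-compressed VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return parse_fileformat(handle.readline())
```

With the file from the question, this would be called as `read_vcf_version("my.vcf")`.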

written 15 months ago by RamRS

d-cameron (Australia) wrote, 15 months ago:

In short, ALT (alternative allele) is set to be "<*>". From the VCF specification, * indicates a deletion, while angle brackets indicate some sort of ID string. To me, neither of these makes much sense, and here the depth is 1 - how can there be any variants here? (For actual polymorphic sites, ALT is something like G,<*>. As if that helps.)

The VCF specification includes multiple types of symbolic alleles, not all of which are listed in section 1.4.5 (this appears to be an oversight in the specification document). The relevant section for your question is section 5.5:

5.5 Representing unspecified alleles and REF-only blocks (gVCF)

In order to report sequencing data evidence for both variant and non-variant positions in the genome, the VCF specification allows to represent blocks of reference-only calls in a single record using the END INFO tag, an idea originally introduced by the gVCF file format† . The convention adopted here is to represent reference evidence as likelihoods against an unknown alternate allele. Think of this as the likelihood for reference as compared to any other possible alternate allele (both SNP, indel, or otherwise). A symbolic alternate allele <*> is used to represent this unspecified alternate allele

TL;DR: <*> is used to indicate homozygous reference sites.
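As a rough illustration of that reading, the Python sketch below flags records that list no concrete alternate allele, i.e. ALT is either "." or only <*>. The function names are hypothetical, the example record is a shortened version of the one in the question, and tab-separated columns are assumed.

```python
def alt_alleles(vcf_line):
    """Return the list of ALT alleles from one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    alt = fields[4]  # columns: CHROM, POS, ID, REF, ALT, ...
    return [] if alt == "." else alt.split(",")


def is_ref_only(vcf_line):
    """True when no concrete alternate allele is listed ('.' or only '<*>')."""
    return all(allele == "<*>" for allele in alt_alleles(vcf_line))


# Shortened version of the record from the question (INFO trimmed to DP=1).
line = "my_contig\t4\t.\tA\t<*>\t0\t.\tDP=1\tPL\t0,0,0"
print(is_ref_only(line))
```

A record with ALT set to G,<*> would not be flagged, because G is a concrete allele.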

modified 15 months ago • written 15 months ago by d-cameron

Sorry but can you please explain what you mean by homozygous reference sites in this context?

written 11 weeks ago by tbb2110

It simply means wild type, I think.

written 11 weeks ago by RamRS

Upon a second reading I think it may be a wildcard placeholder for any other alternative allele?

So if I have an output like this:

A (REF) G,<*> (ALT) and the PL is: 0,57,255,78,255,255. So does this mean that the PLs correspond to: AA, AG, GG, A<*>, G<*>, <*><*>?
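For what it's worth, the ordering itself can be checked against the rule in the VCF specification, where the diploid genotype j/k (with j <= k) is stored at index k*(k+1)/2 + j. The sketch below only applies that rule to the alleles above; the function name is made up for this example, and it does not settle what <*> is meant to cover.

```python
def pl_genotype_order(alleles):
    """List diploid genotypes in the order used for GL/PL values.

    Genotype j/k (j <= k) sits at index k*(k+1)/2 + j, so the outer
    loop runs over k and the inner loop over j.
    """
    order = []
    for k in range(len(alleles)):
        for j in range(k + 1):
            order.append((alleles[j], alleles[k]))
    return order


alleles = ["A", "G", "<*>"]        # REF followed by the ALT alleles
pls = [0, 57, 255, 78, 255, 255]   # PL values from the example above
for (first, second), pl in zip(pl_genotype_order(alleles), pls):
    print(f"{first}/{second}\t{pl}")
```

This pairs the six PL values with A/A, A/G, G/G, A/<*>, G/<*> and <*>/<*>, in that order.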

written 10 weeks ago by tbb2110

Unfortunately, the specs do not explicitly state whether the symbolic allele includes explicitly listed alt alleles or not.

written 10 weeks ago by d-cameron

The Illumina gVCF spec is equally vague; my best guess is that it is an unspecified alternative allele:

  • REF: Reference bases: A,C,G,T,N; there can be multiple bases. The value in the POS field refers to the position of the first base in the string. For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT strings include the base before the event. This modification is reflected in the POS field. The exception is when the event occurs at position 1 on the contig, in which case they include the base after the event. If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String "<id>"), the padding base is required. In that case, POS denotes the coordinate of the base preceding the polymorphism.

  • ALT: Comma-separated list of alternate non-reference alleles called on at least one of the samples. Options are:

    • Base strings made up of the bases A,C,G,T,N

    • Angle-bracketed ID String ("<id>")

    • Break-end replacement string as described in the section on break-ends.

modified 10 days ago • written 10 days ago by Carambakaracho

The <*> allele is not mentioned explicitly anywhere, but here are some more hints pointing towards wild type:

GATK: What is a GVCF and how is it different from a 'regular' VCF?

SciLifeLab (AM Barrio): What is gVCF

modified 10 days ago by RamRS • written 10 days ago by Carambakaracho

Kevin Blighe (Republic of Ireland) wrote, 16 months ago:

Hi Ketil,

I encountered the same problem after I began using SAMtools to call variants again after years of not having used it. I was somewhat surprised to learn that they chose (relatively recently) to introduce the asterisk into the VCF format specification to represent a special kind of 'deletion' (in quotation marks). I'm not sure it's the best choice, as the asterisk is used as a wildcard character in all of the programming languages that I can think of. I would have thought that an underscore, a lower-case 'd', or even a hyphen would have been more suitable. It just means that one must take extra care when applying custom filtering methods to VCFs.
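To illustrate the kind of care meant here, the small Python sketch below matches the alleles as literal strings. As far as I understand the specification, the bare * allele (a spanning deletion) and the symbolic <*> allele (an unspecified alternate allele) are two different things, so the example keeps them apart. The allele list is invented for the example.

```python
import re

alts = ["G", "*", "<*>", "<DEL>"]

# Plain string comparison treats '*' literally, so it is the safest option.
spanning_deletions = [a for a in alts if a == "*"]
unspecified = [a for a in alts if a == "<*>"]

# In a regular expression '*' is a quantifier; escape it before matching.
pattern = re.compile(re.escape("<*>"))
regex_hits = [a for a in alts if pattern.fullmatch(a)]

print(spanning_deletions, unspecified, regex_hits)
```

The same caution applies to shell globbing, where an unquoted * may be expanded before a filtering command ever sees it.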

There is further information on the GATK website. Also, as anticipated, it has already caused problems. Finally, take a look at the official format specification release, dated May 2017 (see bottom of page 4).

modified 11 weeks ago • written 16 months ago by Kevin Blighe