Question: structural variants in gnomAD and the VCF spec. Why is tabix/bcftools failing?
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 20 months ago:

I downloaded the SV sites VCF and its index from gnomAD:

wget -O gnomad_v2_sv.sites.vcf.gz ""

wget -O gnomad_v2_sv.sites.vcf.gz.tbi ""

First, an observation. For BND, the END value is used in the gnomAD browser as the 'END' position of the second junction:

e.g: ; ;

(I submitted an issue.)

Here is my problem: there is this variant:

$ bcftools view gnomad_v2_sv.sites.vcf.gz  | grep gnomAD_v2_CTX_12_13 -m1

1) The Broad uses SVTYPE=CTX: isn't that against the VCF spec?

Value should be one of DEL, INS, DUP, INV, CNV, BND.

2) The Broad uses INFO/END as the second locus of the translocation: here the record sits at chr12:60718971, while END=57020218 is a position on chr13. Isn't that against the VCF spec?

3) And finally, why can't I find this variant with bcftools (1.9-94-g9589876) or tabix?

$ bcftools view gnomad_v2_sv.sites.vcf.gz "12:60718970-60718972" | grep gnomAD_v2_CTX_12_13
$ tabix gnomad_v2_sv.sites.vcf.gz "12:60718970-60718972" | grep gnomAD_v2_CTX_12_13

Isn't SVTYPE=CTX a hold-over from TIGRA-SV? Maybe it's due to support in GRanges readVCF for non-compliant VCF SV types? Not that this would help with bcftools, but it might explain how these SV annotations made it into gnomAD.

StructuralVariantAnnotation supports structural variants reported in the following VCF notations:

  • Non-symbolic allele
  • Symbolic allele with SVTYPE of DEL, INS, and DUP
  • Breakpoint notation SVTYPE=BND
  • Single breakend notation

In addition to parsing spec-compliant VCFs, additional logic has been added to enable parsing of non-compliant variants for the following callers:

  • Pindel (SVTYPE=RPL)
  • manta (INV3, INV5 fields)
  • Delly (SVTYPE=TRA, CHR2, CT fields)
  • TIGRA (SVTYPE=CTX)

written 20 months ago by Garan
John Marshall (Glasgow, Scotland) wrote, 20 months ago:

1) That says should, not must. To my reading, that means you can invent your own keywords as well, though it behooves you to describe them well. The text in the VCF 4.3 spec (which, once again, I recommend as being clearer and more detailed than the earlier documents) has

The reserved values must be used for the types listed below:

  • DEL: Deletion relative to the reference
  • INS: Insertion of novel sequence relative to the reference
  • DUP: Region of elevated copy number relative to the reference
  • INV: Inversion of reference sequence
  • CNV: Copy number variable region (may be both deletion and duplication)
  • BND: Breakend

which to my mind still means you are free to invent your own keywords for other things. Apparently they would rather use their own CTX (Reciprocal chromosomal translocation) than the specification's admittedly cumbersome BND.

2) “The Broad uses INFO/END as the second locus of the translocation”, while the VCF specification describes it as

End position (for use with symbolic alleles)

and further specifies it in §3 as

For precise variants, END is POS + length of REF allele − 1, and the for imprecise variants the corresponding best estimate.

gnomAD's use of END with a different meaning is in direct contravention of the VCF specification.

3) Indexing a VCF file uses CHROM and POS as the start of the interval in which the record lies, and uses a combination of REF length, ALT length, and INFO/END to determine the end of the interval. So gnomAD's abuse of END, which here holds a coordinate on a different chromosome, leads to a broken index.
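
As a rough illustration of the failure, here is a minimal Python sketch of the interval logic described above. It is a simplified model, not htslib's actual code: it assumes the indexed interval runs from POS to INFO/END when END is present, and to POS + length of REF − 1 otherwise. The REF base `N`, the trimmed INFO string, and the replacement tag name `SV_END` are made up for the example; POS and END are the values from the question.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    tags = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        tags[key] = value if sep else True
    return tags


def record_interval(pos, ref, info):
    """Return the 1-based inclusive (start, end) a simplified indexer would store."""
    tags = parse_info(info)
    if "END" in tags:
        # Simplifying assumption: INFO/END, when present, is taken at face value.
        end = int(tags["END"])
    else:
        end = pos + len(ref) - 1
    return pos, end


def overlaps(interval, qstart, qend):
    """True if the stored interval intersects the query range."""
    start, end = interval
    return start <= qend and end >= qstart


# POS and END come from the question; REF "N" and SV_END are illustrative.
broken = record_interval(60718971, "N", "END=57020218;SVTYPE=CTX")
fixed = record_interval(60718971, "N", "SV_END=57020218;SVTYPE=CTX")

print(broken, overlaps(broken, 60718970, 60718972))
print(fixed, overlaps(fixed, 60718970, 60718972))
```

Under this model the record's stored end (57020218) lies before its start (60718971), so a query around 12:60718971 never overlaps it. Once END no longer carries the chr13 coordinate, the interval collapses to the single base at POS and the query matches.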

If you decompress gnomad_v2_sv.sites.vcf.gz, use sed or your text editor to change all the END= to a different identifier, and then bgzip and bcftools index it again, you will find that the index works as expected and this 12:60718970-60718972 query finds just that one variant record.
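
The END renaming step can also be sketched in Python instead of sed. This version only touches the exact INFO key END, so tags whose names merely end in END (such as CIEND) would be left alone, and it renames the matching ##INFO header line too. The new tag name `SV_END`, the function names, and the sample record's REF, ALT, QUAL and FILTER values are assumptions for illustration only. The output file still has to be compressed with bgzip and indexed again, as described above.

```python
import gzip

NEW_KEY = "SV_END"  # hypothetical replacement tag name


def rename_end(line, new_key=NEW_KEY):
    """Rename the INFO key END (and its header definition) in one VCF line."""
    if line.startswith("##INFO=<ID=END,"):
        return line.replace("##INFO=<ID=END,", f"##INFO=<ID={new_key},", 1)
    if line.startswith("#"):
        return line  # other header lines are passed through unchanged
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return line  # not a complete VCF record; leave it alone
    # Column 8 is INFO: rewrite only items whose key is exactly END.
    fields[7] = ";".join(
        new_key + item[3:] if item.startswith("END=") else item
        for item in fields[7].split(";")
    )
    return "\t".join(fields) + "\n"


def rewrite(src_gz, dst_vcf):
    """Read a (b)gzipped VCF and write an uncompressed copy with END renamed."""
    with gzip.open(src_gz, "rt") as fin, open(dst_vcf, "w") as fout:
        for line in fin:
            fout.write(rename_end(line))


# Illustrative record: only ID, POS and END are taken from the question.
record = "12\t60718971\tgnomAD_v2_CTX_12_13\tN\t<CTX>\t1\tPASS\tEND=57020218;SVTYPE=CTX;CIEND=0,1\n"
print(rename_end(record), end="")
```

Calling `rewrite("gnomad_v2_sv.sites.vcf.gz", "renamed.vcf")` would produce the plain-text file to recompress; `renamed.vcf` is a placeholder name.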

This indexing problem is a pretty good demonstration of why gnomAD should not be inventing its own meaning for INFO/END or other VCF-spec-prescribed tags!


To be sure, the spec doesn't give you any hint that this is important and used by the tools to construct indexes — sadly this is par for the course for this specification…

written 20 months ago by John Marshall

This spec infelicity is now

written 20 months ago by John Marshall

Many thanks for this quick response, John. I just opened an issue on GitHub:

written 20 months ago by Pierre Lindenbaum