structural variants in gnomad and the VCF spec. Why is tabix/bcftools failing ?
1
2
Entering edit mode
2.1 years ago

wget -O gnomad_v2_sv.sites.vcf.gz "https://storage.googleapis.com/gnomad-public/papers/2019-sv/gnomad_v2_sv.sites.vcf.gz"



First, an observation. For BND, the END value is used in the gnomad browser as the 'END' position of the second junction:

( I submitted an issue https://github.com/macarthur-lab/gnomad_browser/issues/139 )

Here is my problem: there is this variant:

$bcftools view gnomad_v2_sv.sites.vcf.gz | grep gnomAD_v2_CTX_12_13 -m1 12 60718971 gnomAD_v2_CTX_12_13 N <CTX> 999 PASS END=57020218;SVTYPE=CTX;CHR2=13;SVLEN=-1;ALGORITHMS=manta;EVIDENCE=PE;CPX_TYPE=CTX_PP/QQ;PROTEIN_CODING__NEAREST_TSS=SLC16A7;PROTEIN_CODING__INTERGENIC;AN=21476;AC=1;AF=4.7e-05;N_BI_GENOS=10738;N_HOMREF=10737;N_HET=1;N_HOMALT=0;FREQ_HOMREF=0.999907;FREQ_HET=9.31272e-05;FREQ_HOMALT=0;AFR_AN=9480;AFR_AC=0;AFR_AF=0;AFR_N_BI_GENOS=4740;AFR_N_HOMREF=4740;AFR_N_HET=0;AFR_N_HOMALT=0;AFR_FREQ_HOMREF=1;AFR_FREQ_HET=0;AFR_FREQ_HOMALT=0;AMR_AN=1784;AMR_AC=1;AMR_AF=0.000561;AMR_N_BI_GENOS=892;AMR_N_HOMREF=891;AMR_N_HET=1;AMR_N_HOMALT=0;AMR_FREQ_HOMREF=0.998879;AMR_FREQ_HET=0.00112108;AMR_FREQ_HOMALT=0;EAS_AN=2226;EAS_AC=0;EAS_AF=0;EAS_N_BI_GENOS=1113;EAS_N_HOMREF=1113;EAS_N_HET=0;EAS_N_HOMALT=0;EAS_FREQ_HOMREF=1;EAS_FREQ_HET=0;EAS_FREQ_HOMALT=0;EUR_AN=7598;EUR_AC=0;EUR_AF=0;EUR_N_BI_GENOS=3799;EUR_N_HOMREF=3799;EUR_N_HET=0;EUR_N_HOMALT=0;EUR_FREQ_HOMREF=1;EUR_FREQ_HET=0;EUR_FREQ_HOMALT=0;OTH_AN=388;OTH_AC=0;OTH_AF=0;OTH_N_BI_GENOS=194;OTH_N_HOMREF=194;OTH_N_HET=0;OTH_N_HOMALT=0;OTH_FREQ_HOMREF=1;OTH_FREQ_HET=0;OTH_FREQ_HOMALT=0;POPMAX_AF=0.000561  1) The Broad uses SVTYPE=CTX : isn't it against the VCF spec ? Value should be one of DEL, INS, DUP, INV, CNV, BND. 2) The Broad uses INFO/END as the second locus of the translocation. here chr12->60718971 chr13/END=57020218 isn't it against the VCF spec ? 3) And eventually, why can't I find this variant with bcftools (1.9-94-g9589876) or tabix ?? $ bcftools view gnomad_v2_sv.sites.vcf.gz "12:60718970-60718972" | grep gnomAD_v2_CTX_12_13
 tabix gnomad_v2_sv.sites.vcf.gz "12:60718970-60718972" | grep gnomAD_v2_CTX_12_13
\$

vcf specification gnomad sv structural variant • 1.3k views
0
Entering edit mode

Isn't SVTYPE=CTX a hold-over from TIGRA-SV? Maybe it's due to support in GRanges readVCF for non-compliant VCF SV types? Not that this would help with BCFTools but it might be an explanation as to how these SV annotations made it into Gnomad.

https://bioconductor.org/packages/devel/bioc/vignettes/StructuralVariantAnnotation/inst/doc/vignettes.html

StructuralVariantAnnotation support structural variants reported in the following VCF notations: Non-symbolic allele Symbolic allele with SVTYPE of DEL, INS, and DUP. Breakpoint notation SVTYPE=BND Single breakend notation In addition to parsing spec-compliant VCFs, additional logic has been added to enable parsing of non-compliant variants for the following callers: Pindel (SVTYPE=RPL) manta (INv3, INV5 fields) Delly (SVTYPE=TRA, CHR2, CT fields) TIGRA (SVTYPE=CTX)

6
Entering edit mode
2.1 years ago

1) That says should, not must. To my reading, that means you can invent your own keywords as well, though it behooves you to describe them well. The text in the VCF 4.3 spec (which, once again, I recommend as being clearer and more detailed than the earlier documents) has

The reserved values must be used for the types listed below:

• DEL: Deletion relative to the reference
• INS: Insertion of novel sequence relative to the reference
• DUP: Region of elevated copy number relative to the reference
• INV: Inversion of reference sequence
• CNV: Copy number variable region (may be both deletion and duplication)
• BND: Breakend

which to my mind still means you are free to invent your own keywords for other things. Apparently they would rather use their own CTX (Reciprocal chromosomal translocation) than the specification's admittedly cumbersome BND.

2) “The Broad uses INFO/END as the second locus of the translocation”, while the VCF specification describes it as

End position (for use with symbolic alleles)

and further specifies it in §3 as

For precise variants, END is POS + length of REF allele − 1, and the for imprecise variants the corresponding best estimate.

GnomAD's use of END with a different meaning is in direct contravention of the VCF specification.

3) Indexing VCF uses the CHROM and POS as the start of the interval in which the record lies, and uses a combination of REF length, ALT length, and INFO/END to determine the end of the interval. So gnomAD's abuse of END containing unrelated values leads to a broken index.

If you decompress gnomad_v2_sv.sites.vcf.gz, use sed or your text editor to change all the END= to a different identifier, and then bgzip and bcftools index it again, you will find that the index works as expected and this 12:60718970-60718972 query finds just that one variant record.

This indexing problem is a pretty good demonstration of why gnomAD should not be inventing its own meaning for INFO/END or other VCF-spec-prescribed tags!

1
Entering edit mode

To be sure, the spec doesn't give you any hint that this is important and used by the tools to construct indexes — sadly this is par for the course for this specification…

1
Entering edit mode
0
Entering edit mode

many thanks for this quick response John, I just opened an issue on github : https://github.com/macarthur-lab/gnomad_browser/issues/140

Traffic: 1601 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.