Question

Format VCF for dbSNP upload

9.7 years ago

willgilks ▴ 360

Prior to submission to NCBI dbSNP, a VCF generated by e.g. HaplotypeCaller requires several modifications:

Addition of in-house identifiers. --> done
Exclude variants whose alternate allele is "*", i.e. the site lies within a deletion. --> probably use SelectVariants or VariantFiltration.
Exclude variants whose ref or alt allele is longer than 50 bp. --> SelectVariants or VariantFiltration --maxIndelSize 50
Exclude variants whose ref and alt alleles do not share a common leading base. --> Not sure ... removing larger indels won't exclude all of these.
Add VRT (variant type) to the INFO field, e.g. VRT=1 for an SNV, VRT=2 for an indel, etc. --> Use GATK + SNPeff.
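
The three exclusion rules above can be sketched with awk alone, without any GATK tool. This is only a rough illustration: the function name `filter_for_dbsnp` is made up, and it assumes the "common leading base" rule applies only to alleles whose length differs from REF (i.e. indels), which is an interpretation of the submission guidelines rather than something confirmed here.

```shell
# Hypothetical helper: reads a VCF on stdin, writes the filtered VCF to stdout.
filter_for_dbsnp() {
    awk -F'\t' -v max=50 '
        /^#/ { print; next }                      # keep header lines untouched
        {
            ref = $4
            if (length(ref) > max) next           # REF longer than 50 bp
            n = split($5, alts, ",")
            for (i = 1; i <= n; i++) {
                alt = alts[i]
                if (alt == "*") next              # spanning-deletion allele
                if (length(alt) > max) next       # ALT longer than 50 bp
                # Assumed rule: an indel allele must share its first base with REF.
                if (length(alt) != length(ref) && substr(alt, 1, 1) != substr(ref, 1, 1)) next
            }
            print
        }
    '
}
```

It could be used as `filter_for_dbsnp < in.vcf > filtered.vcf`, where both file names are placeholders. A record is dropped as soon as any one of its ALT alleles fails a rule, so multiallelic sites are excluded whole rather than trimmed.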

Could anyone provide any good tips on excluding and annotating variants appropriately for NCBI?

I'm looking into all of this today, and will post if I get any solutions.

dbSNP submission format: http://www.ncbi.nlm.nih.gov/SNP/docs/dbSNP_VCF_Submission.pdf

formatting vcf gatk dbSNP SNPeff • 2.5k views

score 1 · Answer 1 · 2017-01-30

Answering my own question. Core of the solution is roughly:

## Remove variants with a null ("*") alternate allele, wherever it appears in the ALT column.
awk -F'\t' '/^#/ || $5 !~ /(^|,)\*(,|$)/' "basic.f1.${vcf}" > "naa.basic.f1.${vcf}"

## In the header, add the file date, laboratory handle, batch, reference assembly and population ID after the fileformat line.
## Replace the Broad-GATK variant type annotation with the NCBI dbSNP VRT code.

sed -e 's|##fileformat=VCFv4.1|##fileformat=VCFv4.1\n##fileDate=20160423\n##handle=MORROW_EBE_SUSSEX\n##batch=GILKS_LHM_RG\n##reference=GCA_000001215.4\n##population_id=LHM_RG_hemiclones|g' \
    -e 's|;VariantType=SNP;set=variant|;VRT=1|g' \
    -e 's|;VariantType=MULTIALLELIC_SNP;set=variant|;VRT=1|g' \
    -e 's|;VariantType=INSERTION.*;set=variant|;VRT=2|g' \
    -e 's|;VariantType=DELETION.*;set=variant|;VRT=2|g' \
    -e 's|;VariantType=MULTIALLELIC_COMPLEX.Other;set=variant|;VRT=8|g' \
    -e 's|;VariantType=MULTIALLELIC_COMPLEX;set=variant|;VRT=8|g' \
    -e 's|;VariantType=MULTIALLELIC_MIXED.Other;set=variant|;VRT=8|g' \
    -e 's|;VariantType=MULTIALLELIC_MIXED;set=variant|;VRT=8|g' \
    -e 's|INFO=<ID=VariantType,Number=1,Type=String,Description="Variant type description">|INFO=<ID=VRT,Number=1,Type=Integer,Description="Variation type,1 - SNV: single nucleotide variation,2 - DIV: deletion/insertion variation,3 - HETEROZYGOUS: variable, but undefined at nucleotide level,4 - STR: short tandem repeat (microsatellite) variation, 5 - NAMED: insertion/deletion variation of named repetitive element,6 - NO VARIATON: sequence scanned for variation, but none observed,7 - MIXED: cluster contains submissions from 2 or more allelic classes (not used),8 - MNV: multiple nucleotide variation with alleles of common length greater than 1,9 - Exception">|g' naa.basic.f1.${vcf} > format.naa.basic.f1.${vcf}
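
As an alternative to pattern-matching on GATK's VariantType strings, the VRT code could be derived from the REF and ALT alleles themselves. The sketch below is only an illustration with a made-up function name, `vrt_code`, and it classifies differently from the sed mapping above: any site with a length-changing allele becomes VRT=2, whereas the sed commands assign mixed and complex multiallelic sites to VRT=8. Which behaviour dbSNP expects should be checked against the submission guidelines.

```shell
# Hypothetical helper: prints a dbSNP VRT code for a REF allele and a comma-separated ALT list.
vrt_code() {
    awk -v ref="$1" -v alts="$2" 'BEGIN {
        n = split(alts, a, ",")
        rl = length(ref)
        for (i = 1; i <= n; i++)
            if (length(a[i]) != rl) { print 2; exit }   # any length change: DIV
        print (rl == 1 ? 1 : 8)                         # single base: SNV; longer: MNV
    }'
}
```

For example, `vrt_code A G` would give the SNV code and `vrt_code AT A` the deletion/insertion code.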

## Re-add GATK variant type for completeness and VCF indexing.

GenomeAnalysisTK -R ${refseq} \
    -T VariantAnnotator \
    -V format.naa.basic.f1.${vcf} \
    -A VariantType \
    -o dbSNP.${vcf}
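
Before uploading, it may be worth checking that the final file has no leftover "*" alleles and that every record carries a VRT code. A minimal sketch, with a made-up function name `count_dbsnp_problems` and assuming a standard tab-delimited VCF with ALT in column 5 and INFO in column 8:

```shell
# Hypothetical helper: prints the number of data records that still need attention.
count_dbsnp_problems() {
    awk -F'\t' '
        /^#/ { next }                                  # skip header lines
        $5 ~ /(^|,)\*(,|$)/          { bad++; next }   # "*" allele still present
        $8 !~ /(^|;)VRT=[0-9]+(;|$)/ { bad++ }         # INFO lacks a VRT code
        END { print bad + 0 }
    ' "$1"
}
```

Running `count_dbsnp_problems dbSNP.${vcf}` should print 0 if both conditions hold for every record; any other number means some VariantType string was not caught by the sed patterns or a "*" allele slipped through.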