Format VCF for dbSNP upload
8.3 years ago
willgilks ▴ 360

Prior to submission to NCBI dbSNP, a VCF generated by e.g. HaplotypeCaller requires several modifications:

  1. Addition of in-house identifiers. --> done
  2. Exclude if the alternate allele is "*", i.e. the site lies within a deletion. --> probably use SelectVariants or FilterVariants.
  3. Exclude if the ref or alt allele is longer than 50 bp. --> SelectVariants or FilterVariants --maxIndelSize 50
  4. Exclude if the ref and alt alleles do not have a common leading base. --> Not sure ... removing larger indels won't exclude all of these.
  5. Add VRT (variant type) to the INFO field. --> e.g. VRT=1 for an SNV, VRT=2 for an indel, etc. Use GATK + SnpEff.

Could anyone provide any good tips on excluding and annotating variants appropriately for NCBI?

I'm looking into all of this today, and will post if I get any solutions.

dbSNP submission format

formatting vcf gatk dbSNP SNPeff • 2.2k views
7.1 years ago
willgilks ▴ 360

Answering my own question. Core of the solution is roughly:

## Remove variants with a null ("*") alternate allele.
sed '/\,\*/d' basic.f1.${vcf} > naa.basic.f1.${vcf}
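
The sed line above deletes any line containing `,*`, so it relies on "*" never being the first or only ALT allele and on `,*` not appearing elsewhere in the record. A column-aware alternative might look like the sketch below. It assumes a tab-delimited, uncompressed VCF with ALT in column 5; `drop_star_alt` is a hypothetical helper name, not part of GATK or any dbSNP tooling.

```shell
# Drop records whose ALT column (field 5) contains the "*" allele,
# whether it is the only ALT or one of several.
# drop_star_alt is a hypothetical helper name used for illustration.
drop_star_alt() {
    awk -F'\t' '
        /^#/ { print; next }              # pass header lines through unchanged
        {
            n = split($5, alt, ",")       # ALT may hold several alleles
            for (i = 1; i <= n; i++)
                if (alt[i] == "*") next   # skip the whole record
            print
        }
    '
}

# Usage, following the file names in this answer:
# drop_star_alt < basic.f1.${vcf} > naa.basic.f1.${vcf}
```

Because only the ALT field is inspected, header lines and INFO values that happen to contain `,*` are left alone.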

## In the header, add more info after fileformat: file date, my laboratory handle, batch, reference assembly and population ID.
## Replace the Broad-GATK variant type annotation with the NCBI dbSNP VRT code, in both the records and the INFO header line.

sed -e 's|##fileformat=VCFv4.1|##fileformat=VCFv4.1\n##fileDate=20160423\n##handle=MORROW_EBE_SUSSEX\n##batch=GILKS_LHM_RG\n##reference=GCA_000001215.4\n##population_id=LHM_RG_hemiclones|g' \
    -e 's|;VariantType=SNP;set=variant|;VRT=1|g' \
    -e 's|;VariantType=MULTIALLELIC_SNP;set=variant|;VRT=1|g' \
    -e 's|;VariantType=INSERTION.*;set=variant|;VRT=2|g' \
    -e 's|;VariantType=DELETION.*;set=variant|;VRT=2|g' \
    -e 's|;VariantType=MULTIALLELIC_COMPLEX.Other;set=variant|;VRT=8|g' \
    -e 's|;VariantType=MULTIALLELIC_COMPLEX;set=variant|;VRT=8|g' \
    -e 's|;VariantType=MULTIALLELIC_MIXED.Other;set=variant|;VRT=8|g' \
    -e 's|;VariantType=MULTIALLELIC_MIXED;set=variant|;VRT=8|g' \
    -e 's|INFO=<ID=VariantType,Number=1,Type=String,Description="Variant type description">|INFO=<ID=VRT,Number=1,Type=Integer,Description="Variation type,1 - SNV: single nucleotide variation,2 - DIV: deletion/insertion variation,3 - HETEROZYGOUS: variable, but undefined at nucleotide level,4 - STR: short tandem repeat (microsatellite) variation, 5 - NAMED: insertion/deletion variation of named repetitive element,6 - NO VARIATON: sequence scanned for variation, but none observed,7 - MIXED: cluster contains submissions from 2 or more allelic classes (not used),8 - MNV: multiple nucleotide variation with alleles of common length greater than 1,9 - Exception">|g' naa.basic.f1.${vcf} > format.naa.basic.f1.${vcf}
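
The substitutions above only fire when `;VariantType=...;set=variant` appears exactly as written, so it can be worth checking afterwards that every record received a VRT value. The following is a minimal sketch of such a check, assuming a tab-delimited VCF with INFO in column 8; `count_vrt` is a hypothetical helper name.

```shell
# Tally VRT values across all records; records with no VRT key in INFO
# are reported under "missing".
# count_vrt is a hypothetical helper name used for illustration.
count_vrt() {
    awk -F'\t' '
        /^#/ { next }                          # ignore header lines
        {
            vrt = "missing"
            n = split($8, info, ";")           # INFO is field 8
            for (i = 1; i <= n; i++)
                if (info[i] ~ /^VRT=/) vrt = substr(info[i], 5)
            count[vrt]++
        }
        END { for (v in count) print v "\t" count[v] }
    ' | LC_ALL=C sort
}

# Usage, following the file names in this answer:
# count_vrt < format.naa.basic.f1.${vcf}
```

A non-zero "missing" count would suggest that some VariantType strings were not matched by any of the sed expressions.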

## Re-add GATK variant type for completeness and VCF indexing.

GenomeAnalysisTK -R ${refseq} \
    -T VariantAnnotator \
    -V format.naa.basic.f1.${vcf} \
    -A VariantType \
    -o dbSNP.${vcf}
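
Items 3 and 4 from the question (alleles longer than 50 bp, and ref/alt alleles without a common leading base) are not shown in the commands above and may have been handled in an earlier step. One possible way to apply both is sketched below. It assumes a tab-delimited VCF with REF in column 4 and ALT in column 5, that "*" alleles have already been removed, and that the leading-base rule only applies to length-changing alleles, since an SNV differs at its first base by definition. `filter_alleles` is a hypothetical helper name.

```shell
# Keep only records where REF and every ALT allele are at most 50 bp and
# every length-changing ALT shares its first base with REF.
# filter_alleles is a hypothetical helper name used for illustration.
filter_alleles() {
    awk -F'\t' -v max=50 '
        /^#/ { print; next }                    # pass header lines through
        {
            if (length($4) > max) next          # REF too long
            n = split($5, alt, ",")
            for (i = 1; i <= n; i++) {
                if (length(alt[i]) > max) next  # ALT too long
                # Assumption: only indel-like alleles need a shared leading base
                if (length(alt[i]) != length($4) &&
                    substr(alt[i], 1, 1) != substr($4, 1, 1)) next
            }
            print
        }
    '
}

# Usage (input and output names are placeholders):
# filter_alleles < input.vcf > filtered.vcf
```

A record is dropped if any one of its ALT alleles fails either test, which is stricter than trimming the offending allele but keeps the genotype columns consistent.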
