Question: Format VCF for dbSNP upload
4.7 years ago by
willgilks300
United Kingdom
willgilks300 wrote:

Prior to submission to NCBI dbSNP, a VCF generated by e.g. HaplotypeCaller requires several modifications:

  1. Add in-house identifiers.
     Done.
  2. Exclude variants whose alternate allele is "*", i.e. sites that lie within a deletion.
     Probably use SelectVariants or VariantFiltration.
  3. Exclude variants whose ref or alt allele is longer than 50 bp.
     SelectVariants with --maxIndelSize 50, or VariantFiltration.
  4. Exclude variants whose ref and alt alleles do not have a common leading base.
     Not sure ... removing larger indels won't exclude all of these.
  5. Add VRT (variant type) to the INFO field,
     e.g. VRT=1 for an SNV, VRT=2 for an indel, etc. Use GATK + SnpEff.
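Points 3 and 4 can also be checked directly on the REF and ALT columns. The following is only a minimal sketch, not a tested submission pipeline: the function name `size_and_anchor_filter` is hypothetical, the 50 bp limit is taken from point 3, and it assumes that the "common leading base" rule applies only when REF and ALT differ in length (an SNV can never share its single base). The "*" allele is skipped here because point 2 deals with it separately.

```shell
# Hypothetical helper: read a VCF on stdin, write the filtered VCF to stdout.
size_and_anchor_filter() {
    awk -F'\t' -v max=50 '
        /^#/ { print; next }                      # pass header lines through
        {
            ref = $4
            if (length(ref) > max) next           # point 3: REF longer than max
            n = split($5, alt, ",")
            for (i = 1; i <= n; i++) {
                a = alt[i]
                if (a == "*") continue            # handled by a separate step
                if (length(a) > max) next         # point 3: ALT longer than max
                # point 4 (assumed to apply to length-changing alleles only)
                if (length(a) != length(ref) &&
                    substr(a, 1, 1) != substr(ref, 1, 1)) next
            }
            print
        }'
}
```

Usage would be along the lines of `size_and_anchor_filter < in.vcf > out.vcf`, where both file names are placeholders.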

Could anyone provide good tips on excluding and annotating variants appropriately for NCBI?

I'm looking into all of this today, and will post if I get any solutions.

dbSNP submission format: http://www.ncbi.nlm.nih.gov/SNP/docs/dbSNP_VCF_Submission.pdf

 

3.5 years ago by
willgilks300
United Kingdom
willgilks300 wrote:

Answering my own question. The core of the solution is roughly:

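## Remove variants that list the "*" (spanning deletion) allele after another alternate allele.
## The pattern matches ",*" anywhere on the line.
sed '/,\*/d' "basic.f1.${vcf}" > "naa.basic.f1.${vcf}"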
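Because the sed pattern looks for ",*" on the whole line, it would miss a record whose only alternate allele is "*" and could in principle match text outside the ALT column. A column-aware variant might look like the sketch below; the function name `drop_spanning_deletions` is hypothetical, and it assumes a standard tab-separated VCF with ALT in the fifth column.

```shell
# Hypothetical helper: read a VCF on stdin, drop records with "*" among the ALT alleles.
drop_spanning_deletions() {
    awk -F'\t' '
        /^#/ { print; next }               # keep header lines unchanged
        {
            n = split($5, alt, ",")        # ALT may hold several alleles
            for (i = 1; i <= n; i++)
                if (alt[i] == "*") next    # skip the whole record
            print
        }'
}
```

It could stand in for the sed call, e.g. `drop_spanning_deletions < "basic.f1.${vcf}" > "naa.basic.f1.${vcf}"`.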

## In the header, add more information after the fileformat line: file date, my laboratory
## handle, batch, reference assembly and population ID (the \n in the replacement assumes GNU sed).
## In the records and the header, replace the Broad-GATK VariantType annotation with the NCBI dbSNP VRT code.

sed -e 's|##fileformat=VCFv4.1|##fileformat=VCFv4.1\n##fileDate=20160423\n##handle=MORROW_EBE_SUSSEX\n##batch=GILKS_LHM_RG\n##reference=GCA_000001215.4\n##population_id=LHM_RG_hemiclones|g' \
    -e 's|;VariantType=SNP;set=variant|;VRT=1|g' \
    -e 's|;VariantType=MULTIALLELIC_SNP;set=variant|;VRT=1|g' \
    -e 's|;VariantType=INSERTION.*;set=variant|;VRT=2|g' \
    -e 's|;VariantType=DELETION.*;set=variant|;VRT=2|g' \
    -e 's|;VariantType=MULTIALLELIC_COMPLEX.Other;set=variant|;VRT=8|g' \
    -e 's|;VariantType=MULTIALLELIC_COMPLEX;set=variant|;VRT=8|g' \
    -e 's|;VariantType=MULTIALLELIC_MIXED.Other;set=variant|;VRT=8|g' \
    -e 's|;VariantType=MULTIALLELIC_MIXED;set=variant|;VRT=8|g' \
    -e 's|INFO=<ID=VariantType,Number=1,Type=String,Description="Variant type description">|INFO=<ID=VRT,Number=1,Type=Integer,Description="Variation type,1 - SNV: single nucleotide variation,2 - DIV: deletion/insertion variation,3 - HETEROZYGOUS: variable, but undefined at nucleotide level,4 - STR: short tandem repeat (microsatellite) variation, 5 - NAMED: insertion/deletion variation of named repetitive element,6 - NO VARIATON: sequence scanned for variation, but none observed,7 - MIXED: cluster contains submissions from 2 or more allelic classes (not used),8 - MNV: multiple nucleotide variation with alleles of common length greater than 1,9 - Exception">|g' naa.basic.f1.${vcf} > format.naa.basic.f1.${vcf}

## Re-add GATK variant type for completeness and VCF indexing.

GenomeAnalysisTK -R ${refseq} \
    -T VariantAnnotator \
    -V format.naa.basic.f1.${vcf} \
    -A VariantType \
    -o dbSNP.${vcf}
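Since the sed substitutions depend on exact VariantType strings, it may be worth tallying the VRT codes afterwards to confirm that no record was left unconverted. This is a rough sketch only; the function name `vrt_summary` is hypothetical, and it assumes INFO is the eighth tab-separated column.

```shell
# Hypothetical helper: read a VCF on stdin, print each VRT code with its record count,
# plus a "missing" row for records that carry no VRT tag.
vrt_summary() {
    awk -F'\t' '
        /^#/ { next }                               # ignore header lines
        {
            if (match($8, /(^|;)VRT=[0-9]+/)) {
                tag = substr($8, RSTART, RLENGTH)
                sub(/^;/, "", tag)                  # strip the leading separator
                count[tag]++
            } else {
                count["missing"]++
            }
        }
        END { for (k in count) print k "\t" count[k] }' | sort
}
```

For example, `vrt_summary < "format.naa.basic.f1.${vcf}"` would list the codes present in the reformatted file; a "missing" row suggests a VariantType value that none of the sed expressions matched.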