Validation of VCF file generated from GATK pipeline
2.1 years ago

Dear Biostarians,

Using GATK, I successfully created a VCF file and now need to validate it. GATK itself provides a command for this:

  gatk ValidateVariants \
   -R ref.fasta \
   -V input.vcf \
   --dbsnp dbsnp.vcf

I am confused about which dbSNP file to pass to "--dbsnp" here: the latest GCF_000001405.39.gz or 00-All.vcf.gz. There are also many other human VCF files, for example in the archive folder and the GATK folder, which makes the choice even less clear. The two candidates are:

https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz or

https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-All.vcf.gz

My data was processed with the GRCh38 reference genome.

If there are any other ways to validate a VCF file, please let me know.

Thanks in advance.

VCF dbsnp GATK • 1.3k views
2.1 years ago
GenoMax 142k

From the README file:

RefSNP VCF files for GRC (Genome Reference Consortium) human assembly 37 (GCF_000001405.25) and 38 (GCF_000001405.39). Files are compressed by bgzip and with the tabix index.

So use https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz


Thanks GenoMax for your reply. I will download the file you recommended and get back to you soon :)


Hello GenoMax, I ran the validation but got errors about incompatible contigs. I am attaching my command and terminal output.

Command I used:

  gatk ValidateVariants \
   -R /media/lab/Lab/GRCh38genome/Homo_sapiens.GRCh38.dna.toplevelfiltered.fa \
   -V /media/lab/Lab10TB/0VCF/02h1WGS/05dbsnp/rawsnpsbwa.vcf.gz \
   --dbsnp GCF_000001405.39.gz

Terminal output: [attachment not preserved]


Based on the name, it appears that you are using the toplevel file from Ensembl. This file contains haplotypes etc. and is generally not needed for normal data analysis. That is probably one reason for the error. The other is that the chromosome designations in the RefSeq SNP file may not match those in your reference, which is likely if you are using the toplevel file. See: Why is human genome FASTA file on GENCODE much smaller than that on ENSEMBL?
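A quick way to confirm a naming mismatch before running GATK is to compare the contig names in the reference index with those declared in the VCF header. The following is only a minimal sketch in Python: the helper names are made up for illustration, and it assumes the reference has a samtools-style .fai index and that the VCF header contains ##contig lines (not every VCF does).

```python
import re


def fai_contigs(fai_lines):
    """Contig names from a FASTA index (.fai): the first tab-separated column."""
    return [line.split("\t")[0] for line in fai_lines if line.strip()]


def vcf_header_contigs(vcf_lines):
    """Contig IDs from ##contig=<ID=...> header lines; stops at the first record."""
    names = []
    for line in vcf_lines:
        if not line.startswith("#"):
            break  # header is over, no need to scan the variant records
        match = re.match(r"##contig=<ID=([^,>]+)", line)
        if match:
            names.append(match.group(1))
    return names


def contig_mismatch(reference_names, vcf_names):
    """Return (names only in the VCF, names only in the reference), both sorted."""
    ref, vcf = set(reference_names), set(vcf_names)
    return sorted(vcf - ref), sorted(ref - vcf)
```

Both helpers take any iterable of text lines, so a bgzip-compressed VCF can be read with gzip.open(path, "rt") and the .fai file with a plain open(path). If the first list returned by contig_mismatch is non-empty, the VCF uses contig names that the reference does not know about.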


I extracted only chromosomes 1-22, X, Y and MT from the toplevel reference genome, so haplotypes are not the problem here. Only the chromosome names differ: for example, chromosome 1 is named NC_000001.11 in the dbSNP file. How can I make this file compatible, or is there another way to validate a VCF file against dbSNP data?


Changing the chromosome names in either file is going to be a big task, so perhaps you could use the VCF files provided by Ensembl: http://ftp.ensembl.org/pub/current_variation/vcf/homo_sapiens/ . They may have matching chromosome names. Be sure to check the README in that directory to see whether these files will work.
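If renaming still turns out to be necessary, the name map itself is small. Below is a rough Python sketch, not a tested pipeline step: the function names are hypothetical, and it assumes the usual RefSeq convention that NC_000001 to NC_000022 are chromosomes 1-22, NC_000023 is X, NC_000024 is Y and NC_012920 is the mitochondrial genome. Any other accession (alternate loci, unplaced scaffolds) is skipped.

```python
import re

# Assumption: RefSeq chromosome numbers 23 and 24 stand for X and Y.
_SEX_CHROMOSOMES = {23: "X", 24: "Y"}


def refseq_to_ensembl(accession):
    """Ensembl-style name for a RefSeq chromosome accession, or None if unknown."""
    base = accession.split(".")[0]  # drop the version suffix, e.g. ".11"
    if base == "NC_012920":
        return "MT"
    match = re.fullmatch(r"NC_0000(\d\d)", base)
    if match is None:
        return None
    number = int(match.group(1))
    if 1 <= number <= 22:
        return str(number)
    return _SEX_CHROMOSOMES.get(number)


def build_rename_map(accessions):
    """Text of an 'old_name<TAB>new_name' map, one pair per line."""
    lines = []
    for accession in accessions:
        name = refseq_to_ensembl(accession)
        if name is not None:
            lines.append(f"{accession}\t{name}\n")
    return "".join(lines)
```

The resulting text can be written to a file and, as far as I know, passed to bcftools annotate --rename-chrs to rename the contigs in the dbSNP VCF; the output should be re-indexed afterwards and its contig names checked against the reference.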
