High-confidence germline SNPs and indels for VQSR with reference genome GRCh37
15 months ago
nhaus ▴ 210


I am currently building a germline variant calling pipeline with GATK. One step is variant quality score recalibration (VQSR). For this I need high-confidence SNPs and indels to train the model.

GATK offers these SNP and indel resources for the latest reference genome, but not for the one I am using (GRCh37). I read that the VCF files from the 1000 Genomes Project contain only high-confidence germline variant calls, and I think they might be suitable for my purpose.

So my question is whether any of you know where I could download a VCF file that contains all of the SNPs and indels from phase 3 of the 1000 Genomes Project.

Is it possible to download the individual chromosomes from here and then combine them? I am afraid that this resource does not contain only "high confidence" variants. I suspect this because combining these VCF files would result in a gigantic VCF file, whereas the GRCh38 "gold standard high confidence SNP" VCF from GATK is only 7 GB uncompressed.
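To make the "download per chromosome, then combine" idea concrete, here is a minimal Python sketch. It assumes every per-chromosome file carries the same header and sample columns and that the inputs are passed in the desired chromosome order. The function names and file names are hypothetical, and in practice a dedicated tool such as `bcftools concat` is probably the better choice. The sketch writes an uncompressed VCF and does no filtering, so it says nothing about whether the variants are "high confidence".

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip/bgzip-compressed VCF in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def concat_vcfs(inputs, output):
    """Concatenate per-chromosome VCFs into one uncompressed VCF.

    Header lines (starting with '#') are copied from the first file only.
    Assumes all inputs share the same header and column layout.
    Returns the number of variant records written.
    """
    n_records = 0
    with open(output, "wt") as out:
        for i, path in enumerate(inputs):
            with open_text(path) as handle:
                for line in handle:
                    if line.startswith("#"):
                        if i == 0:
                            out.write(line)  # keep the first header only
                        continue
                    out.write(line)
                    n_records += 1
    return n_records
```

A call could look like `concat_vcfs(["chr1.vcf.gz", "chr2.vcf.gz"], "all.vcf")`, where the file names are placeholders for whatever the per-chromosome downloads are actually called.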

I would be very grateful for any suggestions or links where I can download the data that I am looking for.


INDEL SNP 1000g GRCh37 • 649 views

This is great, thank you very much.

Will it be a problem that GATK used the b37 reference genome and I am using hs37d5.fa (GRCh37)? The included contigs are quite different.

Do you by any chance also know where I can get something similar for the indels?
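On the contig question, one way to see how far the two references actually differ is to compare the contig names declared in the resource VCF header with those in the reference's `.fai` index. The sketch below is only an illustration: the helper names are made up, it assumes the VCF header contains `##contig=<ID=...>` lines, and it compares names only, not contig lengths or ordering, which may also matter.

```python
import re

# Matches header lines such as: ##contig=<ID=1,length=249250621>
CONTIG_RE = re.compile(r"^##contig=<ID=([^,>]+)")


def vcf_contigs(header_lines):
    """Return contig IDs declared in ##contig header lines, in file order."""
    contigs = []
    for line in header_lines:
        if not line.startswith("##"):
            break  # reached the #CHROM line or the first record
        match = CONTIG_RE.match(line)
        if match:
            contigs.append(match.group(1))
    return contigs


def fai_contigs(fai_lines):
    """Return contig names from a .fai index (first tab-separated column)."""
    return [line.split("\t", 1)[0].strip() for line in fai_lines if line.strip()]


def compare_contigs(vcf_names, ref_names):
    """Report which contig names are shared and which occur on one side only."""
    vcf_set, ref_set = set(vcf_names), set(ref_names)
    return {
        "shared": sorted(vcf_set & ref_set),
        "only_in_vcf": sorted(vcf_set - ref_set),
        "only_in_reference": sorted(ref_set - vcf_set),
    }
```

With both files opened as text, `compare_contigs(vcf_contigs(vcf_handle), fai_contigs(fai_handle))` returns the three lists, which makes it easy to see whether the differences are limited to extra contigs on one side.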

