Greetings. I want to use ContEst from the Broad Institute as one of my quality-control tools for NGS data. However, ContEst requires a population frequency VCF file as input, and its INFO column must contain per-population allele frequencies in the following format: CEU={A*=0.13030, G=0.86970}. The Broad provides VCF files in this format for hg18 and hg19, but I need one for GRCh38.
###hg19.vcf for ContEst
#CHROM POS ID REF ALT QUAL FILTER INFO
1 566875 rs2185539 C T . PASS AC=66;AF=0.02369;ALL={C*=0.97629, T=0.02371};AN=2786;ASW={C*=1.00000, T=0.00000};CEU={C*=1.00000, T=0.00000};CHB={C*=1.00000, T=0.00000};CHD={C*=1.00000, T=0.00000};CHS={C*=0.00000, T=0.00000};CLM={C*=0.00000, T=0.00000};FIN={C*=0.00000, T=0.00000};GBR={C*=0.00000, T=0.00000};GIH={C*=1.00000, T=0.00000};IBS={C*=0.00000, T=0.00000};JPT={C*=1.00000, T=0.00000};LWK={C*=1.00000, T=0.00000};MKK={C*=0.82044, T=0.17956};MXL={C*=1.00000, T=0.00000};PUR={C*=0.00000, T=0.00000};TSI={C*=1.00000, T=0.00000};YRI={C*=0.99752, T=0.00248};set=MKK-YRI GT
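To make the required format concrete, here is a minimal Python sketch that pulls the per-population blocks out of an INFO string like the one above. The function name `parse_population_freqs` is my own, and I am assuming that the asterisk marks the reference allele (it matches the REF column in this record); nothing here comes from ContEst itself.

```python
import re

# Matches blocks such as CEU={C*=1.00000, T=0.00000}; plain tags like AC=66 are skipped.
POP_FIELD = re.compile(r"(\w+)=\{([^}]*)\}")


def parse_population_freqs(info):
    """Return {population: {allele: frequency}} from a ContEst-style INFO string."""
    result = {}
    for population, body in POP_FIELD.findall(info):
        freqs = {}
        for item in body.split(","):
            allele, value = item.strip().split("=")
            # Assumption: a trailing '*' flags the reference allele; drop it from the key.
            freqs[allele.rstrip("*")] = float(value)
        result[population] = freqs
    return result


# Shortened copy of the INFO column from the hg19 record above.
info = ("AC=66;AF=0.02369;ALL={C*=0.97629, T=0.02371};AN=2786;"
        "CEU={C*=1.00000, T=0.00000};MKK={C*=0.82044, T=0.17956};set=MKK-YRI")
freqs = parse_population_freqs(info)
print(freqs["MKK"])
```

The braces contain commas and equals signs of their own, so the INFO column cannot simply be split on `;` and `=` like an ordinary VCF, which is presumably why generic tools reject it.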
First, I tried to lift the hg19 file over to GRCh38 with Picard LiftoverVcf, but it failed because the INFO column is not in the standard format.
Then I tried to lift over the HapMap 3.3 b37 file from the GATK dataset (which I believe the Broad lifted over from b36 to b37 to build the hg19 file) to GRCh38, also with Picard LiftoverVcf, and it failed again for the same reason.
###hapmap3.3 b37
#CHROM POS ID REF ALT QUAL FILTER INFO
1 55299 rs10399749 C . . PASS AN=510
1 55394 rs2949420 T . . PASS AN=178
1 55550 rs2949421 A T . PASS AC=173;AF=0.972;AN=178
Next, I tried to build the population frequency file from the latest VCF released by NCBI, common_all_20150416.vcf, but that raises the question of how to get per-population information from the CAF field.
###common_all_20150416.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO
1 10177 rs367896724 A AC . . RS=367896724;RSPOS=10177;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000020005140026000200;WGT=1;VC=DIV;R5;ASP;VLD;KGPhase3;CAF=0.5747,0.4253;COMMON=1
I am new to this area, so any advice would be helpful. My question is how, or where, I can translate "CAF=0.5747,0.4253" into population-specific frequencies.
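To make the CAF question concrete, here is a minimal Python sketch of the only conversion I can see so far: turning CAF into a single global ALL={...} entry. It assumes CAF lists the reference allele first, followed by the alternate alleles in ALT order, with "." for a missing value. The function name `caf_to_all_field` is my own, and I have not verified that ContEst would accept a file carrying only an ALL entry. It does not give me the per-population values, which is the part I am missing.

```python
def caf_to_all_field(ref, alt, caf):
    """Format a dbSNP CAF value as a ContEst-style ALL={...} entry."""
    alleles = [ref] + alt.split(",")
    values = caf.split(",")
    if len(alleles) != len(values):
        raise ValueError("CAF must have one value per allele (REF plus each ALT)")
    parts = []
    for index, (allele, value) in enumerate(zip(alleles, values)):
        if value == ".":
            continue  # no frequency reported for this allele
        # Assumption: the reference allele is the one marked with '*'.
        marker = "*" if index == 0 else ""
        parts.append(f"{allele}{marker}={float(value):.5f}")
    return "ALL={" + ", ".join(parts) + "}"


# REF, ALT and CAF taken from the common_all_20150416.vcf record above.
print(caf_to_all_field("A", "AC", "0.5747,0.4253"))
# ALL={A*=0.57470, AC=0.42530}
```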
My other option is to lift the original HapMap data over to hg38 myself, which raises a further question. The HapMap 2010 phase 2+3 release has allele and genotype frequencies for only 11 populations, but the ContEst input VCF contains 17 groups. Where can I find the information for the 6 missing groups, and what tool should I use to lift over the HapMap data?
Regards
Edited:
Actually, the Picard liftover function is broken in all versions. I asked the same question on the GATK forum: http://gatkforums.broadinstitute.org/discussion/5625/how-to-build-population-frequency-file-based-on-grch38-for-contest#latest