Getting allele frequency and zygosity information from several vcf file
1
1
Entering edit mode
3.1 years ago
seta ★ 1.5k

Dear all,

I have multiple vcf file and want to extract allele frequency and the hetero- and homozygous call with their count. Could you please let me know if I should merge all vcf file to get one vcf file, and then extract the required information? If yes, I read CombineVariants from GATK do this job, but never use it, please kindly tell me is it OK for my task in your opinion?

For extracting the required information (allele frequency and the hetero- and homozygous call with their count), I found that convert2annovar.pl from annovar package help to get this information. However, when I tested on the example vcf file containing 3 for getting zygosity information by the below command:

perl convert2annovar.pl ex2.vcf --format vcf4 --allsample --withzyg --outfile file1.vcf

It gave me 3 output files corresponded to 3 samples with hetero and homo information, could you please kindly tell me how I can have all information in one file? Assuming multi-sample vcf file containing 100 samples, so 100 output files will be generated, how we could handle them?

Also, I tried vcftools to obtain the required information, but sounds that it give us just frequency, is any experience with vcftools for getting zygosity information? Any alternative tools and commands suggestion would be highly appreciated.

Thanks

vcf allele frequency zygocity • 1.9k views
1
Entering edit mode
3.1 years ago

This may help, tested on 1000 Genomes Phase 3 variants (total sample n=2504):

bcftools view /Kev/CollegeWork/ReferenceMaterial/1000Genomes/ImputedGenotypes/1000Genomes.Norm.bcf | \
awk -F"\t" 'BEGIN {
print "CHR\tPOS\tID\tREF\tALT\tAltHetCount\tAltHomCount\tRefHomCount"
}
!/^#/ {
print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t" gsub(/0\|1|1\|0|0\/1|1\/0/,"") "\t" gsub(/1\/1|1\|1/,"") "\t" gsub(/0\/0|0\|0/,"") }' | \ head -18 CHR POS ID REF ALT AltHetCount AltHomCount RefHomCount 1 10177 rs367896724 A AC 1490 320 694 1 10235 rs540431307 T TA 6 0 2498 1 10352 rs555500075 T TA 2025 83 396 1 10505 rs548419688 A T 1 0 2503 1 10506 rs568405545 C G 1 0 2503 1 10511 rs534229142 G A 1 0 2503 1 10539 rs537182016 C A 3 0 2501 1 10542 rs572818783 C T 1 0 2503 1 10579 rs538322974 C A 1 0 2503 1 10616 rs376342519 CCGCCGTTGCAAAGGCGCGCCG C 35 2469 0 1 10642 rs558604819 G A 21 0 2483 1 11008 rs575272151 C G 403 19 2082 1 11012 rs544419019 C G 403 19 2082 1 11063 rs561109771 T G 15 0 2489 1 13011 rs574746232 T G 3 0 2501 1 13110 rs540538026 G A 134 0 2370 1 13116 rs62635286 T G 414 36 2054 1 13118 rs200579949 A G 414 36 2054 It counts AltHet as either of: • 0/1 • 1/0 • 1|0 • 0|1 It counts AltHom as either of: • 1/1 • 1|1 It counts RefHom as either of: • 0/0 • 0|0 Other scripts: Kevin ADD COMMENT 0 Entering edit mode Hi Kevin, Thank you so much for your helpful response. First, I normalized vcf file by bcftools norm –m-any and tried your command on the very small vcf file containing just 3 samples and found a strange thing. One of rows in my vcf file is: 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4 Your suggested command calculated the hetero and homo count as: H R POS ID REF ALT AltHetCount AltHomCount RefHomCount 20 1110696 rs6040355 A G 2 0 1 20 1110696 rs6040355 A T 2 1 0 As it is obvious, there is 2 AltHet and 1 AltHom, but based on the above output of the command, 1 genotype was considered as RefHom, which is not correct. Could you please kindly tell me how we can interpret this issue? For obtaining AF and MAF from vcf file, I tried several ways, but none of them useful. I’m so grateful if you have any command /tool to calculate AF and MAF, please kindly share with me. Thanks in advance ADD REPLY 0 Entering edit mode Hey, oh, that's because your organism is non-diploid. Your genotypes (GTs) are non-standard. My script is for diploid organisms. I could modify the script, or you could just try the BCFtools plugin that calculates AF and MAF: A: How to use bcftools to calculate AF INFO field from AC and AN in VCF? ADD REPLY 0 Entering edit mode No, my organism is human, and it is a multiallelic site. so, your script is fit for me, but how we can handle such sites? Thanks a lot for the link, I try it for AF and MAF. ADD REPLY 0 Entering edit mode Ah, I see. The way around that is to split these into separate variant calls with bcftools norm -m-any, and then my script will work. ADD REPLY 0 Entering edit mode Actually, I run the norm command before running the script and as you can find below, the variants splited HR POS ID REF ALT AltHetCount AltHomCount RefHomCount 20 1110696 rs6040355 A G 2 0 1 20 1110696 rs6040355 A T 2 1 0 However, the count in the second row is correct. however, we couldn't check this issue in the vcf file with thousands of variant sits. Have you any suggestions/advice for this issue, please. ADD REPLY 0 Entering edit mode What are the individual records in your VCF after you have split the multi-allelic calls? ADD REPLY 0 Entering edit mode Here, it's the original vcf file at multi-allelic site: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4 As you can find, we have not RefHom in the original vcf file. it's normalized vcf file with individual records for multi-allelic site: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 20 1110696 rs6040355 A G 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|0:21:6:23,27 0|1:2:0:18,2 0/0:35:4:. 20 1110696 rs6040355 A T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 0|1:21:6:23,27 1|0:2:0:18,2 1/1:35:4:. and I posted the output of your command on the normalized vcf file in my previous post. Thank you very much for following the post. ADD REPLY 0 Entering edit mode Thanks for posting. So, my function has calculated it correctly: • For A>G, there are 2 het variants, 0 hom variants, and 1 hom reference • For A>T, there are 2 het variants, 1 hom variant, and 0 hom reference ADD REPLY 0 Entering edit mode No problem. I also try to get this information from gatk (variants to table function), and it returned me 2 AltHet and 1AltHom, like what we can see in the vcf file (original vcf file). Yes, your script is correct based on normalized vcf file. however, I'm confused with real (original) vcf file as homRef really isn't in this file? sorry, but it will be great if you please tell me how we can interpret this discrepancy and how we have homRef in the normalized vcf file? Thanks ADD REPLY 0 Entering edit mode Homozygous reference is inferred by the 0/0 genotype call ADD REPLY 0 Entering edit mode Thanks Kevin, of course, I know it. so sorry, My question is: we have not any ref allele in the real (original) vcf file (all calls are as 1|2, 2|1 and 2/2). So, how it is generated in the normalized vcf file during splitting and counted by your script, what information used for calling this homRef in the normalized vcf file? I'm totally confused as all genotype calls in normalized vcf file differs with the original, could you please kindly tell me if we can say these call (in normalized vcf file) actually exist in the sample? sorry for further questions and many thanks for clearing me ADD REPLY 0 Entering edit mode If we look at the variant call before the normalisation: 20 1110696 rs6040355 A G,T 67 AF=0.333,0.667 1|2 2|1 2/2 This says that, in your 3 samples, there were 2 different variants called at this position: • A>G, with AF=0.333 • A>T, with AF=0.667 It follows, then: • A>G is encoded as GT 1 • A>T is encoded as GT 2 So, your sample genotypes are: • 1|2 = GT • 2|1 = TG • 2/2 = TT ## --------------------------------------------- Your concern, then, about the normalisation is quite valid and reflects a limitation of the VCF format. For example, for the first sample, 1|2 is split into 1|0 and 0|1, which is not reflective of the actual genotype. This is partly why we sometimes just exclude these sites from analyses; indeed, many programs do not even support multi-allelic sites. There is another discussion here about it: https://gatkforums.broadinstitute.org/gatk/discussion/3148/how-to-deal-with-multiallelic-sites-in-the-vcf ADD REPLY 0 Entering edit mode Thank you, Kevin for your explanation and link. so, the multi-allelic site has a long story. However, as I mentioned the gatk (variants to table) correctly computed hetero and homo at these positions, however, this tool doesn't say which samples are homo and which are hetero, it may be a good idea to modify your script for multiallelic sites, I think. ADD REPLY 0 Entering edit mode Thank you. I may modify it later. Today I am working on CPU parallelisation and also have to go to the supermarket. ADD REPLY 0 Entering edit mode May I ask, seta, if you definitively require my help with this?, that is, are you waiting for me to modify my script? ADD REPLY 0 Entering edit mode Hi Kevin, Many thanks for your kind message. I'm going to use GATK for this purpose, however, as I mentioned before, this tool doesn't tell us which samples are homo and which are hetero, unlike your script. So, I'm happy if you please let me know when you modify your script. Kind regards, Seta ADD REPLY 0 Entering edit mode Hi seta, you could try this modified script, if you wish. It is for mono- and bi-allelic records (if you have tri- or quad-allelic, it won't print anything): bcftools view test.vcf.gz | \ awk -F"\t" '{line=$0} BEGIN {
print "CHR\tPOS\tID\tREF\tALT\tAltHetCount\tAltHomCount\tRefHomCount"
} !/^#/ {
if (gsub(/,/, ",", $5)==0) { print$1"\t"$2"\t"$3"\t"$4"\t"$5"\t" gsub(/0\|1|1\|0|0\/1|1\/0/,"") "\t" gsub(/1\/1|1\|1/,"") "\t" gsub(/0\/0|0\|0/,"")
} else if (gsub(/,/, ",", $5)==1) { print$1"\t"$2"\t"$3"\t"$4"\t"$5"\t" gsub(/1\/0|0\/1|1\|0|0\|1|1\|2|2\|1|1\/2|2\/1/,"")","gsub(/2\/0|0\/2|2\|0|0\|2|1\|2|2\|1|1\/2|2\/1/,"",line) "\t" gsub(/1\/1|1\|1/,"")","gsub(/2\/2|2\|2/,"") "\t" gsub(/0\/0|0\|0/,"")
}
}'

CHR POS         ID          REF ALT  AltHetCount    AltHomCount RefHomCount
16  56380495    rs78560610  G   A    1              0           10
16  56380502    rs3811361   G   A    3              0           8
16  56380570    rs3811360   A   G    6              0           5
16  56381544    rs11297695  CT  C    6              2           3
16  56381677    .           A   C    1              0           10
16  56381699    rs11640263  C   T    7              2           2
16  56388821    rs148650019 A   G    1              0           10
16  56389119    rs143245214 T   TTA  3              3           5
16  56389827    rs5817056   CA  C    5              1           5
16  56390685    .           CTT C,CT 4,9            0,0         1
16  56390968    rs34097380  GA  G    4              2           5
16  56396430    rs3748401   C   A    1              0           10
16  56396486    rs4924      C   T    5              4           2
16  56397153    rs2617846   G   A    2              5           3
16  56397174    rs2617847   C   G    2              4           3
16  56921684    .           T   TA,A 3,8            1,1         1

Whilst I have tested this, you should also test it before you are sure that it functions correctly.

1
Entering edit mode

Hi Kevin,

Thanks for backing to me. I tested it on an example vcf file and it worked well on the vcf file without normalization.

Best,

Seta