Question: filling of missing genotype information in merged variant call vcf file
1
gravatar for princy149
10 weeks ago by
princy14960
princy14960 wrote:

hi, I have two conditions of vcf file as one is merged Variant call VCF file of 50 samples and other is individual 50 genomic VCF files, I want to fill "missing genotype "information in merged VCF file using "REF "and "ALT" of individual 50 samples VCF file by using some program as "Perl" etc. so anyone have idea about this.

Thank you,

snp • 356 views
ADD COMMENTlink modified 9 weeks ago • written 10 weeks ago by princy14960

It sounds like you want to do an 'imputation'. Can you confirm this?

ADD REPLYlink written 10 weeks ago by Kevin Blighe25k

hi kevin, thank you for your reply. I have exactly no idea about "Imputation" for this case. here i will explain you my process as firstly i make "variant call vcf files for each 50 samples separately using "bcftool" and after that I merged the all 50 samples "variant call vcf files" but in merged file I got various "missing genotype calls" in samples at different locai look like as" ./.:." so I need to filled these information by using "Ref " and "ALT" information. for this purpose I made "genomic VCF" file for each samples using their "sorted BAM" files and "reference genome fasta " using samtools.

samtools mpileup -v -t AD -u -f  reference fasta  sample sorted.bam –o  sample_mapped.vcf

after making genomic vcf file , I want to create a program using Perl or Python in which I can use 50 samples "genomic VCF" file (as separate files) and 50 Samples "merged variant call vcf file" matched together and create genotype for "missing GT" info in "merged variant call vcf file). please help me for this.

Thank you,

ADD REPLYlink modified 9 weeks ago by finswimmer4.4k • written 10 weeks ago by princy14960

did you look at bcftools annotate function?

ADD REPLYlink written 9 weeks ago by cpad01128.3k

Thank you Pierre for valuable links. I read your written program " https://github.com/lindenb/jvarkit/wiki/FixVcfMissingGenotypes" but is it possible when I can use "genomic VCF files " instead of BAM files and is there any options to make program in Perl or Python or R instead of java. I have condition for making of program for filling missing GT info used both type of vcf files.

ADD REPLYlink written 10 weeks ago by princy14960

ah, I see; you I would first use GATK combinevariant to merge both VCF to get both genotypes for the same variant.

ADD REPLYlink written 10 weeks ago by Pierre Lindenbaum111k

Thank you Pierre for reply.

ADD REPLYlink written 10 weeks ago by princy14960

Is anyone has idea to solve this problem using perl or python like PyVCF etc..

ADD REPLYlink written 10 weeks ago by princy14960

Can you paste a quick example of exactly what you want to do? Just reading textual descriptions is not enough to grasp what one wants, at times.

Paste what you have right now, and then what you would like to have.

ADD REPLYlink written 10 weeks ago by Kevin Blighe25k

hi,kevin

Sorry for my late post.

this is my merged variant call vcf file data. here as you can see many "./.:." GT positions for samples so I want to fill this by using individual sample gVCF data using its "REF/ALT" info for filling of no GT info in merged vcf file.

Chr18   28031863    .   G   A   222.003 PASS VDB=0.851946;SGB=-0.69168;MQSB=1;MQ0F=0;AF1=1;AC1=2;MQ=37;FQ=-83.9864;DP=811;DP4=0,0,430,344   GT:PL   1/1:255,57,0    1/1:255,48,0    1/1:255,75,0    1/1:255,75,0    1/1:255,75,0    1/1:255,54,0    1/1:255,75,0    1/1:255,84,0    1/1:255,51,0    1/1:255,84,0    1/1:255,63,0    1/1:255,75,0    1/1:255,78,0    1/1:255,60,0    1/1:255,57,0    1/1:255,36,0    1/1:255,48,0    1/1:255,78,0    1/1:255,51,0    1/1:255,42,0    1/1:255,36,0    1/1:238,30,0    1/1:255,66,0    1/1:255,51,0    1/1:255,75,0    1/1:255,75,0    1/1:255,48,0    1/1:255,54,0    1/1:255,45,0    1/1:251,30,0    1/1:255,72,0    1/1:255,39,0    1/1:255,39,0    1/1:255,81,0    1/1:255,54,0    1/1:255,48,0    1/1:255,54,0    1/1:255,33,0    1/1:255,42,0    1/1:252,33,0    1/1:255,48,0    1/1:60,3,0
Chr18   28031890    .   T   C   222 PASS    VDB=0.356154;SGB=-0.683931;MQSB=1;MQ0F=0;AF1=1;AC1=2;MQ=37;FQ=-65.9862;|;DP=15;DP4=0,0,5,8  GT:PL   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   1/1:255,39,0    ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.
ADD REPLYlink modified 9 weeks ago by Kevin Blighe25k • written 9 weeks ago by princy14960
1

Did you not try Pierre's program: A: Back-filling missing genotypes in merged VCF ?

ADD REPLYlink written 9 weeks ago by Kevin Blighe25k

This is "gVCF " format of one sampple, contains variant and non variant both sits.

Chr01   229 .   G   <*> 0   .   DP=91;I16=48,43,0,0,3625,145089,0,0,3283,119371,0,0,1834,42368,0,0;QS=1,0;MQSB=0.894166;MQ0F=0  PL:AD   0,255,255:91,0
Chr01   230 .   T   <*> 0   .   DP=93;I16=47,44,0,0,3399,129511,0,0,3283,119371,0,0,1811,42201,0,0;QS=1,0;MQSB=0.882348;MQ0F=0  PL:AD   0,255,255:91,0
Chr01   231 .   T   G,<*>   0   .   DP=92;I16=48,41,0,1,3391,130707,32,1024,3221,117377,37,1369,1794,41482,25,625;QS=0.989838,0.010162,0;SGB=-0.379885;RPB=1;MQB=1;MQSB=0.9585;BQB=1;MQ0F=0 PL:AD   0,239,255,255,255,255:89,1,0
Chr01   232 .   T   <*> 0   .   DP=93;I16=48,43,0,0,3506,137206,0,0,3295,120115,0,0,1800,41664,0,0;QS=1,0;MQSB=0.955398;MQ0F=0  PL:AD   0,255,255:91,0
Chr01   233 .   A   <*> 0   .   DP=91;I16=47,42,0,0,3458,136066,0,0,3233,118121,0,0,1808,41814,0,0;QS=1,0;MQSB=0.991392;MQ0F=0  PL:AD   0,255,255:89,0
Chr01   234 .   G   <*> 0   .   DP=91;I16=48,42,0,0,3571,142473,0,0,3258,118746,0,0,1850,42960,0,0;QS=1,0;MQSB=0.9585;MQ0F=0    PL:AD   0,255,255:90,0
Chr01   235 .   A   <*> 0   .   DP=91;I16=49,41,0,0,3558,141784,0,0,3258,118746,0,0,1840,42484,0,0;QS=1,0;MQSB=0.964891;MQ0F=0  PL:AD   0,255,255:90,0
Chr01   236 .   A   <*> 0   .   DP=93;I16=49,44,0,0,3677,146381,0,0,3369,122853,0,0,1866,43104,0,0;QS=1,0;MQSB=0.955969;MQ0F=0  PL:AD   0,255,255:93,0
Chr01   237 .   A   <*> 0   .   DP=93;I16=49,43,0,0,3654,146344,0,0,3332,121484,0,0,1842,42442,0,0;QS=1,0;MQSB=0.958948;MQ0F=0  PL:AD   0,255,255:92,0
Chr01   239 .   A   C,<*>   0   DP=92;I16=47,44,1,0,3625,145029,22,484,3295,120115,37,1369,1864,43082,4,16;QS=0.993299,0.00670119,0;SGB=-0.379885;RPB=1;MQB=1;MQSB=0.952299;BQB=1;MQ0F=0    PL:AD   0,254,255,255,255,255:91,1,0
Chr01   240 .   A   <*> 0   .   DP=91;I16=47,42,0,0,3495,138197,0,0,3221,117377,0,0,1850,42860,0,0;QS=1,0;MQSB=0.954818;MQ0F=0  PL:AD   0,255,255:89,0
Chr01   241 .
ADD REPLYlink modified 9 weeks ago by Kevin Blighe25k • written 9 weeks ago by princy14960

Thank you kevin for your prompt reply. I want to use individual gVCF files instead of Bam files for filling of GT info where it is not found. And also want to use Perl instead of java. This is a condition for me.

ADD REPLYlink written 9 weeks ago by princy14960
1

I do not doubt that this is possible using perl or some customised solution, but you should not heavily restrict yourself to any one particular language. I don't know if there are many Perl programmers on Biostars.

If I get free time, I may do this in AWK or Python for you, but I am also pressed for time.

ADD REPLYlink written 9 weeks ago by Kevin Blighe25k

Thank you very much kevin for your reply. I will appreciate you for helping me and giving your precious time for this query. looking forward for your solutions for this.

ADD REPLYlink written 9 weeks ago by princy14960

Just to be aware: most of the genotypes that you've listed in the gVCF have an asterisk. These can potentially be removed.

Please read here: What does <*> mean in a vcf file?

ADD REPLYlink written 9 weeks ago by Kevin Blighe25k

A solution for you is to just merge your data with the gVCF using bcftools merge, and use the following flag:

-0  --missing-to-ref               assume genotypes at missing sites are 0/0
ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by Kevin Blighe25k

By the way, here is a simple AWK script for parsing VCFs, in case you want to learn:

awk -F"\t" 'BEGIN {""} FNR==NR && !/^#/ {a[$1$2]=$4; next} !/^#/ {if (a[$1$2]) {print "Position: chr"$1":"$2", reference genotype "a[$1$2]" available in Ref.vcf"} else {print "NOT available"}}' Ref.vcf test.vcf | head

NOT available
NOT available
NOT available
NOT available
NOT available
Position: chr1:65974, reference genotype A available in Ref.vcf
NOT available
NOT available
NOT available
Position: chr1:69428, reference genotype T available in Ref.vcf

It just looks up positions from test.vcf in Ref.vcf. If the positions match, it displays the reference genotype that's available (i.e. the one that could be added to missing genotypes).

To be honest, though, I think that all that you need is the BCFtools command that I mentioned (above)

ADD REPLYlink written 9 weeks ago by Kevin Blighe25k
1

Thank you so much Kevin, you'r really very helpful. I appreciate you very much for your hard work to making solution of my problem.

I will sure Try This.

Thank you again :)

ADD REPLYlink written 9 weeks ago by princy14960

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your reaction but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

Good luck using gvcfs and perl for this purpose, but if you have requirements like that I suggest you figure it out on your own.

ADD REPLYlink written 9 weeks ago by WouterDeCoster31k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 697 users visited in the last hour