Question

filling of missing genotype information in merged variant call vcf file

1

Entering edit mode

5.9 years ago

princy149 ▴ 80

hi, I have two conditions of vcf file as one is merged Variant call VCF file of 50 samples and other is individual 50 genomic VCF files, I want to fill "missing genotype "information in merged VCF file using "REF "and "ALT" of individual 50 samples VCF file by using some program as "Perl" etc. so anyone have idea about this.

Thank you,

SNP • 6.4k views

ADD COMMENT • link 5.9 years ago by princy149 ▴ 80

2

Entering edit mode

see:

ADD REPLY • link 5.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

It sounds like you want to do an 'imputation'. Can you confirm this?

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

0

Entering edit mode

hi kevin, thank you for your reply. I have exactly no idea about "Imputation" for this case. here i will explain you my process as firstly i make "variant call vcf files for each 50 samples separately using "bcftool" and after that I merged the all 50 samples "variant call vcf files" but in merged file I got various "missing genotype calls" in samples at different locai look like as" ./.:." so I need to filled these information by using "Ref " and "ALT" information. for this purpose I made "genomic VCF" file for each samples using their "sorted BAM" files and "reference genome fasta " using samtools.

samtools mpileup -v -t AD -u -f  reference fasta  sample sorted.bam –o  sample_mapped.vcf

after making genomic vcf file , I want to create a program using Perl or Python in which I can use 50 samples "genomic VCF" file (as separate files) and 50 Samples "merged variant call vcf file" matched together and create genotype for "missing GT" info in "merged variant call vcf file). please help me for this.

Thank you,

ADD REPLY • link updated 5.9 years ago by finswimmer 16k • written 5.9 years ago by princy149 ▴ 80

0

Entering edit mode

did you look at bcftools annotate function?

ADD REPLY • link 5.9 years ago by cpad0112 21k

0

Entering edit mode

Thank you Pierre for valuable links. I read your written program " https://github.com/lindenb/jvarkit/wiki/FixVcfMissingGenotypes" but is it possible when I can use "genomic VCF files " instead of BAM files and is there any options to make program in Perl or Python or R instead of java. I have condition for making of program for filling missing GT info used both type of vcf files.

ADD REPLY • link 5.9 years ago by princy149 ▴ 80

0

Entering edit mode

ah, I see; you I would first use GATK combinevariant to merge both VCF to get both genotypes for the same variant.

ADD REPLY • link 5.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you Pierre for reply.

ADD REPLY • link 5.9 years ago by princy149 ▴ 80

0

Entering edit mode

Is anyone has idea to solve this problem using perl or python like PyVCF etc..

ADD REPLY • link 5.9 years ago by princy149 ▴ 80

0

Entering edit mode

Can you paste a quick example of exactly what you want to do? Just reading textual descriptions is not enough to grasp what one wants, at times.

Paste what you have right now, and then what you would like to have.

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

0

Entering edit mode

hi,kevin

Sorry for my late post.

this is my merged variant call vcf file data. here as you can see many "./.:." GT positions for samples so I want to fill this by using individual sample gVCF data using its "REF/ALT" info for filling of no GT info in merged vcf file.

Chr18   28031863    .   G   A   222.003 PASS VDB=0.851946;SGB=-0.69168;MQSB=1;MQ0F=0;AF1=1;AC1=2;MQ=37;FQ=-83.9864;DP=811;DP4=0,0,430,344   GT:PL   1/1:255,57,0    1/1:255,48,0    1/1:255,75,0    1/1:255,75,0    1/1:255,75,0    1/1:255,54,0    1/1:255,75,0    1/1:255,84,0    1/1:255,51,0    1/1:255,84,0    1/1:255,63,0    1/1:255,75,0    1/1:255,78,0    1/1:255,60,0    1/1:255,57,0    1/1:255,36,0    1/1:255,48,0    1/1:255,78,0    1/1:255,51,0    1/1:255,42,0    1/1:255,36,0    1/1:238,30,0    1/1:255,66,0    1/1:255,51,0    1/1:255,75,0    1/1:255,75,0    1/1:255,48,0    1/1:255,54,0    1/1:255,45,0    1/1:251,30,0    1/1:255,72,0    1/1:255,39,0    1/1:255,39,0    1/1:255,81,0    1/1:255,54,0    1/1:255,48,0    1/1:255,54,0    1/1:255,33,0    1/1:255,42,0    1/1:252,33,0    1/1:255,48,0    1/1:60,3,0
Chr18   28031890    .   T   C   222 PASS    VDB=0.356154;SGB=-0.683931;MQSB=1;MQ0F=0;AF1=1;AC1=2;MQ=37;FQ=-65.9862;|;DP=15;DP4=0,0,5,8  GT:PL   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   1/1:255,39,0    ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.

ADD REPLY • link updated 5.9 years ago by Kevin Blighe 87k • written 5.9 years ago by princy149 ▴ 80

1

Entering edit mode

Did you not try Pierre's program: A: Back-filling missing genotypes in merged VCF ?

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

0

Entering edit mode

This is "gVCF " format of one sampple, contains variant and non variant both sits.

Chr01   229 .   G   <*> 0   .   DP=91;I16=48,43,0,0,3625,145089,0,0,3283,119371,0,0,1834,42368,0,0;QS=1,0;MQSB=0.894166;MQ0F=0  PL:AD   0,255,255:91,0
Chr01   230 .   T   <*> 0   .   DP=93;I16=47,44,0,0,3399,129511,0,0,3283,119371,0,0,1811,42201,0,0;QS=1,0;MQSB=0.882348;MQ0F=0  PL:AD   0,255,255:91,0
Chr01   231 .   T   G,<*>   0   .   DP=92;I16=48,41,0,1,3391,130707,32,1024,3221,117377,37,1369,1794,41482,25,625;QS=0.989838,0.010162,0;SGB=-0.379885;RPB=1;MQB=1;MQSB=0.9585;BQB=1;MQ0F=0 PL:AD   0,239,255,255,255,255:89,1,0
Chr01   232 .   T   <*> 0   .   DP=93;I16=48,43,0,0,3506,137206,0,0,3295,120115,0,0,1800,41664,0,0;QS=1,0;MQSB=0.955398;MQ0F=0  PL:AD   0,255,255:91,0
Chr01   233 .   A   <*> 0   .   DP=91;I16=47,42,0,0,3458,136066,0,0,3233,118121,0,0,1808,41814,0,0;QS=1,0;MQSB=0.991392;MQ0F=0  PL:AD   0,255,255:89,0
Chr01   234 .   G   <*> 0   .   DP=91;I16=48,42,0,0,3571,142473,0,0,3258,118746,0,0,1850,42960,0,0;QS=1,0;MQSB=0.9585;MQ0F=0    PL:AD   0,255,255:90,0
Chr01   235 .   A   <*> 0   .   DP=91;I16=49,41,0,0,3558,141784,0,0,3258,118746,0,0,1840,42484,0,0;QS=1,0;MQSB=0.964891;MQ0F=0  PL:AD   0,255,255:90,0
Chr01   236 .   A   <*> 0   .   DP=93;I16=49,44,0,0,3677,146381,0,0,3369,122853,0,0,1866,43104,0,0;QS=1,0;MQSB=0.955969;MQ0F=0  PL:AD   0,255,255:93,0
Chr01   237 .   A   <*> 0   .   DP=93;I16=49,43,0,0,3654,146344,0,0,3332,121484,0,0,1842,42442,0,0;QS=1,0;MQSB=0.958948;MQ0F=0  PL:AD   0,255,255:92,0
Chr01   239 .   A   C,<*>   0   DP=92;I16=47,44,1,0,3625,145029,22,484,3295,120115,37,1369,1864,43082,4,16;QS=0.993299,0.00670119,0;SGB=-0.379885;RPB=1;MQB=1;MQSB=0.952299;BQB=1;MQ0F=0    PL:AD   0,254,255,255,255,255:91,1,0
Chr01   240 .   A   <*> 0   .   DP=91;I16=47,42,0,0,3495,138197,0,0,3221,117377,0,0,1850,42860,0,0;QS=1,0;MQSB=0.954818;MQ0F=0  PL:AD   0,255,255:89,0
Chr01   241 .

ADD REPLY • link updated 5.9 years ago by Kevin Blighe 87k • written 5.9 years ago by princy149 ▴ 80

0

Entering edit mode

Thank you kevin for your prompt reply. I want to use individual gVCF files instead of Bam files for filling of GT info where it is not found. And also want to use Perl instead of java. This is a condition for me.

ADD REPLY • link 5.9 years ago by princy149 ▴ 80

1

Entering edit mode

I do not doubt that this is possible using perl or some customised solution, but you should not heavily restrict yourself to any one particular language. I don't know if there are many Perl programmers on Biostars.

If I get free time, I may do this in AWK or Python for you, but I am also pressed for time.

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you very much kevin for your reply. I will appreciate you for helping me and giving your precious time for this query. looking forward for your solutions for this.

ADD REPLY • link 5.9 years ago by princy149 ▴ 80

0

Entering edit mode

Just to be aware: most of the genotypes that you've listed in the gVCF have an asterisk. These can potentially be removed.

Please read here: What does <*> mean in a vcf file?

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

0

Entering edit mode

A solution for you is to just merge your data with the gVCF using bcftools merge, and use the following flag:

-0  --missing-to-ref               assume genotypes at missing sites are 0/0

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

0

Entering edit mode

By the way, here is a simple AWK script for parsing VCFs, in case you want to learn:

awk -F"\t" 'BEGIN {""} FNR==NR && !/^#/ {a[$1$2]=$4; next} !/^#/ {if (a[$1$2]) {print "Position: chr"$1":"$2", reference genotype "a[$1$2]" available in Ref.vcf"} else {print "NOT available"}}' Ref.vcf test.vcf | head

NOT available
NOT available
NOT available
NOT available
NOT available
Position: chr1:65974, reference genotype A available in Ref.vcf
NOT available
NOT available
NOT available
Position: chr1:69428, reference genotype T available in Ref.vcf

It just looks up positions from test.vcf in Ref.vcf. If the positions match, it displays the reference genotype that's available (i.e. the one that could be added to missing genotypes).

To be honest, though, I think that all that you need is the BCFtools command that I mentioned (above)

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

1

Entering edit mode

Thank you so much Kevin, you'r really very helpful. I appreciate you very much for your hard work to making solution of my problem.

I will sure Try This.

Thank you again :)

ADD REPLY • link 5.9 years ago by princy149 ▴ 80

0

Entering edit mode

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your reaction but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

Good luck using gvcfs and perl for this purpose, but if you have requirements like that I suggest you figure it out on your own.

ADD REPLY • link 5.9 years ago by WouterDeCoster 47k