Question

How to write the variant (allele) information back into a vcf file?

3

Entering edit mode

7.0 years ago

kirannbishwa01 ★ 1.6k

I recently did a haplotype phasing and stitching of the haplotype blocks. I now want to write this information back on the vcf file. I have dealt with reading and mining vcf file, but not writing. Also, the documentation on PyVCF manual isn't very detailed.

So, this is my few datalines from my text file:

contig  pos ref haplotype_block hap_X   hap_Y   log_odds_ratio  My_PG   Sp_PG   My_hap  Sp_hap
2   2969    C   3706    T   C   -1.902  0   1   C   T
2   3222    G   3706    G   C   -   1   0   C   G
2   3416    G   3706    A   G   -   0   1   G   A
2   5207    T   1856    T   A   2.225   0   1   T   A
2   5238    G   1856    C   G   -   1   0   C   G
2   5398    A   1856    A   G   -   0   1   A   G
2   5403    A   1856    A   G   -   0   1   A   G
2   5426    C   1856    C   A   -   0   1   C   A
2   5434    A   1856    A   G   -   0   1   A   G
2   5467    C   1856    A   C   -   1   0   A   C
2   5483    A   1856    A   C   -   0   1   A   C
2   6636    C   892 T   C   2.238   1   0   T   C
2   6740    A   892 A   G   -   0   1   A   G
2   12138   T   892 T   A   -   0   1   T   A
2   12160   G   892 G   A   -   0   1   G   A
2   12237   G   892 G   C   -   0   1   G   C
2   12298   T   892 T   G   -   0   1   T   G
2   12301   T   892 T   A   -   0   1   T   A

I want to > read a vcf file > match the contig and position > then write the new data into the FORMAT and SAMPLE fields.

Instead of replacing/updating the old tag values I want to create new tag in FORMAT field and insert the corresponding value in SAMPLE field:

haplotype_block as HB
log_of_odds_ratio as LOD
phased_genotype as PG and insert the My_PG|Sp_PG
phased_allele as PA and insert My_hap|Sp_hap

I read the template as:

import vcf

# reader template to burrow
vcf_read = vcf.Reader(open('RNA_seq.test.vcf'))

# file to write
vcf_write = vcf.Writer(open('/dev/null', 'w'), vcf_file)

# now read and update the record
for record in vcf_read:
    sample_id = record.genotype('2ms04h')
    vcf_write.write_record()
    # How to proceed from here?

I want to update the record for sample named 2ms04h which is in the vcf_read
How can I read the data from above text file (chr, pos) and insert it into vcf_write.
Can I write a new file name for vcf_write.

The pyvcf manual isn't very succint on this matter and no example hits from google. Any help appreciated.

Thanks,

vcf SNP allele genome next-gen • 5.0k views

ADD COMMENT • link updated 6.9 years ago by Petr Ponomarenko ★ 2.8k • written 7.0 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

I haven't done this myself, as I don't typically ever write out to VCF files, just process existing and do something with them, or add them to a database to work with. I would think that you need to form the header information yourself for a valid VCF and output it. Any changes you want for the FORMAT fields I think you'll have to do by changing the VCF.model record itself while looping before you output it with VCF.Write().

ADD REPLY • link 7.0 years ago by DG 7.3k

0

Entering edit mode

Do you have to use pyvcf?

Could you please give a snippet of your input vcf file and a few lines of desired output.

Is it multisample vcf file? If so, are you planning to add the same type of information for other samples later?

Thank you

ADD REPLY • link 6.9 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

hi @Petr: It doesn't have to be pyvcf, but since I had been using this for mining the data I thought I better stick with it. I will shortly share the a short snippet of my vcf file and another snippet of the haplotype file (the information I want to write back on vcf).

Thanks much !

ADD REPLY • link 6.9 years ago by kirannbishwa01 ★ 1.6k

1

Entering edit mode

HI petr,

Here is a little information from my.vcf except the main header:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  2ms01e  2ms02g  2ms03g  2ms04h
2   422 .   C   T   584.10  PASS    AC=0;AF=0.00;AN=0;DP=0;ExcessHet=3.0103;FS=0.000;MQ=58.40;QD=32.45;SOR=0.693;set=Intersection   GT:AD:DP:PG:PL:PW   ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./.
2   475 .   A   G   748.10  PASS    AC=0;AF=0.00;AN=0;DP=0;ExcessHet=3.0103;FS=0.000;MQ=60.00;QD=28.15;SOR=1.230;set=Intersection   GT:AD:DP:PG:PL:PW   ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./.
2   620 .   G   C   156.37  PASS    AC=0;AF=0.00;AN=0;DP=0;ExcessHet=3.0103;FS=0.000;MQ=60.00;QD=35.37;SOR=0.693;set=Intersection   GT:AD:DP:PG:PL:PW   ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./.
2   626 .   C   G   141.11  PASS    AC=0;AF=0.00;AN=0;DP=0;ExcessHet=3.0103;FS=0.000;MQ=60.00;QD=30.78;SOR=0.693;set=Intersection   GT:AD:DP:PG:PL:PW   ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./.
2   630 .   A   T   99.10   PASS    AC=0;AF=0.00;AN=0;BaseQRankSum=0.674;ClippingRankSum=0.00;DP=0;ExcessHet=3.0103;FS=0.000;MQ=60.00;MQRankSum=0.00;QD=24.78;ReadPosRankSum=0.319;SOR=0.446;set=Intersection   GT:AD:DP:PG:PL:PW   ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./.
2   631 .   G   GT  90.06   PASS    AC=0;AF=0.00;AN=0;BaseQRankSum=1.15;ClippingRankSum=0.00;DP=0;ExcessHet=3.0103;FS=0.000;MQ=60.00;MQRankSum=0.00;QD=22.51;ReadPosRankSum=0.319;SOR=0.446;set=Intersection    GT:AD:DP:PG:PL:PW   ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./. ./.:0,0:0:./.:0,0,0:./.

And, say this is my haplotype file:

**haplotype_file.txt** for **sample 2ms04h** is = 

contig  pos ref    My|Sp
2   422 C   C|T
2   475 G   C|G
2   620 G   G|A
2   630 G   C|G
2   631 A   A|G

So, I want to add a field My|Sp in the FORMAT column and write the corresponding Genotype info. While doing so, I will compare the alleles in My|Sp with all alleles (ref+alt) from the my.vcf file.

If C is ref and T is alt (numbered 2) for the position 422. I want write My|Sp as 0|2 in the sample field. For the positions that have no Haplotype (My|Sp) information we can simply update genotype as ./. or N|N.

Thanks,

ADD REPLY • link 6.9 years ago by kirannbishwa01 ★ 1.6k

1

Entering edit mode

Thank you, kirannbishwa01. Could you please make manually desired output snipped that corresponds to your example as I do not understand it yet. In fact, it is even more confusing now. Based on your last comment it apears to me that you want to use the second file with local block phasing to update the first file with genotypes of different samples. In particular, you want to use My|Sp to convert it to proper GT:GT in the vcf and update it. So for the second line, you want to change GT:GT=./. for sample 2ms04h to GT:GT=0|1. Unfortunately, this somewhat contradicts your topic question where you mention My_hap|Sp_hap and other things. Could you please edit your questions and make them consistent and with good representative example of input and output in snippets? Thank you

ADD REPLY • link 6.9 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

Yeah, I'm less clear now on what they want to accomplish as well. Looking into this a bit more with the power of google, this has been an open discussion on PyVCF's github issue tracker since 2013.

https://github.com/jamescasbon/PyVCF/issues/82

There are some bits in there and links to forks that may help in addressing the issue, but it would be a bit of a hack job. Seems like there have been some open PRs that were never merged into the main project.

ADD REPLY • link 6.9 years ago by DG 7.3k

0

Entering edit mode

@Dan Gaston: I have read all the issues, Q/A under pyvcf module on Github. Actually, I also navigated through several forks to find the answers. I also put a question in there, but no response so far (2 weeks).

One of the forks had a pratial fix of adding a new field under FORMAT, but it isn't merged in the master module. I had to copy the code and paste in my ubuntu python repo (pyvcf). One thing, I am now able to fix is that, I can add a new FIELD under FORMAT and its corresponding value under SAMPLE, but it only works if I an calculating something from the same line. If I want to read some values from another vcf or say a haplotype files like mine it won't be possible (if possible won't be easy). I am going to see how @Petr can help me.

Thanks,

ADD REPLY • link 6.9 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

As I said, dear kirannbishwa01, the snippets you added contradict your main topic description and does not have desired output. It is hard to help you when I do not understand the task. Maybe other's here can help or you can clarify your questions. If you can make everything more coherent by using "edit" button, I would greatly appreciate it. Thank you. Petr

ADD REPLY • link 6.9 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

Why exactly is it not possible/easy? If you can add a field under format and its value under sample I don't see what you are now missing?

ADD REPLY • link 6.9 years ago by DG 7.3k