Question: How to add custom fields to the VCF files?
1
gravatar for kirannbishwa01
14 months ago by
United States
kirannbishwa01590 wrote:

I need to add a context dependent custom tags in the FORMAT and SAMPLE column of a vcf file.

Here is my sample file:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  2ms01e  
2   5187    .   G   C   2535.90 PASS        GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC   0/1:47,17:64:99:965,0,1925:1|0:.,.,.,.,.,.,.,.:662:|:0.5    
2   5207    .   T   A   2814.90 PASS        GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC   0/1:41,30:71:99:955,0,1628:1|0:.,.,.,.,.,.,.,.:662:|:0.5        
2   5238    .   G   C   8321.94 PASS        GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC   0/1:48,59:107:99:1986,0,1336:0|1:.,.,.,.,.,.,.,.:662:|:0.5      
2   5258    .   T   C   10415.90    PASS        GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC   0/1:75,45:120:99:1369,0,2434:1|0:.,.,.,.,.,.,.,.:662:|:0.5

Step 01: Look for phasing information at a particular variant site. For line #1, GT is 0/1 and PG is 1|0

Step 02: I need to take the information from the PG field (phased genotype) and update the vcf by creating a new field HP (haplotype phase) at the end of the FORMAT. The HP is the default output from GATK RBphasing but my other tool cannot output this field.

The output file should be this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  2ms01e  
2   5187    .   G   C   2535.90 PASS        GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:HP 0/1:47,17:64:99:965,0,1925:1|0:.,.,.,.,.,.,.,.:662:|:0.5:5187-2,5187-1 
2   5207    .   T   A   2814.90 PASS        GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:HP 0/1:41,30:71:99:955,0,1628:1|0:.,.,.,.,.,.,.,.:662:|:0.5:5207-2,5207-1     
2   5238    .   G   C   8321.94 PASS        GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:HP 0/1:48,59:107:99:1986,0,1336:0|1:.,.,.,.,.,.,.,.:662:|:0.5:5238-1,5238-2       
2   5258    .   T   C   10415.90    PASS        GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:HP 0/1:75,45:120:99:1369,0,2434:1|0:.,.,.,.,.,.,.,.:662:|:0.5:5258-2,5258-1

Explanation of the process: 1) We need to look at the PG field first and see if the particular locus is phased with any variants before-after it. If its phased the value for the PG field will have a pipe (|). In the first line the GT (genotype) is 0/1 (heterozygous) and the PG is 1|0.

2) Identify the positions (POS) of the heterozygous site. For the first line its 5187. So, we need to pick this value and connect to value with '-' to the value obtained in step 3. - this steps store the value of the PG field as ':5187-'

3) So, with PG 1|0, the HP field should indicate the alternate allele (1) is reported first with value 2. - this step store the value for the field as ':5178-2'

4) Now, complete the HP field to indicate the value of another allele, with reported phase value as 1. - this step stores the value for the PG field as ':5178-2,5187-1'

So, GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/1:47,17:64:99:965,0,1925:1|0:.,.,.,.,.,.,.,.:662:|:0.5
translates to GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC :HP 0/1:47,17:64:99:965,0,1925:1|0:.,.,.,.,.,.,.,.:662:|:0.5:5187-2,5187-1

Note: - At some site the variant might be 1/2 (or say 1|2 in PG), so this needs to be considered too. - I tried to looks for several variant annotation software including PESEQ, Annnovar, vcftools etc. but not finding any thing that can specifically do this. If there is any such tool that is even more great. - It would be best if the program can be in python with description of what is happening at each step. I am trying to learn python so that would be of great help too.

Thanks much in advance ! - Bishwa K.

ADD COMMENTlink modified 14 months ago by Biostar ♦♦ 20 • written 14 months ago by kirannbishwa01590

At the mean time I am trying to learn pyvcf. But, I would appreciate any programming help specifically in regards to the given question. I would want this solution sooner than later, so.

Thanks,

ADD REPLYlink written 14 months ago by kirannbishwa01590

I provided a template.

ADD REPLYlink written 14 months ago by geek_y8.1k
3
gravatar for dshulgin
14 months ago by
dshulgin250
Belarus
dshulgin250 wrote:

https://github.com/dshulgin2015/computing_python/blob/master/issue_1.py

Works well, if someone else wants to do the same.

ADD COMMENTlink written 14 months ago by dshulgin250
2
gravatar for geek_y
14 months ago by
geek_y8.1k
Barcelona/London
geek_y8.1k wrote:

It should be fairly simple with any VCF parsers. Look at pysam, pyVCF

Here is a template:

First you need to add HP to the header.

import pysam

#read the input file
myvcf=pysam.VariantFile("in.vcf","r")

# Add the HP field to header. Say its a string and can take any no. of values. It depends what format you want to give.
myvcf.header.formats.add("HP",".","String","Some description")

# An example showing how to add 'HP' information. Please explore further.    
for variant in myvcf:
    for sample in variant.samples:
        variant.samples[sample]['HP']="Hola"
    print variant

.

Input: GT:AD:DP:GQ:PL    0/1:12,13:25:99:358,0,318
Output: GT:AD:DP:GQ:PL:HP    0/1:12,13:25:99:358,0,318:Hola
ADD COMMENTlink modified 14 months ago • written 14 months ago by geek_y8.1k

Thanks, but I think I need to add the HP field. There as a tab inserted infront of Hp field in my previous sample data, it was just a typo.

I will try what I can do with pyVCF. I really need to learn to do some python programming.

Thanks,

ADD REPLYlink written 14 months ago by kirannbishwa01590
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1662 users visited in the last hour