VCF simplify INFO
3
0
Entering edit mode
7.7 years ago

Hi!

I would like to know if it is any way to simplify my INFO from my vcf file

For example instead have all of this

AF=0.305085;AO=17;DP=126;FAO=36;FDP=118;FR=.,HEALED;FRO=51;FSAF=2;FSAR=34;FSRF=35;FSRR=16;FWDB=-0.00471499;FXX=0.106053;HRUN=2;
LEN=1;MLLD=28.477;QD=9.93622;RBI=0.11082;REFB=-0.0194848;REVB=0.110719;RO=44;SAF=1;SAR=16;SRF=33;SRR=11;SSEN=0;SSEP=0;SSSB=-0.761851;STB=0.926095;STBP=0;
TYPE=snp;VARB=0.0379246;OID=.;OPOS=3421642;OREF=C;OALT=A;OMAPALT=A;FUNC=[{'transcript':'NM_001409.3','gene':'MEGF6','location':'intronic'}];SF=0

I want just type=snp gene and location

Thanks!

vcf • 3.6k views
ADD COMMENT
0
Entering edit mode

thanks for the links, always nice to have to get more infor **

ADD REPLY
2
Entering edit mode
7.7 years ago
import pysam,sys

myvcf=pysam.VariantFile(sys.argv[1],"r")
vcf_out = pysam.VariantFile(sys.argv[2], 'w', header=myvcf.header)

for variant in myvcf:
        for key in variant.info.keys():
                if not (key=="TYPE" or key=="FUNC"):
                        del variant.info[key]
        vcf_out.write(variant)

Assumes single sample VCF. Modify according to your requirement. You need to work on changing the FUNC values to desired output. It should be straight forward if you know a bit of python.

usage: python script.py in.vcf out.vcf

ADD COMMENT
0
Entering edit mode

thanks Goutham, I used python but only once to call virus sequence from unmapped files XD but was all done...I will try and see if I can manage to learn...

ADD REPLY
0
Entering edit mode

THANKS!!!! WORKS LIKE A CHARM!!!!!! **

Hope one day I can learn Phyton

ADD REPLY
0
Entering edit mode

I think I need a bit more help...

How I can change FUNC values? so they just show out gene names and not the rest like I have now?

FUNC=[{'transcript':'NM_152486.2','gene':'SAMD11','location':'intronic'}]

Thanks...I tried to figure out how to do it...but no way :(

ADD REPLY
0
Entering edit mode

Can you post the output of following command ?

grep "^#" in.vcf  | grep "FUNC"

To change the value of INFO field, we need to see what format it is in.

ADD REPLY
0
Entering edit mode

##INFO=<ID=FUNC,Number=.,Type=String,Description="Functional Annotations">

I get that :) Thanks for the help

ADD REPLY
1
Entering edit mode

It can be more elegant, but for simplicity, assuming your 'FUNC' always follow same pattern,

import pysam,sys

myvcf=pysam.VariantFile(sys.argv[1],"r")
vcf_out = pysam.VariantFile(sys.argv[2], 'w', header=myvcf.header)

for variant in myvcf:
        variant.info['FUNC']=variant.info['FUNC'][1].split(":")[1]
        for key in variant.info.keys():
                if not (key=="TYPE" or key=="FUNC"):
                        del variant.info[key]
        vcf_out.write(variant)

The output will be:

FUNC='MEGF6';TYPE='snp'
ADD REPLY
0
Entering edit mode

Thanks a lot Goutham!!!

I tried just now after wake up :P

I get half of it, sometimes appear the gen other another values, I also get this error message

python INFO.py 8.vcf outwork.vcf
Traceback (most recent call last):
File "INFO.py", line 7, in <module>
variant.info['FUNC']=variant.info['FUNC'][1].split(":")[1]
 File "pysam/cbcf.pyx", line 2204, in pysam.cbcf.VariantRecordInfo.__getitem__ (pysam/cbcf.c:29799)
 KeyError: 'Invalid INFO field: FUNC'

a

chr1    871334  .   G   T   1330.9  PASS    TYPE=snp;FUNC='SAMD11'  GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:36:149:149:4:4:145:145:0.973154:55:90:2:2:55:90:2:2
chr1    874815  .   TCCCCCT CTCCCCT,TCCCCTC 104.84  PASS    TYPE=mnp,mnp;FUNC='74.0'    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/2:24:14:14:1:0:3,3:5,5:0.357143,0.357143:3,3:0,0:0:1:3,5:2,0:0:0
chr1    876499  .   A   G   1538.18 PASS    TYPE=snp;FUNC='SAMD11'  GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:73:146:163:0:0:143:163:1:41:102:0:0:52:111:0:0
chr1    877715  .   C   G   149.75  PASS    TYPE=snp;FUNC='SAMD11'  GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:3:18:18:1:1:17:17:0.944444:10:7:0:1:10:7:0:1
chr1    878314  .   G   C   127.32  PASS    TYPE=snp;FUNC='NM_152486.2' GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   0/1:99:53:52:25:25:27:27:0.519231:14:13:15:10:14:13:15:10
chr1    880238  .   A   G   940.71  PASS    TYPE=snp;FUNC='SAMD11'  GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:44:98:98:0:0:98:98:1:69:29:0:0:69:29:0:0
chr1    881627  .   G   A   1840.5  PASS    TYPE=snp;FUNC='NM_015658.3' GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:59:201:204:3:4:196:200:0.980392:96:100:2:1:98:102:3:1
ADD REPLY
0
Entering edit mode

so there are variants without FUNC ?

ADD REPLY
0
Entering edit mode

in this case the original is like this:

These files I got them...I dont understand why sometimes appear so many genes? is that normal?

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  
chr1    68928   .   T   <CNV>   100.0   PASS    PRECISE=FALSE;SVTYPE=CNV;END=1262556;LEN=1193628;NUMTILES=615;CONFIDENCE=0;PRECISION=82.7875;FUNC=[{'gene':'OR4F5'},{'gene':'SAMD11'},{'gene':'NOC2L'},{'gene':'KLHL17'},{'gene':'PLEKHN1'},{'gene':'HES4'},{'gene':'ISG15'},{'gene':'AGRN'},{'gene':'RNF223'},{'gene':'C1orf159'},{'gene':'TTLL10'},{'gene':'TNFRSF18'},{'gene':'TNFRSF4'},{'gene':'SDF4'},{'gene':'B3GALT6'},{'gene':'FAM132A'},{'gene':'UBE2J2'},{'gene':'SCNN1D'},{'gene':'ACAP3'},{'gene':'MIR6726'},{'gene':'PUSL1'},{'gene':'CPSF3L'},{'gene':'MIR6727'},{'gene':'GLTPD1'}]  GT:GQ:CN    ./.:0:2
chr1    871334  .   G   T   1330.9  PASS    AF=0.973154;AO=145;DP=149;FAO=145;FDP=149;FR=.;FRO=4;FSAF=90;FSAR=55;FSRF=2;FSRR=2;FWDB=0.0349465;FXX=0;HRUN=2;LEN=1;MLLD=53.9245;QD=35.7289;RBI=0.0349532;REFB=-0.0020043;REVB=6.80506E-4;RO=4;SAF=90;SAR=55;SRF=2;SRR=2;SSEN=0;SSEP=0;SSSB=0.0065273;STB=0.503434;STBP=0.677;TYPE=snp;VARB=2.72043E-4;OID=.;OPOS=871334;OREF=G;OALT=T;OMAPALT=T;FUNC=[{'transcript':'NM_152486.2','gene':'SAMD11','location':'intronic'}] GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:36:149:149:4:4:145:145:0.973154:55:90:2:2:55:90:2:2
chr1    874815  .   TCCCCCT CTCCCCT,TCCCCTC 104.84  PASS    AF=0.357143,0.357143;AO=3,3;DP=14;FAO=5,5;FDP=14;FR=.,HEALED;FRO=0;FSAF=2,0;FSAR=3,5;FSRF=0;FSRR=0;FWDB=0.188436,0.0894205;FXX=0;HRUN=1,5;LEN=2,2;MLLD=17.4459,16.1569;QD=29.9537;RBI=0.229398,0.204689;REFB=0,0;REVB=0.130827,0.184124;RO=1;SAF=0,0;SAR=3,3;SRF=0;SRR=1;SSEN=0,0;SSEP=0,0;SSSB=0,0;STB=0.5,0.5;STBP=1,1;TYPE=mnp,mnp;VARB=0.138405,0.0578427;OID=.,.;OPOS=874815,874820;OREF=TC,CT;OALT=CT,TC;OMAPALT=CTCCCCT,TCCCCTC;FUNC=[{'normalizedRef':'TC','grantham':'74.0','transcript':'NM_152486.2','gene':'SAMD11','proteinPos':'227','location':'exonic','protein':'p.Pro228Ser','sift':'0.68','normalizedPos':'874815','normalizedAlt':'CT','polyphen':'0.98','coding':'c.681_682delTCinsCT','function':['synonymous', 'missense'],'exon':'7'},{'normalizedRef':'CT','grantham':'98.0','transcript':'NM_152486.2','gene':'SAMD11','proteinPos':'229','location':'exonic','protein':'p.Pro229Leu','sift':'0.03','normalizedPos':'874820','normalizedAlt':'TC','polyphen':'0.0','coding':'c.686_687delCTinsTC','function':'missense','exon':'7'}]  GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/2:24:14:14:1:0:3,3:5,5:0.357143,0.357143:3,3:0,0:0:1:3,5:2,0:0:0
chr1    876499  .   A   G   1538.18 PASS    AF=1;AO=143;DP=146;FAO=163;FDP=163;FR=.;FRO=0;FSAF=111;FSAR=52;FSRF=0;FSRR=0;FWDB=0.0205667;FXX=0.00609719;HRUN=1;LEN=1;MLLD=254.082;QD=37.7467;RBI=0.0338055;REFB=0;REVB=0.0268295;RO=0;SAF=102;SAR=41;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=0;STB=0.5;STBP=1;TYPE=snp;VARB=5.20638E-6;OID=.;OPOS=876499;OREF=A;OALT=G;OMAPALT=G;FUNC=[{'transcript':'NM_152486.2','gene':'SAMD11','location':'intronic'}]    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:73:146:163:0:0:143:163:1:41:102:0:0:52:111:0:0
chr1    877715  .   C   G   149.75  PASS    AF=0.944444;AO=17;DP=18;FAO=17;FDP=18;FR=.;FRO=1;FSAF=7;FSAR=10;FSRF=0;FSRR=1;FWDB=-0.0331191;FXX=0;HRUN=2;LEN=1;MLLD=59.9555;QD=33.2769;RBI=0.0842929;REFB=0.00735055;REVB=-0.0775139;RO=1;SAF=7;SAR=10;SRF=0;SRR=1;SSEN=0;SSEP=0;SSSB=0.0436347;STB=0.523782;STBP=0.522;TYPE=snp;VARB=-0.00424313;OID=.;OPOS=877715;OREF=C;OALT=G;OMAPALT=G;FUNC=[{'transcript':'NM_152486.2','gene':'SAMD11','location':'intronic'}] GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:3:18:18:1:1:17:17:0.944444:10:7:0:1:10:7:0:1
chr1    878314  .   G   C   127.32  PASS    AF=0.519231;AO=27;DP=53;FAO=27;FDP=52;FR=.;FRO=25;FSAF=13;FSAR=14;FSRF=15;FSRR=10;FWDB=0.0406788;FXX=0.0188644;HRUN=3;LEN=1;MLLD=110.174;QD=9.79389;RBI=0.042421;REFB=-0.0115464;REVB=0.0120322;RO=25;SAF=13;SAR=14;SRF=15;SRR=10;SSEN=0;SSEP=0;SSSB=-0.101816;STB=0.556776;STBP=0.417;TYPE=snp;VARB=0.0118638;OID=.;OPOS=878314;OREF=G;OALT=C;OMAPALT=C;FUNC=[{'normalizedRef':'G','transcript':'NM_152486.2','gene':'SAMD11','proteinPos':'480','location':'exonic','protein':'p.(=)','normalizedPos':'878314','normalizedAlt':'C','codon':'GGC','coding':'c.1440G>C','function':'synonymous','exon':'11'}]   GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   0/1:99:53:52:25:25:27:27:0.519231:14:13:15:10:14:13:15:10
ADD REPLY
1
Entering edit mode

The FUNC does not look consistent across records. You need to write a small function that parses the FUNC properly and returns appropriate information. I guess the root cause will the tool that annotated the variants and tweaking the parameters of that tool will solve the problem easily.

ADD REPLY
0
Entering edit mode

Ok then...I will see if I can handle just to do it by excel file...Im not a programer so isnt easy for me.

Thanks a lot for the tips! :)))

These people who provides these files could have make more easily!!! ^^"

ADD REPLY
1
1
Entering edit mode
6.1 years ago
kirannbishwa01 ★ 1.6k

Try this tool for simplify VCF:

https://github.com/everestial/vcf_simplify

ADD COMMENT

Login before adding your answer.

Traffic: 2668 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6