Question: VCF simplify INFO
0
gravatar for cristina_sabiers
2.1 years ago by
Spain
cristina_sabiers40 wrote:

Hi!

I would like to know if it is any way to simplify my INFO from my vcf file

For example instead have all of this

AF=0.305085;AO=17;DP=126;FAO=36;FDP=118;FR=.,HEALED;FRO=51;FSAF=2;FSAR=34;FSRF=35;FSRR=16;FWDB=-0.00471499;FXX=0.106053;HRUN=2;
LEN=1;MLLD=28.477;QD=9.93622;RBI=0.11082;REFB=-0.0194848;REVB=0.110719;RO=44;SAF=1;SAR=16;SRF=33;SRR=11;SSEN=0;SSEP=0;SSSB=-0.761851;STB=0.926095;STBP=0;
TYPE=snp;VARB=0.0379246;OID=.;OPOS=3421642;OREF=C;OALT=A;OMAPALT=A;FUNC=[{'transcript':'NM_001409.3','gene':'MEGF6','location':'intronic'}];SF=0

I want just type=snp gene and location

Thanks!

vcf • 1.1k views
ADD COMMENTlink modified 7 months ago by kirannbishwa01860 • written 2.1 years ago by cristina_sabiers40

These sites are too general but also may be helpful:

https://help.basespace.illumina.com/articles/descriptive/vcf-files/

https://www.bioconductor.org/packages/devel/bioc/vignettes/VariantAnnotation/inst/doc/filterVcf.pdf

http://gatkforums.broadinstitute.org/wdl/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by natasha.sernova3.0k

thanks for the links, always nice to have to get more infor **

ADD REPLYlink written 2.1 years ago by cristina_sabiers40
2
gravatar for geek_y
2.1 years ago by
geek_y8.7k
geek_y8.7k wrote:
import pysam,sys

myvcf=pysam.VariantFile(sys.argv[1],"r")
vcf_out = pysam.VariantFile(sys.argv[2], 'w', header=myvcf.header)

for variant in myvcf:
        for key in variant.info.keys():
                if not (key=="TYPE" or key=="FUNC"):
                        del variant.info[key]
        vcf_out.write(variant)

Assumes single sample VCF. Modify according to your requirement. You need to work on changing the FUNC values to desired output. It should be straight forward if you know a bit of python.

usage: python script.py in.vcf out.vcf

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by geek_y8.7k

thanks Goutham, I used python but only once to call virus sequence from unmapped files XD but was all done...I will try and see if I can manage to learn...

ADD REPLYlink written 2.1 years ago by cristina_sabiers40

THANKS!!!! WORKS LIKE A CHARM!!!!!! **

Hope one day I can learn Phyton

ADD REPLYlink written 2.1 years ago by cristina_sabiers40

I think I need a bit more help...

How I can change FUNC values? so they just show out gene names and not the rest like I have now?

FUNC=[{'transcript':'NM_152486.2','gene':'SAMD11','location':'intronic'}]

Thanks...I tried to figure out how to do it...but no way :(

ADD REPLYlink modified 2.1 years ago by geek_y8.7k • written 2.1 years ago by cristina_sabiers40

Can you post the output of following command ?

grep "^#" in.vcf  | grep "FUNC"

To change the value of INFO field, we need to see what format it is in.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by geek_y8.7k

##INFO=<ID=FUNC,Number=.,Type=String,Description="Functional Annotations">

I get that :) Thanks for the help

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by cristina_sabiers40
1

It can be more elegant, but for simplicity, assuming your 'FUNC' always follow same pattern,

import pysam,sys

myvcf=pysam.VariantFile(sys.argv[1],"r")
vcf_out = pysam.VariantFile(sys.argv[2], 'w', header=myvcf.header)

for variant in myvcf:
        variant.info['FUNC']=variant.info['FUNC'][1].split(":")[1]
        for key in variant.info.keys():
                if not (key=="TYPE" or key=="FUNC"):
                        del variant.info[key]
        vcf_out.write(variant)

The output will be:

FUNC='MEGF6';TYPE='snp'
ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by geek_y8.7k

Thanks a lot Goutham!!!

I tried just now after wake up :P

I get half of it, sometimes appear the gen other another values, I also get this error message

python INFO.py 8.vcf outwork.vcf
Traceback (most recent call last):
File "INFO.py", line 7, in <module>
variant.info['FUNC']=variant.info['FUNC'][1].split(":")[1]
 File "pysam/cbcf.pyx", line 2204, in pysam.cbcf.VariantRecordInfo.__getitem__ (pysam/cbcf.c:29799)
 KeyError: 'Invalid INFO field: FUNC'

a

chr1    871334  .   G   T   1330.9  PASS    TYPE=snp;FUNC='SAMD11'  GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:36:149:149:4:4:145:145:0.973154:55:90:2:2:55:90:2:2
chr1    874815  .   TCCCCCT CTCCCCT,TCCCCTC 104.84  PASS    TYPE=mnp,mnp;FUNC='74.0'    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/2:24:14:14:1:0:3,3:5,5:0.357143,0.357143:3,3:0,0:0:1:3,5:2,0:0:0
chr1    876499  .   A   G   1538.18 PASS    TYPE=snp;FUNC='SAMD11'  GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:73:146:163:0:0:143:163:1:41:102:0:0:52:111:0:0
chr1    877715  .   C   G   149.75  PASS    TYPE=snp;FUNC='SAMD11'  GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:3:18:18:1:1:17:17:0.944444:10:7:0:1:10:7:0:1
chr1    878314  .   G   C   127.32  PASS    TYPE=snp;FUNC='NM_152486.2' GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   0/1:99:53:52:25:25:27:27:0.519231:14:13:15:10:14:13:15:10
chr1    880238  .   A   G   940.71  PASS    TYPE=snp;FUNC='SAMD11'  GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:44:98:98:0:0:98:98:1:69:29:0:0:69:29:0:0
chr1    881627  .   G   A   1840.5  PASS    TYPE=snp;FUNC='NM_015658.3' GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:59:201:204:3:4:196:200:0.980392:96:100:2:1:98:102:3:1
ADD REPLYlink modified 2.1 years ago by geek_y8.7k • written 2.1 years ago by cristina_sabiers40

so there are variants without FUNC ?

ADD REPLYlink written 2.1 years ago by geek_y8.7k

in this case the original is like this:

These files I got them...I dont understand why sometimes appear so many genes? is that normal?

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  
chr1    68928   .   T   <CNV>   100.0   PASS    PRECISE=FALSE;SVTYPE=CNV;END=1262556;LEN=1193628;NUMTILES=615;CONFIDENCE=0;PRECISION=82.7875;FUNC=[{'gene':'OR4F5'},{'gene':'SAMD11'},{'gene':'NOC2L'},{'gene':'KLHL17'},{'gene':'PLEKHN1'},{'gene':'HES4'},{'gene':'ISG15'},{'gene':'AGRN'},{'gene':'RNF223'},{'gene':'C1orf159'},{'gene':'TTLL10'},{'gene':'TNFRSF18'},{'gene':'TNFRSF4'},{'gene':'SDF4'},{'gene':'B3GALT6'},{'gene':'FAM132A'},{'gene':'UBE2J2'},{'gene':'SCNN1D'},{'gene':'ACAP3'},{'gene':'MIR6726'},{'gene':'PUSL1'},{'gene':'CPSF3L'},{'gene':'MIR6727'},{'gene':'GLTPD1'}]  GT:GQ:CN    ./.:0:2
chr1    871334  .   G   T   1330.9  PASS    AF=0.973154;AO=145;DP=149;FAO=145;FDP=149;FR=.;FRO=4;FSAF=90;FSAR=55;FSRF=2;FSRR=2;FWDB=0.0349465;FXX=0;HRUN=2;LEN=1;MLLD=53.9245;QD=35.7289;RBI=0.0349532;REFB=-0.0020043;REVB=6.80506E-4;RO=4;SAF=90;SAR=55;SRF=2;SRR=2;SSEN=0;SSEP=0;SSSB=0.0065273;STB=0.503434;STBP=0.677;TYPE=snp;VARB=2.72043E-4;OID=.;OPOS=871334;OREF=G;OALT=T;OMAPALT=T;FUNC=[{'transcript':'NM_152486.2','gene':'SAMD11','location':'intronic'}] GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:36:149:149:4:4:145:145:0.973154:55:90:2:2:55:90:2:2
chr1    874815  .   TCCCCCT CTCCCCT,TCCCCTC 104.84  PASS    AF=0.357143,0.357143;AO=3,3;DP=14;FAO=5,5;FDP=14;FR=.,HEALED;FRO=0;FSAF=2,0;FSAR=3,5;FSRF=0;FSRR=0;FWDB=0.188436,0.0894205;FXX=0;HRUN=1,5;LEN=2,2;MLLD=17.4459,16.1569;QD=29.9537;RBI=0.229398,0.204689;REFB=0,0;REVB=0.130827,0.184124;RO=1;SAF=0,0;SAR=3,3;SRF=0;SRR=1;SSEN=0,0;SSEP=0,0;SSSB=0,0;STB=0.5,0.5;STBP=1,1;TYPE=mnp,mnp;VARB=0.138405,0.0578427;OID=.,.;OPOS=874815,874820;OREF=TC,CT;OALT=CT,TC;OMAPALT=CTCCCCT,TCCCCTC;FUNC=[{'normalizedRef':'TC','grantham':'74.0','transcript':'NM_152486.2','gene':'SAMD11','proteinPos':'227','location':'exonic','protein':'p.Pro228Ser','sift':'0.68','normalizedPos':'874815','normalizedAlt':'CT','polyphen':'0.98','coding':'c.681_682delTCinsCT','function':['synonymous', 'missense'],'exon':'7'},{'normalizedRef':'CT','grantham':'98.0','transcript':'NM_152486.2','gene':'SAMD11','proteinPos':'229','location':'exonic','protein':'p.Pro229Leu','sift':'0.03','normalizedPos':'874820','normalizedAlt':'TC','polyphen':'0.0','coding':'c.686_687delCTinsTC','function':'missense','exon':'7'}]  GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/2:24:14:14:1:0:3,3:5,5:0.357143,0.357143:3,3:0,0:0:1:3,5:2,0:0:0
chr1    876499  .   A   G   1538.18 PASS    AF=1;AO=143;DP=146;FAO=163;FDP=163;FR=.;FRO=0;FSAF=111;FSAR=52;FSRF=0;FSRR=0;FWDB=0.0205667;FXX=0.00609719;HRUN=1;LEN=1;MLLD=254.082;QD=37.7467;RBI=0.0338055;REFB=0;REVB=0.0268295;RO=0;SAF=102;SAR=41;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=0;STB=0.5;STBP=1;TYPE=snp;VARB=5.20638E-6;OID=.;OPOS=876499;OREF=A;OALT=G;OMAPALT=G;FUNC=[{'transcript':'NM_152486.2','gene':'SAMD11','location':'intronic'}]    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:73:146:163:0:0:143:163:1:41:102:0:0:52:111:0:0
chr1    877715  .   C   G   149.75  PASS    AF=0.944444;AO=17;DP=18;FAO=17;FDP=18;FR=.;FRO=1;FSAF=7;FSAR=10;FSRF=0;FSRR=1;FWDB=-0.0331191;FXX=0;HRUN=2;LEN=1;MLLD=59.9555;QD=33.2769;RBI=0.0842929;REFB=0.00735055;REVB=-0.0775139;RO=1;SAF=7;SAR=10;SRF=0;SRR=1;SSEN=0;SSEP=0;SSSB=0.0436347;STB=0.523782;STBP=0.522;TYPE=snp;VARB=-0.00424313;OID=.;OPOS=877715;OREF=C;OALT=G;OMAPALT=G;FUNC=[{'transcript':'NM_152486.2','gene':'SAMD11','location':'intronic'}] GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:3:18:18:1:1:17:17:0.944444:10:7:0:1:10:7:0:1
chr1    878314  .   G   C   127.32  PASS    AF=0.519231;AO=27;DP=53;FAO=27;FDP=52;FR=.;FRO=25;FSAF=13;FSAR=14;FSRF=15;FSRR=10;FWDB=0.0406788;FXX=0.0188644;HRUN=3;LEN=1;MLLD=110.174;QD=9.79389;RBI=0.042421;REFB=-0.0115464;REVB=0.0120322;RO=25;SAF=13;SAR=14;SRF=15;SRR=10;SSEN=0;SSEP=0;SSSB=-0.101816;STB=0.556776;STBP=0.417;TYPE=snp;VARB=0.0118638;OID=.;OPOS=878314;OREF=G;OALT=C;OMAPALT=C;FUNC=[{'normalizedRef':'G','transcript':'NM_152486.2','gene':'SAMD11','proteinPos':'480','location':'exonic','protein':'p.(=)','normalizedPos':'878314','normalizedAlt':'C','codon':'GGC','coding':'c.1440G>C','function':'synonymous','exon':'11'}]   GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   0/1:99:53:52:25:25:27:27:0.519231:14:13:15:10:14:13:15:10
ADD REPLYlink modified 2.1 years ago by geek_y8.7k • written 2.1 years ago by cristina_sabiers40
1

The FUNC does not look consistent across records. You need to write a small function that parses the FUNC properly and returns appropriate information. I guess the root cause will the tool that annotated the variants and tweaking the parameters of that tool will solve the problem easily.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by geek_y8.7k

Ok then...I will see if I can handle just to do it by excel file...Im not a programer so isnt easy for me.

Thanks a lot for the tips! :)))

These people who provides these files could have make more easily!!! ^^"

ADD REPLYlink written 2.1 years ago by cristina_sabiers40
1
gravatar for natasha.sernova
2.1 years ago by
natasha.sernova3.0k
natasha.sernova3.0k wrote:

See these posts:

SNP to gene mapping into reference genome

What are SNP effects? in terms of SNPeff output file

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by natasha.sernova3.0k
1
gravatar for kirannbishwa01
7 months ago by
United States
kirannbishwa01860 wrote:

Try this tool for simplify VCF:

https://github.com/everestial/vcf_simplify

ADD COMMENTlink written 7 months ago by kirannbishwa01860
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1814 users visited in the last hour