Question

VCF simplify INFO

0

Entering edit mode

9.2 years ago

cristina_sabiers ▴ 110

Hi!

I would like to know if it is any way to simplify my INFO from my vcf file

For example instead have all of this

AF=0.305085;AO=17;DP=126;FAO=36;FDP=118;FR=.,HEALED;FRO=51;FSAF=2;FSAR=34;FSRF=35;FSRR=16;FWDB=-0.00471499;FXX=0.106053;HRUN=2;
LEN=1;MLLD=28.477;QD=9.93622;RBI=0.11082;REFB=-0.0194848;REVB=0.110719;RO=44;SAF=1;SAR=16;SRF=33;SRR=11;SSEN=0;SSEP=0;SSSB=-0.761851;STB=0.926095;STBP=0;
TYPE=snp;VARB=0.0379246;OID=.;OPOS=3421642;OREF=C;OALT=A;OMAPALT=A;FUNC=[{'transcript':'NM_001409.3','gene':'MEGF6','location':'intronic'}];SF=0

I want just type=snp gene and location

Thanks!

vcf • 5.1k views

ADD COMMENT • link updated 7.7 years ago by kirannbishwa01 ★ 1.6k • written 9.2 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

These sites are too general but also may be helpful:

https://help.basespace.illumina.com/articles/descriptive/vcf-files/

https://www.bioconductor.org/packages/devel/bioc/vignettes/VariantAnnotation/inst/doc/filterVcf.pdf

http://gatkforums.broadinstitute.org/wdl/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it

ADD REPLY • link 9.2 years ago by natasha.sernova ★ 4.0k

0

Entering edit mode

thanks for the links, always nice to have to get more infor **

ADD REPLY • link 9.2 years ago by cristina_sabiers ▴ 110

1

Entering edit mode

9.2 years ago

natasha.sernova ★ 4.0k

See these posts:

SNP to gene mapping into reference genome

What are SNP effects? in terms of SNPeff output file

ADD COMMENT • link 9.2 years ago by natasha.sernova ★ 4.0k

1

Entering edit mode

7.7 years ago

kirannbishwa01 ★ 1.6k

Try this tool for simplify VCF:

https://github.com/everestial/vcf_simplify

ADD COMMENT • link 7.7 years ago by kirannbishwa01 ★ 1.6k

GouthamAtla · Accepted Answer · 2016-08-22

2

Entering edit mode

9.2 years ago

GouthamAtla 12k

import pysam,sys

myvcf=pysam.VariantFile(sys.argv[1],"r")
vcf_out = pysam.VariantFile(sys.argv[2], 'w', header=myvcf.header)

for variant in myvcf:
        for key in variant.info.keys():
                if not (key=="TYPE" or key=="FUNC"):
                        del variant.info[key]
        vcf_out.write(variant)

Assumes single sample VCF. Modify according to your requirement. You need to work on changing the FUNC values to desired output. It should be straight forward if you know a bit of python.

usage: python script.py in.vcf out.vcf

ADD COMMENT • link 9.2 years ago by GouthamAtla 12k

0

Entering edit mode

thanks Goutham, I used python but only once to call virus sequence from unmapped files XD but was all done...I will try and see if I can manage to learn...

ADD REPLY • link 9.2 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

THANKS!!!! WORKS LIKE A CHARM!!!!!! **

Hope one day I can learn Phyton

ADD REPLY • link 9.2 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

I think I need a bit more help...

How I can change FUNC values? so they just show out gene names and not the rest like I have now?

FUNC=[{'transcript':'NM_152486.2','gene':'SAMD11','location':'intronic'}]

Thanks...I tried to figure out how to do it...but no way :(

ADD REPLY • link updated 9.2 years ago by GouthamAtla 12k • written 9.2 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

Can you post the output of following command ?

grep "^#" in.vcf  | grep "FUNC"

To change the value of INFO field, we need to see what format it is in.

ADD REPLY • link 9.2 years ago by GouthamAtla 12k

0

Entering edit mode

##INFO=<ID=FUNC,Number=.,Type=String,Description="Functional Annotations">

I get that :) Thanks for the help

ADD REPLY • link 9.2 years ago by cristina_sabiers ▴ 110

1

Entering edit mode

It can be more elegant, but for simplicity, assuming your 'FUNC' always follow same pattern,

import pysam,sys

myvcf=pysam.VariantFile(sys.argv[1],"r")
vcf_out = pysam.VariantFile(sys.argv[2], 'w', header=myvcf.header)

for variant in myvcf:
        variant.info['FUNC']=variant.info['FUNC'][1].split(":")[1]
        for key in variant.info.keys():
                if not (key=="TYPE" or key=="FUNC"):
                        del variant.info[key]
        vcf_out.write(variant)

The output will be:

FUNC='MEGF6';TYPE='snp'

ADD REPLY • link 9.2 years ago by GouthamAtla 12k

0

Entering edit mode

Thanks a lot Goutham!!!

I tried just now after wake up :P

I get half of it, sometimes appear the gen other another values, I also get this error message

python INFO.py 8.vcf outwork.vcf
Traceback (most recent call last):
File "INFO.py", line 7, in <module>
variant.info['FUNC']=variant.info['FUNC'][1].split(":")[1]
 File "pysam/cbcf.pyx", line 2204, in pysam.cbcf.VariantRecordInfo.__getitem__ (pysam/cbcf.c:29799)
 KeyError: 'Invalid INFO field: FUNC'

a

chr1    871334  .   G   T   1330.9  PASS    TYPE=snp;FUNC='SAMD11'  GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:36:149:149:4:4:145:145:0.973154:55:90:2:2:55:90:2:2
chr1    874815  .   TCCCCCT CTCCCCT,TCCCCTC 104.84  PASS    TYPE=mnp,mnp;FUNC='74.0'    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/2:24:14:14:1:0:3,3:5,5:0.357143,0.357143:3,3:0,0:0:1:3,5:2,0:0:0
chr1    876499  .   A   G   1538.18 PASS    TYPE=snp;FUNC='SAMD11'  GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:73:146:163:0:0:143:163:1:41:102:0:0:52:111:0:0
chr1    877715  .   C   G   149.75  PASS    TYPE=snp;FUNC='SAMD11'  GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:3:18:18:1:1:17:17:0.944444:10:7:0:1:10:7:0:1
chr1    878314  .   G   C   127.32  PASS    TYPE=snp;FUNC='NM_152486.2' GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   0/1:99:53:52:25:25:27:27:0.519231:14:13:15:10:14:13:15:10
chr1    880238  .   A   G   940.71  PASS    TYPE=snp;FUNC='SAMD11'  GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:44:98:98:0:0:98:98:1:69:29:0:0:69:29:0:0
chr1    881627  .   G   A   1840.5  PASS    TYPE=snp;FUNC='NM_015658.3' GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:59:201:204:3:4:196:200:0.980392:96:100:2:1:98:102:3:1

ADD REPLY • link updated 9.2 years ago by GouthamAtla 12k • written 9.2 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

so there are variants without FUNC ?

ADD REPLY • link 9.2 years ago by GouthamAtla 12k

0

Entering edit mode

in this case the original is like this:

These files I got them...I dont understand why sometimes appear so many genes? is that normal?

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  
chr1    68928   .   T   <CNV>   100.0   PASS    PRECISE=FALSE;SVTYPE=CNV;END=1262556;LEN=1193628;NUMTILES=615;CONFIDENCE=0;PRECISION=82.7875;FUNC=[{'gene':'OR4F5'},{'gene':'SAMD11'},{'gene':'NOC2L'},{'gene':'KLHL17'},{'gene':'PLEKHN1'},{'gene':'HES4'},{'gene':'ISG15'},{'gene':'AGRN'},{'gene':'RNF223'},{'gene':'C1orf159'},{'gene':'TTLL10'},{'gene':'TNFRSF18'},{'gene':'TNFRSF4'},{'gene':'SDF4'},{'gene':'B3GALT6'},{'gene':'FAM132A'},{'gene':'UBE2J2'},{'gene':'SCNN1D'},{'gene':'ACAP3'},{'gene':'MIR6726'},{'gene':'PUSL1'},{'gene':'CPSF3L'},{'gene':'MIR6727'},{'gene':'GLTPD1'}]  GT:GQ:CN    ./.:0:2
chr1    871334  .   G   T   1330.9  PASS    AF=0.973154;AO=145;DP=149;FAO=145;FDP=149;FR=.;FRO=4;FSAF=90;FSAR=55;FSRF=2;FSRR=2;FWDB=0.0349465;FXX=0;HRUN=2;LEN=1;MLLD=53.9245;QD=35.7289;RBI=0.0349532;REFB=-0.0020043;REVB=6.80506E-4;RO=4;SAF=90;SAR=55;SRF=2;SRR=2;SSEN=0;SSEP=0;SSSB=0.0065273;STB=0.503434;STBP=0.677;TYPE=snp;VARB=2.72043E-4;OID=.;OPOS=871334;OREF=G;OALT=T;OMAPALT=T;FUNC=[{'transcript':'NM_152486.2','gene':'SAMD11','location':'intronic'}] GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:36:149:149:4:4:145:145:0.973154:55:90:2:2:55:90:2:2
chr1    874815  .   TCCCCCT CTCCCCT,TCCCCTC 104.84  PASS    AF=0.357143,0.357143;AO=3,3;DP=14;FAO=5,5;FDP=14;FR=.,HEALED;FRO=0;FSAF=2,0;FSAR=3,5;FSRF=0;FSRR=0;FWDB=0.188436,0.0894205;FXX=0;HRUN=1,5;LEN=2,2;MLLD=17.4459,16.1569;QD=29.9537;RBI=0.229398,0.204689;REFB=0,0;REVB=0.130827,0.184124;RO=1;SAF=0,0;SAR=3,3;SRF=0;SRR=1;SSEN=0,0;SSEP=0,0;SSSB=0,0;STB=0.5,0.5;STBP=1,1;TYPE=mnp,mnp;VARB=0.138405,0.0578427;OID=.,.;OPOS=874815,874820;OREF=TC,CT;OALT=CT,TC;OMAPALT=CTCCCCT,TCCCCTC;FUNC=[{'normalizedRef':'TC','grantham':'74.0','transcript':'NM_152486.2','gene':'SAMD11','proteinPos':'227','location':'exonic','protein':'p.Pro228Ser','sift':'0.68','normalizedPos':'874815','normalizedAlt':'CT','polyphen':'0.98','coding':'c.681_682delTCinsCT','function':['synonymous', 'missense'],'exon':'7'},{'normalizedRef':'CT','grantham':'98.0','transcript':'NM_152486.2','gene':'SAMD11','proteinPos':'229','location':'exonic','protein':'p.Pro229Leu','sift':'0.03','normalizedPos':'874820','normalizedAlt':'TC','polyphen':'0.0','coding':'c.686_687delCTinsTC','function':'missense','exon':'7'}]  GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/2:24:14:14:1:0:3,3:5,5:0.357143,0.357143:3,3:0,0:0:1:3,5:2,0:0:0
chr1    876499  .   A   G   1538.18 PASS    AF=1;AO=143;DP=146;FAO=163;FDP=163;FR=.;FRO=0;FSAF=111;FSAR=52;FSRF=0;FSRR=0;FWDB=0.0205667;FXX=0.00609719;HRUN=1;LEN=1;MLLD=254.082;QD=37.7467;RBI=0.0338055;REFB=0;REVB=0.0268295;RO=0;SAF=102;SAR=41;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=0;STB=0.5;STBP=1;TYPE=snp;VARB=5.20638E-6;OID=.;OPOS=876499;OREF=A;OALT=G;OMAPALT=G;FUNC=[{'transcript':'NM_152486.2','gene':'SAMD11','location':'intronic'}]    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:73:146:163:0:0:143:163:1:41:102:0:0:52:111:0:0
chr1    877715  .   C   G   149.75  PASS    AF=0.944444;AO=17;DP=18;FAO=17;FDP=18;FR=.;FRO=1;FSAF=7;FSAR=10;FSRF=0;FSRR=1;FWDB=-0.0331191;FXX=0;HRUN=2;LEN=1;MLLD=59.9555;QD=33.2769;RBI=0.0842929;REFB=0.00735055;REVB=-0.0775139;RO=1;SAF=7;SAR=10;SRF=0;SRR=1;SSEN=0;SSEP=0;SSSB=0.0436347;STB=0.523782;STBP=0.522;TYPE=snp;VARB=-0.00424313;OID=.;OPOS=877715;OREF=C;OALT=G;OMAPALT=G;FUNC=[{'transcript':'NM_152486.2','gene':'SAMD11','location':'intronic'}] GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   1/1:3:18:18:1:1:17:17:0.944444:10:7:0:1:10:7:0:1
chr1    878314  .   G   C   127.32  PASS    AF=0.519231;AO=27;DP=53;FAO=27;FDP=52;FR=.;FRO=25;FSAF=13;FSAR=14;FSRF=15;FSRR=10;FWDB=0.0406788;FXX=0.0188644;HRUN=3;LEN=1;MLLD=110.174;QD=9.79389;RBI=0.042421;REFB=-0.0115464;REVB=0.0120322;RO=25;SAF=13;SAR=14;SRF=15;SRR=10;SSEN=0;SSEP=0;SSSB=-0.101816;STB=0.556776;STBP=0.417;TYPE=snp;VARB=0.0118638;OID=.;OPOS=878314;OREF=G;OALT=C;OMAPALT=C;FUNC=[{'normalizedRef':'G','transcript':'NM_152486.2','gene':'SAMD11','proteinPos':'480','location':'exonic','protein':'p.(=)','normalizedPos':'878314','normalizedAlt':'C','codon':'GGC','coding':'c.1440G>C','function':'synonymous','exon':'11'}]   GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   0/1:99:53:52:25:25:27:27:0.519231:14:13:15:10:14:13:15:10

ADD REPLY • link updated 9.2 years ago by GouthamAtla 12k • written 9.2 years ago by cristina_sabiers ▴ 110

1

Entering edit mode

The FUNC does not look consistent across records. You need to write a small function that parses the FUNC properly and returns appropriate information. I guess the root cause will the tool that annotated the variants and tweaking the parameters of that tool will solve the problem easily.

ADD REPLY • link 9.2 years ago by GouthamAtla 12k

0

Entering edit mode

Ok then...I will see if I can handle just to do it by excel file...Im not a programer so isnt easy for me.

Thanks a lot for the tips! :)))

These people who provides these files could have make more easily!!! ^^"

ADD REPLY • link 9.2 years ago by cristina_sabiers ▴ 110