import pysam,sys
myvcf=pysam.VariantFile(sys.argv[1],"r")
vcf_out = pysam.VariantFile(sys.argv[2], 'w', header=myvcf.header)
for variant in myvcf:
for key in variant.info.keys():
if not (key=="TYPE" or key=="FUNC"):
del variant.info[key]
vcf_out.write(variant)
Assumes single sample VCF. Modify according to your requirement. You need to work on changing the FUNC values to desired output. It should be straight forward if you know a bit of python.
thanks Goutham, I used python but only once to call virus sequence from unmapped files XD but was all done...I will try and see if I can manage to learn...
It can be more elegant, but for simplicity, assuming your 'FUNC' always follow same pattern,
import pysam,sys
myvcf=pysam.VariantFile(sys.argv[1],"r")
vcf_out = pysam.VariantFile(sys.argv[2], 'w', header=myvcf.header)
for variant in myvcf:
variant.info['FUNC']=variant.info['FUNC'][1].split(":")[1]
for key in variant.info.keys():
if not (key=="TYPE" or key=="FUNC"):
del variant.info[key]
vcf_out.write(variant)
These files I got them...I dont understand why sometimes appear so many genes? is that normal?
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 68928 . T <CNV> 100.0 PASS PRECISE=FALSE;SVTYPE=CNV;END=1262556;LEN=1193628;NUMTILES=615;CONFIDENCE=0;PRECISION=82.7875;FUNC=[{'gene':'OR4F5'},{'gene':'SAMD11'},{'gene':'NOC2L'},{'gene':'KLHL17'},{'gene':'PLEKHN1'},{'gene':'HES4'},{'gene':'ISG15'},{'gene':'AGRN'},{'gene':'RNF223'},{'gene':'C1orf159'},{'gene':'TTLL10'},{'gene':'TNFRSF18'},{'gene':'TNFRSF4'},{'gene':'SDF4'},{'gene':'B3GALT6'},{'gene':'FAM132A'},{'gene':'UBE2J2'},{'gene':'SCNN1D'},{'gene':'ACAP3'},{'gene':'MIR6726'},{'gene':'PUSL1'},{'gene':'CPSF3L'},{'gene':'MIR6727'},{'gene':'GLTPD1'}] GT:GQ:CN ./.:0:2
chr1 871334 . G T 1330.9 PASS AF=0.973154;AO=145;DP=149;FAO=145;FDP=149;FR=.;FRO=4;FSAF=90;FSAR=55;FSRF=2;FSRR=2;FWDB=0.0349465;FXX=0;HRUN=2;LEN=1;MLLD=53.9245;QD=35.7289;RBI=0.0349532;REFB=-0.0020043;REVB=6.80506E-4;RO=4;SAF=90;SAR=55;SRF=2;SRR=2;SSEN=0;SSEP=0;SSSB=0.0065273;STB=0.503434;STBP=0.677;TYPE=snp;VARB=2.72043E-4;OID=.;OPOS=871334;OREF=G;OALT=T;OMAPALT=T;FUNC=[{'transcript':'NM_152486.2','gene':'SAMD11','location':'intronic'}] GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 1/1:36:149:149:4:4:145:145:0.973154:55:90:2:2:55:90:2:2
chr1 874815 . TCCCCCT CTCCCCT,TCCCCTC 104.84 PASS AF=0.357143,0.357143;AO=3,3;DP=14;FAO=5,5;FDP=14;FR=.,HEALED;FRO=0;FSAF=2,0;FSAR=3,5;FSRF=0;FSRR=0;FWDB=0.188436,0.0894205;FXX=0;HRUN=1,5;LEN=2,2;MLLD=17.4459,16.1569;QD=29.9537;RBI=0.229398,0.204689;REFB=0,0;REVB=0.130827,0.184124;RO=1;SAF=0,0;SAR=3,3;SRF=0;SRR=1;SSEN=0,0;SSEP=0,0;SSSB=0,0;STB=0.5,0.5;STBP=1,1;TYPE=mnp,mnp;VARB=0.138405,0.0578427;OID=.,.;OPOS=874815,874820;OREF=TC,CT;OALT=CT,TC;OMAPALT=CTCCCCT,TCCCCTC;FUNC=[{'normalizedRef':'TC','grantham':'74.0','transcript':'NM_152486.2','gene':'SAMD11','proteinPos':'227','location':'exonic','protein':'p.Pro228Ser','sift':'0.68','normalizedPos':'874815','normalizedAlt':'CT','polyphen':'0.98','coding':'c.681_682delTCinsCT','function':['synonymous', 'missense'],'exon':'7'},{'normalizedRef':'CT','grantham':'98.0','transcript':'NM_152486.2','gene':'SAMD11','proteinPos':'229','location':'exonic','protein':'p.Pro229Leu','sift':'0.03','normalizedPos':'874820','normalizedAlt':'TC','polyphen':'0.0','coding':'c.686_687delCTinsTC','function':'missense','exon':'7'}] GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 1/2:24:14:14:1:0:3,3:5,5:0.357143,0.357143:3,3:0,0:0:1:3,5:2,0:0:0
chr1 876499 . A G 1538.18 PASS AF=1;AO=143;DP=146;FAO=163;FDP=163;FR=.;FRO=0;FSAF=111;FSAR=52;FSRF=0;FSRR=0;FWDB=0.0205667;FXX=0.00609719;HRUN=1;LEN=1;MLLD=254.082;QD=37.7467;RBI=0.0338055;REFB=0;REVB=0.0268295;RO=0;SAF=102;SAR=41;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=0;STB=0.5;STBP=1;TYPE=snp;VARB=5.20638E-6;OID=.;OPOS=876499;OREF=A;OALT=G;OMAPALT=G;FUNC=[{'transcript':'NM_152486.2','gene':'SAMD11','location':'intronic'}] GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 1/1:73:146:163:0:0:143:163:1:41:102:0:0:52:111:0:0
chr1 877715 . C G 149.75 PASS AF=0.944444;AO=17;DP=18;FAO=17;FDP=18;FR=.;FRO=1;FSAF=7;FSAR=10;FSRF=0;FSRR=1;FWDB=-0.0331191;FXX=0;HRUN=2;LEN=1;MLLD=59.9555;QD=33.2769;RBI=0.0842929;REFB=0.00735055;REVB=-0.0775139;RO=1;SAF=7;SAR=10;SRF=0;SRR=1;SSEN=0;SSEP=0;SSSB=0.0436347;STB=0.523782;STBP=0.522;TYPE=snp;VARB=-0.00424313;OID=.;OPOS=877715;OREF=C;OALT=G;OMAPALT=G;FUNC=[{'transcript':'NM_152486.2','gene':'SAMD11','location':'intronic'}] GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 1/1:3:18:18:1:1:17:17:0.944444:10:7:0:1:10:7:0:1
chr1 878314 . G C 127.32 PASS AF=0.519231;AO=27;DP=53;FAO=27;FDP=52;FR=.;FRO=25;FSAF=13;FSAR=14;FSRF=15;FSRR=10;FWDB=0.0406788;FXX=0.0188644;HRUN=3;LEN=1;MLLD=110.174;QD=9.79389;RBI=0.042421;REFB=-0.0115464;REVB=0.0120322;RO=25;SAF=13;SAR=14;SRF=15;SRR=10;SSEN=0;SSEP=0;SSSB=-0.101816;STB=0.556776;STBP=0.417;TYPE=snp;VARB=0.0118638;OID=.;OPOS=878314;OREF=G;OALT=C;OMAPALT=C;FUNC=[{'normalizedRef':'G','transcript':'NM_152486.2','gene':'SAMD11','proteinPos':'480','location':'exonic','protein':'p.(=)','normalizedPos':'878314','normalizedAlt':'C','codon':'GGC','coding':'c.1440G>C','function':'synonymous','exon':'11'}] GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/1:99:53:52:25:25:27:27:0.519231:14:13:15:10:14:13:15:10
The FUNC does not look consistent across records. You need to write a small function that parses the FUNC properly and returns appropriate information. I guess the root cause will the tool that annotated the variants and tweaking the parameters of that tool will solve the problem easily.
These sites are too general but also may be helpful:
https://help.basespace.illumina.com/articles/descriptive/vcf-files/
https://www.bioconductor.org/packages/devel/bioc/vignettes/VariantAnnotation/inst/doc/filterVcf.pdf
http://gatkforums.broadinstitute.org/wdl/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it
thanks for the links, always nice to have to get more infor **