parsing vcf file
4
0
Entering edit mode
3.8 years ago
miss • 0

I have VCF file and part of it looks like this:

;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1826;ExonicFunc.refGene=.;AAChange.refGene=.;


I need to extract Gene.refGene=NONE,DDX11L1 which is between semicolons, I also need to extract ExonicFunc.refGene=. and AAChange.refGene=. which are all also between semicolons.

I tried to do it like this:

import sys
import re

def parse_vcf(vcf_file):
pattern=re.compile(r'"([^;]*)"' , 'Gene.refGene')
f=open(vcf_file , 'r')
for line in f:
if pattern.search(line):
continue
return

if __name__ == '__main__':
vcf=sys.argv[1]
parse_vcf(vcf)


but it is not working. thank you for your suggestions.

vcf python • 3.0k views
6
Entering edit mode
3.8 years ago
import sys
import re

def parse_vcf(vcf_file):
with open(vcf_file, 'r') as vcf_f:
for line in vcf_f:
if line[0] != '#':
###Split the line by "tab" to keep info field (for me it's the 8th column, so choose the index = 7, I don't remember if it's always the 8th, change this if needed)
info_field_line = line.split("\t")[7]
###Split the info line by ";"
info_field_line_array = info_field_line.split(";")
###For each line of your VCF, create a dictionnary with this array key : info, value : value of this info
dict_info={}
for i in info_field_line_array:
###Just looking for line with "=" character (as key = value)
if "=" in i:
###Left from equal sign is key (Gene.refGene, ExonicFunc.refGene...)
key = i.split("=")[0]
###Right from equal sign is value (RBL1,synonymous_SNV...)
value = i.split("=")[1]
###Put them in a dictionnary
dict_info[key]=value
###After dictionnary creation, you can have an access to all your features (key), just by their name as follow
my_GeneRefGene = dict_info['Gene.refGene']
my_ExonicFuncRefGene = dict_info['ExonicFunc.refGene']
my_AAChangeRefGene = dict_info['AAChange.refGene']

print(my_GeneRefGene)
print(my_ExonicFuncRefGene)
print(my_AAChangeRefGene)
###This is the result for the first line, you can save the data in array to use it later or process line by line as you wish

if __name__ == '__main__':
vcf=sys.argv[1]
parse_vcf(vcf)

0
Entering edit mode

Thank you so much it is perfect! And I understand, thank you for comments also!!

3
Entering edit mode
3.8 years ago

First:

but it is not working

isn't a good error description.

Second:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

:)

I would do something like this:

1. split each vcf line by the column delimiter (tab)
2. take the column that contains the information you like to extract and split it by ";"
3. take the result of (2) and split it by "=". Store it in a dictionary. So you can just return result["Gene.refGene"] or any other value you like for the line.

fin swimmer

0
Entering edit mode

Thank you very much and sorry for bad error description

1
Entering edit mode
3.8 years ago
ATpoint 55k

You want to extract from column 8, so the INFO column right? If so, first get vcftools, and then use vcf-query:

vcf-query -f '%INFO/Gene.refGene\n' in.vcf


Same goes for the other information you want. \n is the delimiter in the output, and can be changed into any delim you want.

0
Entering edit mode

Thank you so much, it is working but sadly I have to do it with python script..

1
Entering edit mode

Then maybe have a look at PyVCF

1
Entering edit mode
3.8 years ago
erwan.scaon ▴ 830

If you don't stick with Python, I'll suggest awk :

echo ';Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1826;ExonicFunc.refGene=.;AAChange.refGene=.;' > test.txt;
awk -F ";" '{print $2,$4, \$5}' test.txt;


Gene.refGene=NONE,DDX11L1 ExonicFunc.refGene=. AAChange.refGene=.

0
Entering edit mode

Thank you for suggestion, that way it would be much easier :)