Question: parsing vcf file
0
gravatar for miss
4 months ago by
miss0
miss0 wrote:

I have VCF file and part of it looks like this:

;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1826;ExonicFunc.refGene=.;AAChange.refGene=.;

I need to extract Gene.refGene=NONE,DDX11L1 which is between semicolons, I also need to extract ExonicFunc.refGene=. and AAChange.refGene=. which are all also between semicolons.

I tried to do it like this:

import sys
import re


def parse_vcf(vcf_file):
    pattern=re.compile(r'"([^;]*)"' , 'Gene.refGene')
    f=open(vcf_file , 'r')
    for line in f:
        if pattern.search(line):
            continue
    return

if __name__ == '__main__':
    vcf=sys.argv[1]
    parse_vcf(vcf)

but it is not working. thank you for your suggestions.``

python vcf • 334 views
ADD COMMENTlink modified 4 months ago by Ram15k • written 4 months ago by miss0
4
gravatar for Bastien Hervé
4 months ago by
Bastien Hervé1.3k
Limoges, CBRS, France
Bastien Hervé1.3k wrote:
import sys
import re

def parse_vcf(vcf_file):
    ###Open your file
    with open(vcf_file, 'r') as vcf_f:
        for line in vcf_f:
            ###Skip metadata lines
            if line[0] != '#':
                ###Split the line by "tab" to keep info field (for me it's the 8th column, so choose the index = 7, I don't remember if it's always the 8th, change this if needed)
                info_field_line = line.split("\t")[7]
                ###Split the info line by ";"
                info_field_line_array = info_field_line.split(";")
                ###For each line of your VCF, create a dictionnary with this array key : info, value : value of this info
                dict_info={}
                for i in info_field_line_array:
                    ###Just looking for line with "=" character (as key = value)
                    if "=" in i:
                        ###Left from equal sign is key (Gene.refGene, ExonicFunc.refGene...)
                        key = i.split("=")[0]
                        ###Right from equal sign is value (RBL1,synonymous_SNV...)
                        value = i.split("=")[1]
                        ###Put them in a dictionnary
                        dict_info[key]=value
                ###After dictionnary creation, you can have an access to all your features (key), just by their name as follow
                my_GeneRefGene = dict_info['Gene.refGene']
                my_ExonicFuncRefGene = dict_info['ExonicFunc.refGene']
                my_AAChangeRefGene = dict_info['AAChange.refGene']

                print(my_GeneRefGene)
                print(my_ExonicFuncRefGene)
                print(my_AAChangeRefGene)
                ###This is the result for the first line, you can save the data in array to use it later or process line by line as you wish

if __name__ == '__main__':
    vcf=sys.argv[1]
    parse_vcf(vcf)
ADD COMMENTlink modified 3 months ago • written 4 months ago by Bastien Hervé1.3k

Thank you so much it is perfect! And I understand, thank you for comments also!!

ADD REPLYlink written 4 months ago by miss0
3
gravatar for finswimmer
4 months ago by
finswimmer2.8k
Germany
finswimmer2.8k wrote:

First:

but it is not working

isn't a good error description.

Second:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

:)

I would do something like this:

  1. split each vcf line by the column delimiter (tab)
  2. take the column that contains the information you like to extract and split it by ";"
  3. take the result of (2) and split it by "=". Store it in a dictionary. So you can just return result["Gene.refGene"] or any other value you like for the line.

fin swimmer

ADD COMMENTlink written 4 months ago by finswimmer2.8k

Thank you very much and sorry for bad error description

ADD REPLYlink written 4 months ago by miss0
1
gravatar for ATpoint
4 months ago by
ATpoint4.4k
Germany
ATpoint4.4k wrote:

You want to extract from column 8, so the INFO column right? If so, first get vcftools, and then use vcf-query:

vcf-query -f '%INFO/Gene.refGene\n' in.vcf

Same goes for the other information you want. \n is the delimiter in the output, and can be changed into any delim you want.

ADD COMMENTlink written 4 months ago by ATpoint4.4k

Thank you so much, it is working but sadly I have to do it with python script..

ADD REPLYlink written 4 months ago by miss0
1

Then maybe have a look at PyVCF

ADD REPLYlink written 4 months ago by ATpoint4.4k
1
gravatar for erwan.scaon
4 months ago by
erwan.scaon540
Limoges - CBRS - France
erwan.scaon540 wrote:

If you don't stick with Python, I'll suggest awk :

echo ';Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1826;ExonicFunc.refGene=.;AAChange.refGene=.;' > test.txt;
awk -F ";" '{print $2, $4, $5}' test.txt;

Gene.refGene=NONE,DDX11L1 ExonicFunc.refGene=. AAChange.refGene=.

ADD COMMENTlink written 4 months ago by erwan.scaon540

Thank you for suggestion, that way it would be much easier :)

ADD REPLYlink written 4 months ago by miss0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1841 users visited in the last hour