Question: VCF file help
0
gravatar for aurora
7 months ago by
aurora0
Denmark
aurora0 wrote:

Hello Biostars,

I am doing project with vcf files and I need your help. After annotation in Annovar I used python to extract gene names, exonic function and aa change. I want to remove from output genes that do not have exonic function and aa change (ExonicFunc.refGene = . ; AAChange.refGene = . ) and also there are cases in which I have two or three genes i.e. refGene=HNRNPCL1,HNRNPCL3,HNRNPCL4 and multiple transcripts. I want to take just first gene and from its AAChange.refGene in which I have for example HNRNPCL3:NM_001146181:exon1:c.A764G:p.D255G I want just last part (p.D255G). How can I do it with python?

Thank you for your help :)

script python vcf • 401 views
ADD COMMENTlink modified 7 months ago by Bastien Hervé1.8k • written 7 months ago by aurora0
1
gravatar for Bastien Hervé
7 months ago by
Bastien Hervé1.8k
Limoges, CBRS, France
Bastien Hervé1.8k wrote:

Seems to work as your ask for me, this wil create a new vcf file, deleting line with (ExonicFunc.refGene=. AND AAChange.refGene=.). If only ExonicFunc.refGene or AAChange.refGene is not "." line is printed.

Your thread is close to this one ( C: parsing vcf file ).

Managing vcf is a very popular question and the python codes, to process it, are very close. Try to get inspiration from similar thread to do what you want to do.

###Open your vcf file
new_vcf_file = open("your_new_vcf_file.vcf", "a")
with open("your_vcf_to_modif.vcf", 'r') as vcf_f:
    ###Read it line by line
    for line in vcf_f:
        check_up_line=True
        ###Skip metadata informations
        if line[0] != '#':
            ###Split the line by "tab" to keep info field (for me it's the 8th column, so choose the index = 7, I don't remember if it's always the 8th, change this if needed)
            info_field_line = line.split("\t")[7]
            ###Split the info line by ";"
            info_field_line_array = info_field_line.split(";")
            ###For each line of your VCF, create a dictionnary with this array key : info, value : value of this info
            dict_info={}
            for i in info_field_line_array:
                ###Just looking for line with "=" character (as key = value)
                if "=" in i:
                    ###Left from equal sign is key
                    key = i.split("=")[0]
                    ###Right from equal sign is value
                    value = i.split("=")[1]
                    ###Put them in a dictionnary
                    dict_info[key]=value

            ###After dictionnary creation, you can have an access to all your features (key), just by their name as follow

            ###Check first, if your line can be output (pass your filter (ExonicFunc.refGene AND AAChange.refGene not null))
            if dict_info['ExonicFunc.refGene'] == "." and dict_info['AAChange.refGene'] == ".":
                ###If it does not, put the variable at False, then it would not be print in the new vcf file
                check_up_line=False

            ###If AAChange.refGene item is not null (so you have something in it)
            if dict_info['AAChange.refGene'] != ".":
                ###Take the last info of the AAChange.refGene item and replace it in the actual line
                line = line.replace(dict_info['AAChange.refGene'], dict_info['AAChange.refGene'].split(":")[-1])

            ###If Gene.refGene item has many info (more than one)
           if len(dict_info['Gene.refGene'].split(","))>1:
                ###Take the first info of the Gene.refGene item and replace it in the actual line
                line = line.replace(dict_info['Gene.refGene'], dict_info['Gene.refGene'].split(",")[0])

            ###If the line pass your check up, print it
            if check_up_line:
                new_vcf_file.write(line)

        ###Write unchanged metadata
        else:
            new_vcf_file.write(line)
new_vcf_file.close()
ADD COMMENTlink modified 7 months ago • written 7 months ago by Bastien Hervé1.8k

Oh I did't see this related post.. thank you, for helping :) So if I combine these two scripts I should obtain something like:

import sys
import re

def parse_vcf(vcf_file):
    with open(vcf_file, 'r+') as vcf_f:
        for line in vcf_f:
            if line[0] != '#':
                info_field_line = line.split("\t")[7]
                info_field_line_array = info_field_line.split(";")
                dict_info = {}
                for i in info_field_line_array:
                    if "=" in i:
                        key = i.split("=")[0]
                        value = i.split("=")[1]
                        dict_info[key]=value
                my_GeneRefGene = dict_info['Gene.refGene']
                my_ExonicFuncRefGene = dict_info['ExonicFunc.refGene']
                my_AAChangeRefGene = dict_info['AAChange.refGene']
                if dict_info['ExonicFunc.refGene'] == "." and dict_info['AAChange.refGene'] == ".":
                    check_up_line=False
                if dict_info['AAChange.refGene'] != ".":
                    line = line.replace(dict_info['AAChange.refGene'], dict_info['AAChange.refGene'].split(":")[-1])
                if len(dict_info['Gene.refGene'].split(","))>1:
                    line = line.replace(dict_info['Gene.refGene'], dict_info['Gene.refGene'].split(",")[0])
                if check_up_line:
                    vcf_f.write(line)
                else:
                    vcf_f.write(line)

                print(my_GeneRefGene),(my_ExonicFuncRefGene),(my_AAChangeRefGene)

if __name__  == '__main__':
    vcf=sys.argv[1]
    parse_vcf(vcf)

vcf_f.close()

right?

ADD REPLYlink modified 7 months ago • written 7 months ago by aurora0

You do not need to use the script from the other thread, that was just an hint for you to create your own solution. The script I posted here do the job you want.

ADD REPLYlink written 7 months ago by Bastien Hervé1.8k

Sorry I thought that I need to combine them because it is showing me error info_field_line = line.split("\t")[7] IndexError: list index out of range and I tried with other indexes, still the same

ADD REPLYlink written 6 months ago by aurora0

Can you copy/paste some lines of your vcf below please Also add a

print(line)
break

after the line

if line[0] != '#':

and send me the result please

ADD REPLYlink modified 6 months ago • written 6 months ago by Bastien Hervé1.8k

After I add that line the result is just one line:

1 10048 . C CCT . CA VT=INS;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1826;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/0:66:.,0:.:0:.:0 0/1:32:.,2:.:2:.:0

And part of my vcf looks like this (without metadata):

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL PRIMARY

1 10048 . C CCT . CA VT=INS;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1826;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/0:66:.,0:.:0:.:0 0/1:32:.,2:.:2:.:0 1 10078 . CT C . CA VT=DEL;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1795;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/0:25:.,0:.:0:.:0 0/1:13:.,2:.:2:.:0 1 10177 . A AC . CA VT=INS;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1697;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/1:7:.,3:.:1:.:0 0/1:10:.,6:.:1:.:0 1 10212 . A AC . CA VT=INS;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1662;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/0:57:.,0:.:0:.:0 0/1:22:.,2:.:2:.:0

ADD REPLYlink written 6 months ago by aurora0

From which tool did you get this vcf ? How did you open your vcf file ? By default vcf delimiter is "\t" (tabulation), yours seems to be a " " (space).

I tried in this line

info_field_line = line.split("\t")[7]

to split each line of your vcf by the default delimiter which is a "tab" ("\t"), so in your case just replace "\t" by " ".

Something like this :

info_field_line = line.split(" ")[7]

Also, I asked you how you opened your vcf file. Because for me, in your comment, all your results are on the same line (without any carriage return). Your vcf should look like this :

1 10048 . C CCT . CA VT=INS;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1826;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/0:66:.,0:.:0:.:0 0/1:32:.,2:.:2:.:0

1 10078 . CT C . CA VT=DEL;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1795;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/0:25:.,0:.:0:.:0 0/1:13:.,2:.:2:.:0

1 10177 . A AC . CA VT=INS;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1697;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/1:7:.,3:.:1:.:0 0/1:10:.,6:.:1:.:0

etc...

But since the result of print(line) is correct, my solution should work. Just replace "\t" by " " in this line :

info_field_line = line.split("\t")[7]

That should be enough

ADD REPLYlink modified 6 months ago • written 6 months ago by Bastien Hervé1.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 770 users visited in the last hour