Question: Filtering VCF with python
gravatar for miss
6 weeks ago by
miss0 wrote:

Hi, I extracted informations that I need from my vcf file and one line of it looks like this:

X 126165205 . T G DCAF12L2 synonymous_SNV p.P240P 0.4172 0.4089 0.543046 1/1 25 0 25 1/1 20 0 20

the line represents respectively: chromosome, position, ID, ref, alt, gene name, exonic function, AAchange (just last value), gnomAD_genome_ALL, ExAC_ALL, 1000g2015aug_all, GT, DP, RD, AD first in normal then tumor. Now I need to make a python script which will take as an input this vcf file and a list of mutations (exonic function) that i have in file. Based on which mutation I want it will give me just that line of file. Then I need to put threshold which will be lower then a maximum number that I have among gnomAD_genome_ALL, ExAC_ALL, 1000g2015aug_all. Sum (depth) of RD and AD, ratio of RD and AD and block (tumor or normal). Script has to be general so I can change variables.. Thank you in advance

python vcf • 184 views
ADD COMMENTlink modified 6 weeks ago by Kevin Blighe17k • written 6 weeks ago by miss0

what have you written so far ?

ADD REPLYlink written 6 weeks ago by Pierre Lindenbaum106k

You can rely on commented python scripts made in C: parsing vcf file and A: VCF file help to make your own VCF parser.

ADD REPLYlink written 6 weeks ago by Bastien Hervé650

This sounds like homework, but PyVCF will help you parse VCF files and get the info you need.

ADD REPLYlink written 6 weeks ago by jared.andrews07680
gravatar for Kevin Blighe
6 weeks ago by
Kevin Blighe17k
University College London Cancer Institute
Kevin Blighe17k wrote:

You may take inspiration from this Python script that I wrote last year. This just splits multi-allelic sites in the same way as bcftools norm -m-any. Take a look at what it does and then edit it to suit your own needs.

Execute it as

python [Input VCF] [Output VCF]


#Name:          Kevin Blighe
#Email:         NA
#Date:          20th July 2017
#Function(s):   Splits multiallelels

import sys
import os

#Check the number of command line arguments
if not len(sys.argv)==3:
    print "\nError:\tincorrect number of command-line arguments"
    print "Syntax:\ [Input VCF] [Output VCF]\n"

if sys.argv[1]==sys.argv[2]:
    #boolOverwrite = raw_input("\nInput file is the same as the output file - overwrite input file (sim/nao)?\n")
    print "Error:\tInput file is the same as the output file - choose a different output file\n"

#File input
fileInput = open(sys.argv[1], "r")

#File output for split multi-alleles
fileOutput = open(sys.argv[2], "w")

#Loop through each line in the input file, and split multiallelic sites
print ("Splitting multi-allelic sites...")
for strLine in fileInput:
    #Strip the endline character from each input line
    strLine = strLine.rstrip("\n")

    #The '#' character in VCF format indicates that the line is a header. Ignore these and just output to the new file
    if strLine.startswith("#"):
        fileOutput.write(strLine + "\n")
        #Split the tab-delimited line into an array
        strArray = [splits for splits in strLine.split("\t") if splits is not ""]

        #Check first if it's multiallelic
        #Multi-allelic variants will have 2 calls in the VAR field, separated by a comma (',')
        if "," in strArray[4]:
            strVars = [splits for splits in strArray[4].split(",") if splits is not ""]
            iNumMultialleles = len(strVars)

            for i in range(0, (iNumMultialleles)):
                strArray[4] = strVars[i]
                fileOutputTemp.write("\t".join(strArray) + "\n")
            fileOutput.write("\t".join(strArray) + "\n")
print ("Done.")

#Close the files
ADD COMMENTlink modified 6 weeks ago • written 6 weeks ago by Kevin Blighe17k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1573 users visited in the last hour