Question: Filtering VCF with python
gravatar for miss
10 months ago by
miss0 wrote:

Hi, I extracted informations that I need from my vcf file and one line of it looks like this:

X 126165205 . T G DCAF12L2 synonymous_SNV p.P240P 0.4172 0.4089 0.543046 1/1 25 0 25 1/1 20 0 20

the line represents respectively: chromosome, position, ID, ref, alt, gene name, exonic function, AAchange (just last value), gnomAD_genome_ALL, ExAC_ALL, 1000g2015aug_all, GT, DP, RD, AD first in normal then tumor. Now I need to make a python script which will take as an input this vcf file and a list of mutations (exonic function) that i have in file. Based on which mutation I want it will give me just that line of file. Then I need to put threshold which will be lower then a maximum number that I have among gnomAD_genome_ALL, ExAC_ALL, 1000g2015aug_all. Sum (depth) of RD and AD, ratio of RD and AD and block (tumor or normal). Script has to be general so I can change variables.. Thank you in advance

python vcf • 861 views
ADD COMMENTlink modified 10 months ago by Kevin Blighe35k • written 10 months ago by miss0

what have you written so far ?

ADD REPLYlink written 10 months ago by Pierre Lindenbaum116k

You can rely on commented python scripts made in C: parsing vcf file and A: VCF file help to make your own VCF parser.

ADD REPLYlink written 10 months ago by Bastien Hervé2.8k

This sounds like homework, but PyVCF will help you parse VCF files and get the info you need.

ADD REPLYlink written 10 months ago by jared.andrews071.8k
gravatar for Kevin Blighe
10 months ago by
Kevin Blighe35k
Republic of Ireland
Kevin Blighe35k wrote:

You may take inspiration from this Python script that I wrote last year. This just splits multi-allelic sites in the same way as bcftools norm -m-any. Take a look at what it does and then edit it to suit your own needs.

Execute it as

python [Input VCF] [Output VCF]


#Name:          Kevin Blighe
#Email:         NA
#Date:          20th July 2017
#Function(s):   Splits multiallelels

import sys
import os

#Check the number of command line arguments
if not len(sys.argv)==3:
    print "\nError:\tincorrect number of command-line arguments"
    print "Syntax:\ [Input VCF] [Output VCF]\n"

if sys.argv[1]==sys.argv[2]:
    #boolOverwrite = raw_input("\nInput file is the same as the output file - overwrite input file (sim/nao)?\n")
    print "Error:\tInput file is the same as the output file - choose a different output file\n"

#File input
fileInput = open(sys.argv[1], "r")

#File output for split multi-alleles
fileOutput = open(sys.argv[2], "w")

#Loop through each line in the input file, and split multiallelic sites
print ("Splitting multi-allelic sites...")
for strLine in fileInput:
    #Strip the endline character from each input line
    strLine = strLine.rstrip("\n")

    #The '#' character in VCF format indicates that the line is a header. Ignore these and just output to the new file
    if strLine.startswith("#"):
        fileOutput.write(strLine + "\n")
        #Split the tab-delimited line into an array
        strArray = [splits for splits in strLine.split("\t") if splits is not ""]

        #Check first if it's multiallelic
        #Multi-allelic variants will have 2 calls in the VAR field, separated by a comma (',')
        if "," in strArray[4]:
            strVars = [splits for splits in strArray[4].split(",") if splits is not ""]
            iNumMultialleles = len(strVars)

            for i in range(0, (iNumMultialleles)):
                strArray[4] = strVars[i]
                fileOutputTemp.write("\t".join(strArray) + "\n")
            fileOutput.write("\t".join(strArray) + "\n")
print ("Done.")

#Close the files
ADD COMMENTlink modified 10 months ago • written 10 months ago by Kevin Blighe35k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1616 users visited in the last hour