Question: Creation of a VCF file from scratch with python
4 weeks ago by
carolina.santiago.t0 wrote:

Hello! I want to make a VCF file with a header line like "#CHROM POS REF ALT". Is it possible to create such a VCF file from scratch with Python?

Thanks in advance


You can do this easily with the pandas package.

written 4 weeks ago by shoujun.gu370

Certainly. But what advantage are you going to gain by doing that? Are you trying to simulate data?

written 4 weeks ago by genomax64k

I want to create a header that allows me to save some data that is not in the template CSV. So I want either to create a new CSV that I can use to save those fields, or to manipulate a template in order to add them.

written 4 weeks ago by carolina.santiago.t0

Yes, it is possible in Python, Perl, Java, etc ;)

Please explain what exactly you are trying to do.

written 4 weeks ago by JC7.6k

4 weeks ago by
finswimmer11k
Germany
finswimmer11k wrote:

A VCF file is a plain text file that follows the rules laid out in the VCF specification. As long as you follow these rules, you can create the file however you want.

Be careful: many tools are satisfied if the VCF contains just one header line holding the column names: #CHROM POS ID REF ALT QUAL FILTER INFO. This is not enough for a "real" valid VCF. For that, the header must also include:

  • information about the file format version: ##fileformat=VCFv4.3
  • information about contig length for each contig used in the file, e.g. ##contig=<ID=chr1,length=249250621>
  • information about each key used in the INFO or FORMAT column, e.g.:
    • ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
    • ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

If you take this into account right from the start, you will not have problems using different VCF tools later. bcftools in particular is very strict about the header values.

When working with Python, you could consider using one of the available modules for handling and creating VCF files, such as pysam or cyvcf.
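
As a rough illustration of the header requirements listed above, here is a minimal sketch in plain Python (no pysam or cyvcf). The function names `build_vcf_header` and `write_vcf`, the example record and the `DP=10` value are made up for this example and do not come from any library; the contig length is the chr1 value quoted above.

```python
import io

VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def build_vcf_header(contigs, info_fields):
    """Return the header lines of a minimal VCF as a list of strings.

    contigs:     iterable of (name, length) pairs
    info_fields: iterable of (key, number, type, description) tuples
    """
    lines = ["##fileformat=VCFv4.3"]
    for name, length in contigs:
        lines.append(f"##contig=<ID={name},length={length}>")
    for key, number, type_, description in info_fields:
        lines.append(
            f'##INFO=<ID={key},Number={number},Type={type_},'
            f'Description="{description}">'
        )
    # The column line starts with a single '#' and is tab-separated.
    lines.append("#" + "\t".join(VCF_COLUMNS))
    return lines


def write_vcf(handle, contigs, info_fields, records):
    """Write the header plus one tab-separated line per record."""
    for line in build_vcf_header(contigs, info_fields):
        handle.write(line + "\n")
    for chrom, pos, ref, alt, info in records:
        # ID, QUAL and FILTER are left as '.', the VCF missing value.
        fields = [chrom, str(pos), ".", ref, alt, ".", ".", info]
        handle.write("\t".join(fields) + "\n")


# Hypothetical example data, written to an in-memory buffer.
buffer = io.StringIO()
write_vcf(
    buffer,
    contigs=[("chr1", 249250621)],
    info_fields=[("DP", "1", "Integer", "Raw read depth")],
    records=[("chr1", 12345, "A", "G", "DP=10")],
)
vcf_text = buffer.getvalue()
```

To write to disk instead, pass a handle from `open("out.vcf", "w")`. Running the result through a strict tool such as bcftools is a good way to check that the header is accepted.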

4 weeks ago by
Canada
manuel.belmadani580 wrote:

Assuming you're using Python, you can start with a template like so (make sure your chromosome lengths are correct for your assembly):

##fileformat=VCFv4.1
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chrM,length=16571>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    Path
%CHROMOSOME %POSITION   %ID %REF    %ALT    .   .   .

Then read this file in line by line. Keep each line in a list, extract the last line as a "template" and remove it from the list. Parse the template for its placeholders, then read in your variant data as a list of variant tuples. Pair each tuple's values with the placeholders and substitute them. Stub example:

        # variant_template is the last line of the template file, as a string
        output = variant_template
        # Placeholders to replace, in the same order as the fields of variant_tuple.
        # You could also derive them by splitting the template line you extracted.
        PLACEHOLDERS = ["%" + name for name in "CHROMOSOME,POSITION,ID,REF,ALT".split(",")]
        for placeholder, value in zip(PLACEHOLDERS, variant_tuple):
            output = output.replace(placeholder, str(value))  # str() so integer positions work

Append output to the file. Do this for each variant.
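
For completeness, here is a self-contained sketch of the whole template approach. It assumes a shortened, tab-separated version of the template (one contig line, the eight standard columns) held in a string rather than a file. The function name `fill_template` and the variant values are invented for this example.

```python
# Shortened template: header lines followed by one placeholder row.
# "\t" inside the string becomes a real tab character.
TEMPLATE = """##fileformat=VCFv4.1
##contig=<ID=chr1,length=249250621>
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
%CHROMOSOME\t%POSITION\t%ID\t%REF\t%ALT\t.\t.\t.
"""


def fill_template(template_text, variants):
    """Return VCF text: the template header plus one filled row per variant.

    Each variant is a tuple whose values are in the same order as the
    placeholders in the last line of the template.
    """
    lines = template_text.splitlines()
    header, row_template = lines[:-1], lines[-1]
    # Placeholders are the tab-separated fields that start with '%'.
    placeholders = [f for f in row_template.split("\t") if f.startswith("%")]

    output_lines = list(header)
    for variant in variants:
        row = row_template
        for placeholder, value in zip(placeholders, variant):
            row = row.replace(placeholder, str(value))
        output_lines.append(row)
    return "\n".join(output_lines) + "\n"


# Made-up variants in (chromosome, position, id, ref, alt) order.
variants = [
    ("chr1", 12345, ".", "A", "G"),
    ("chr1", 67890, "example_id", "C", "T"),
]
vcf_text = fill_template(TEMPLATE, variants)
```

Writing `vcf_text` to a file opened in text mode gives the finished VCF. Plain `str.replace` is enough here because no placeholder name is a prefix of another.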


Thank you! I will try this.

written 4 weeks ago by carolina.santiago.t0