Question: Programming Challenge - Synthetic Whole Genome Vcf
8
gravatar for Mahdi Sarmady
4.9 years ago by
Mahdi Sarmady290
USA
Mahdi Sarmady290 wrote:

I need a script that takes a reference (say hg19v37 fasta) and generates a vcf file for all defined bases of all chromosomes of the reference (ignore Ns). The idea is to generate a whole genome vcf with all possible single nucleotide variants (not indels) so it can be used with different annotation tools.

The vcf needs to have all possible alternate bases in ALT column for each position (see example below)

#CHROM POS     ID        REF    ALT     QUAL FILTER INFO
1     10001   . T      A,C,G       100   PASS   DP=100 
1     10002   . A      C,G,T       100   PASS   DP=100 
1     10003   . A      C,G,T       100   PASS   DP=100 
1     10004   . C      A,G,T       100   PASS   DP=100

Any ideas?

EDIT: Some benchmarks for the proposed solutions using hg19:chr20

 User      Code_version       Language       Total_chars         Time (m:s)
 Pierre         1             C                  1490               1:04
 Pierre         2             C                  1559               1:07
 matted         2             Python              543               1:45
 Istvan         2             Python              369               2:37
 Istvan         1             Python              506               3:03
 JC             3             Perl                393               3:16
 JC             1             Perl                735               7:31
 JC             2             Perl                552               8:03
 Rm             1             Awk                 434               9:22
 matted         1             Python              404              10:38

All tests were done in a Mac OS X 10.6.8, Intel Core 2 Duo 2.4 GHz, 2 GB 667 MHz DDR2 SDRAM

gcc => i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5666) (dot 3)

perl => v5.10.0 built for darwin-thread-multi-2level

python => v2.6.1 (r261:67515, GCC 4.2.1 Apple Inc. build 5646) on darwin

awk => v20070501

vcf python perl awk • 2.2k views
ADD COMMENTlink modified 4.9 years ago by Rm7.5k • written 4.9 years ago by Mahdi Sarmady290
1

36 cumulative votes and 289 views in less than a day :-) this really tickled our fancy

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by Istvan Albert ♦♦ 73k

I think we need one code-golf challenge per week

ADD REPLYlink written 4.9 years ago by JC6.1k

You guys are amazing. I didn't think the question would trigger this much activity.

ADD REPLYlink written 4.9 years ago by Mahdi Sarmady290

what have you tried? Using the alignIO in biopython you can extract all the SNP information for each column of the alignment (you can ignore columns with a dash). At that point you can "easily" write the file...

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by Whetting1.5k
9
gravatar for Pierre Lindenbaum
4.9 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum96k wrote:

Updated C version (only a switch/case):

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

#define SEQNAME_MAX BUFSIZ

#define SWITCH(letter1,letter2,code) case letter1: case letter2:\
                fputs(name,stdout);\
                fputc('\t',stdout);\
                printf("%d",(pos+1));\
                fputs("\t.\t",stdout);\
                fputc(c,stdout);\
                fputc('\t',stdout);\
                fputs(code,stdout);\
                fputs("\t100\tPASS\tDP=100\n",stdout);\
                ++pos;\
                break


int main(int argc,char** argv)
    {
    int c;
    int pos=0;
    char name[SEQNAME_MAX];
    name[0]=0;
    fputs("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",stdout);
    for(;;)
            {
            switch((c=fgetc(stdin)))
             {
         case EOF:  return EXIT_SUCCESS;
        case '>':
            {
            int space=0;
            int name_length=0;
            name[0]=0;
            pos=0;
            while((c=fgetc(stdin))!=EOF && c!='\n')
                {    
                if(space) continue;
                if(isspace(c)) { space=1; continue;}
                name[name_length++]=c;
                }
            name[name_length]=0;
            break;
            }
                SWITCH('c','C',"A,T,G");
                SWITCH('a','A',"C,T,G");
                SWITCH('t','T',"C,A,G");
                SWITCH('g','G',"C,A,T");
            case ' ':case '\n' : case '\r': break;/* space, ignore */
                default:++pos;break;
                }
        }
     return EXIT_SUCCESS;
    }

compile:

gcc -o biostar52398 -Wall -O3 file.c

test:

$ curl -s ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | gunzip  -c | biostar52398  | head -n 20
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO
1    10167    .    T    C,A,G    100    PASS    DP=100
1    10168    .    A    C,T,G    100    PASS    DP=100
1    10169    .    A    C,T,G    100    PASS    DP=100
1    10170    .    C    A,T,G    100    PASS    DP=100
1    10171    .    C    A,T,G    100    PASS    DP=100
1    10172    .    C    A,T,G    100    PASS    DP=100
1    10173    .    T    C,A,G    100    PASS    DP=100
1    10174    .    A    C,T,G    100    PASS    DP=100
1    10175    .    A    C,T,G    100    PASS    DP=100
1    10176    .    C    A,T,G    100    PASS    DP=100
1    10177    .    C    A,T,G    100    PASS    DP=100
1    10178    .    C    A,T,G    100    PASS    DP=100
1    10179    .    T    C,A,G    100    PASS    DP=100
1    10180    .    A    C,T,G    100    PASS    DP=100
1    10181    .    A    C,T,G    100    PASS    DP=100
1    10182    .    C    A,T,G    100    PASS    DP=100
1    10183    .    C    A,T,G    100    PASS    DP=100
1    10184    .    C    A,T,G    100    PASS    DP=100
1    10185    .    T    C,A,G    100    PASS    DP=100
ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by Pierre Lindenbaum96k
2

Small bug: ++pos will count line breaks (\n characters) and so the POS field is wrong.

ADD REPLYlink written 4.9 years ago by matted6.5k

fixed. Thanks :-)

ADD REPLYlink written 4.9 years ago by Pierre Lindenbaum96k

I'm sure it could be faster by removing the 'if' tests and moving them into the switch-case statement. Anyway, there is not much difference between 1 minute and 10 minutes: you just need the correct answer to get all the possible predictions don't you ?

ADD REPLYlink written 4.9 years ago by Pierre Lindenbaum96k

Indeed. Thanks everyone!

ADD REPLYlink written 4.9 years ago by Byron0

I couldn't resist. I've updated my code; it's only a switch-case now.

ADD REPLYlink written 4.9 years ago by Pierre Lindenbaum96k

you better! we are closing in fast on the C solution

ADD REPLYlink written 4.9 years ago by Istvan Albert ♦♦ 73k

running again the benchmark

ADD REPLYlink written 4.9 years ago by JC6.1k

umm the new code is 3 seconds slower than the original version

ADD REPLYlink written 4.9 years ago by JC6.1k
8
gravatar for matted
4.9 years ago by
matted6.5k
Boston, United States
matted6.5k wrote:

I'm late to the party, but here's a Python version:

import sys, Bio.SeqIO

print "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"])

for s in Bio.SeqIO.parse(sys.stdin, "fasta"):
    for i,c in enumerate(s.seq.tostring()):
        if c != "N": print "\t".join([s.id, str(i+1), ".", c,
                                      ",".join(sorted(set("ACGT").difference(c))),
                                      "100", "PASS", "DP=100"])

It won't be the fastest, but the code is short...

ADD COMMENTlink written 4.9 years ago by matted6.5k
7
gravatar for JC
4.9 years ago by
JC6.1k
Mexico
JC6.1k wrote:

I'm no sure the purpose of this big file, but here is my solution:

#!/usr/bin/perl

# genome2vcf.pl

use strict;
use warnings;

$/="\n>";

print join "\t", '#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO';
print "\n";
while (<>) {
    s/>//g;
    my ($chr, @seq) = split (/\n/, $_);
    $chr =~ s/chr//;
    my $seq = join "", @seq;
    my $len = length $seq;
    for (my $i = 0; $i <= $len; $i++) {
        my $b = uc(substr($seq, $i, 1));
        next if ($b =~ m/[^ACGT]/);
        my $alt = 'N';
        if    ($b eq 'A') { $alt = 'C,G,T'; }
        elsif ($b eq 'C') { $alt = 'A,G,T'; }
        elsif ($b eq 'G') { $alt = 'A,C,T'; }
        elsif ($b eq 'T') { $alt = 'A,C,G'; }
        print join "\t", $chr, $i + 1, '.', $b, $alt, 100, 'PASS', 'DP=100';
        print "\n";
    }
}

then you can run:

perl genome2vcf.pl < hg19.fa > hg19.vcf
ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by JC6.1k
2

after seeing that the "code-golf" tag (why Pierre's code has more up-votes?), I create a shorter version:

#!/usr/bin/perl
print join "\t",'#CHROM','POS','ID','REF','ALT','QUAL','FILTER','INFO';
print "\n";
while (<>) {
    chomp;
    if (m/>(.+)/) { $chr = $1; $i = 0; next; }
    if (m/[ACGT]/i) {
        for (my $j = 0; $j <= length $_; $j++) {
            $b = uc(substr($_, $j, 1));
            next if ($b =~ m/[^ACGT]/);
            $alt = 'A,C,G,T,'; $alt =~ s/$b,//; $alt =~ s/,$//;
            print join "\t", $chr,$i+$j+1,'.',$b,$alt,100,'PASS','DP=100';
            print "\n";
        }
    }   
    $i += length $_;
}

*edit: some tricks for alternative alleles and a bug fixed thanks to matted

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by JC6.1k
1

I think it's just a reward for daring to enter a code-golf competition while wielding the C programming language

ADD REPLYlink written 4.9 years ago by Istvan Albert ♦♦ 73k

well at least isn't Java or C# ...

ADD REPLYlink written 4.9 years ago by JC6.1k

Since I already started checking for bugs: this won't handle output POS correctly either. Lines of all N won't correctly increase $i. If you run on Pierre's example, you'll get:

#CHROM POS ID REF ALT QUAL FILTER INFO
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 41 . T A,C,G 100 PASS DP=100
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 42 . A C,G,T 100 PASS DP=100
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 43 . A C,G,T 100 PASS DP=100
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 44 . C A,G,T 100 PASS DP=100
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 45 . C A,G,T 100 PASS DP=100

Pierre's positions are wrong too, though... the correct ones start at 10001. Verify with:

$ samtools faidx temp.fa 1:10001-10021
>1:10001-10019
TAACCCTAACCCTAACCCT

$ samtools faidx temp.fa 1:9991-10009
>1:9991-10009
NNNNNNNNNNTAACCCTAA
ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by matted6.5k

oooppps, I've fixed the bug :-)

ADD REPLYlink written 4.9 years ago by Pierre Lindenbaum96k

thanks for spotting the bug, I corrected it

ADD REPLYlink written 4.9 years ago by JC6.1k
1

I ran some bechmarks with hg19:chr20 for all the solutions:

 User      Code_version       Language       Total_chars         Time (m:s)
 Pierre         1             C                  1490               1:04
 Pierre         2             C                  1559               1:07
 matted         2             Python              543               1:45
 Istvan         2             Python              369               2:37
 Istvan         1             Python              506               3:03
 JC             3             Perl                393               3:16
 JC             1             Perl                735               7:31
 JC             2             Perl                552               8:03
 Rm             1             Awk                 434               9:22
 matted         1             Python              404              10:38
ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by JC6.1k

could you please add this benchmark to the top post as well and update it there, it is getting harder to find it

ADD REPLYlink written 4.9 years ago by Istvan Albert ♦♦ 73k

sure, first I will run matted's new python code and I will copy/paste to the top post.

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by JC6.1k

third Perl version after copying Istvar's optimizations:

 print join "\t",'#CHROM','POS','ID','REF','ALT','QUAL','FILTER','INFO';
 print "\n";
 %a = ('A'=>'C,G,T', 'C'=>'A,G,T', 'G'=>'A,C,T', 'T'=>'A,C,G');
 while (<>) {
     chomp;
     if (m/>(.+)/) { $chr = $1; $i = 0; }
     else {
         @a = split(//, uc $_);
         foreach $b (@a) {
             $i++;
             if ($a{$b}) {
                 print join "\t", $chr, $i, '.', $b, $a{$b}, 100, 'PASS', 'DP=100';
                 print "\n";
             }
         }
     }
 }

really close to Istvar's Python, but still far from Pierre's C

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by JC6.1k
4
gravatar for Istvan Albert
4.9 years ago by
Istvan Albert ♦♦ 73k
University Park, USA
Istvan Albert ♦♦ 73k wrote:

Ok, I can't let the Python version be the slowest. This should be end up being around the second fastest after the C (at least in my calculations it is about three times faster than matted's).

import sys
from itertools import count, imap

print "\t".join("#CHROM POS ID REF ALT QUAL FILTER INFO".split())

patt = "\t".join("%s %s . %s %s 100 PASS DP=100".split())

change = dict(A="C,G,T", C="A,G,T", T="A,C,G", G="A,C,T")
for line in sys.stdin:
    if line[0]=='>': 
        chrom = line[1:-1].split()[0]
        pos = count(1)
    else:
        for base in line.strip().upper():
            index = pos.next()
            if base == "N": continue
            print patt % (chrom, index, base, change[base])
ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by Istvan Albert ♦♦ 73k

Great job! I ran the same test, now you're second in speed and third in size. I had to upper-case all the sequence, you aren't considering soft-masking bases.

ADD REPLYlink written 4.9 years ago by JC6.1k

ah yes, I had a test file with all uppercased bases - made the change in the code above as well

ADD REPLYlink written 4.9 years ago by Istvan Albert ♦♦ 73k
3
gravatar for Rm
4.9 years ago by
Rm7.5k
Danville, PA
Rm7.5k wrote:

tooooo late?: Awk one (long) liner solution: Genome sequence to VCF

cat genome.fasta | awk -v dna="ATGC" '{if (NR==1){printf "%s\n", "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"}; split(dna, atgc, "") ; if (/^>/){sub(">","") ;chr=$1 } else { split($0, chars, "") ; for (i=1; i <= length(chars); i++) { count++ ; if (chars[i] ~/[ATGC]/){ printf("%s\t%s\t%s\t%s\t", chr, count, ".", chars[i]) ; for(j in atgc){if (chars[i]!=atgc[j]){printf("%s,", atgc[j])}} printf ("%s\t%s\t%s\t%s\n", "", 100,"PASS","DP=100") } } } }' > genome.vcf

Sample Output:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       10001   .       T       C,A,G,  100     PASS    DP=100
1       10002   .       A       C,T,G,  100     PASS    DP=100
1       10003   .       A       C,T,G,  100     PASS    DP=100
1       10004   .       C       A,T,G,  100     PASS    DP=100
1       10005   .       C       A,T,G,  100     PASS    DP=100
1       10006   .       C       A,T,G,  100     PASS    DP=100
1       10007   .       T       C,A,G,  100     PASS    DP=100
1       10008   .       A       C,T,G,  100     PASS    DP=100
1       10009   .       A       C,T,G,  100     PASS    DP=100
ADD COMMENTlink written 4.9 years ago by Rm7.5k
3
gravatar for Istvan Albert
4.9 years ago by
Istvan Albert ♦♦ 73k
University Park, USA
Istvan Albert ♦♦ 73k wrote:

I will respectfully resubmit a version that I don't condone but it will become the shortest at 368 characters, and probably the 2nd fastest (it seems to be faster than my previous python version):

import sys

print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
p ="%s\t%s\t.\t%s\t%s\t100\tPASS\tDP=100"

m = dict(A="C,G,T", C="A,G,T", T="A,C,G", G="A,C,T")
for s in sys.stdin:
    if s[0]=='>': 
        i, c = 0, s[1:-1].split()[0]
    else:
        for b in s.strip().upper():
            i += 1
            if b in m: 
                print p % (c, i, b, m[b])
ADD COMMENTlink written 4.9 years ago by Istvan Albert ♦♦ 73k
2

I'm almost embarrassed to post this bad code, but I wanted to join you in your mission of fast(er) Python. I based it on your implementation, but made it much less readable. The upside is that it's a fair bit faster. On my machine, running on hg19 chr20, actually writing to disk, I get:

Pierre  C++  33 sec (1.00 normalized to C++)
matted  python  53 sec (1.58 normalized to C++)
Istvan python  95 sec (2.88 normalized to C++)

Code:

import sys

print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
p ="%s\t%s\t.\t%s\t%s\t100\tPASS\tDP=100\n"

v = ["C,G,T","A,G,T","A,C,G","A,C,T"]
m = dict(zip("ACTGactg",v+v))

def go(lines):
    i, c = 0, ""
    for s in lines:
        try:
            yield "".join([p % (c, i+j, b, m[b]) for j,b in enumerate(s[:-1])])
            i += j+1
        except KeyError:
            try:
                yield "".join([p % (c, i+j, b, m[b]) for j,b in enumerate(s[:-1]) if b != 'N'])
                i += j+1
            except KeyError:
                i, c = 1, s[1:-1].split()[0]

sys.stdout.writelines(go(sys.stdin.readlines()))
ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by matted6.5k
1

Wow, that is pretty nifty it is a about 30% faster than my version! I can't believe how close we are getting to pure C. Goes to show that the slowest parts of python are making a function call.

an improvement would be to replace the sys.stdin.readlines() function call with just sys.stdin, that will avoid reading the file into memory and should not impact the speed at all.

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by Istvan Albert ♦♦ 73k

yes, it's faster and compact. Benchmark added.

ADD REPLYlink written 4.9 years ago by JC6.1k
1
gravatar for Rm
4.9 years ago by
Rm7.5k
Danville, PA
Rm7.5k wrote:

My improved Awk oneliner:

awk -v s="100\tPASS\tDP=100" '{OFS="\t"; if (NR==1)print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"; if (/^>/){sub(">",""); chr=$1} else {split($0, chars, ""); for(i=1; i <= length(chars); i++) {count++ ; if(chars[i] ~/[ATGC]/){alt = "A,C,G,T"; sub(chars[i], "", alt); sub(/^,|,$/, "", alt); sub(/,,/, ",", alt); print chr, count, ".", chars[i], alt, s} } } }' hs_ref_GRCh37.p9_chr20.fa >chr20.vcf
ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by Rm7.5k

I ran the test, it's slower than your previous version (28m 18s)

ADD REPLYlink written 4.9 years ago by JC6.1k

surprising: my previous one: real 8m0.254s this one: real 4m11.273s

ADD REPLYlink written 4.9 years ago by Rm7.5k

yes, sounds weird, I'm gonna run again and check if there are some hidden process in my mac

ADD REPLYlink written 4.9 years ago by JC6.1k

I found the problem, AWK is using all my memory (2GB) and the system is swapping

ADD REPLYlink written 4.9 years ago by JC6.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 743 users visited in the last hour