Programming Challenge - Synthetic Whole Genome Vcf
7
10
Entering edit mode
12.0 years ago
Mahdi Sarmady ▴ 310

I need a script that takes a reference (say hg19v37 fasta) and generates a vcf file for all defined bases of all chromosomes of the reference (ignore Ns). The idea is to generate a whole genome vcf with all possible single nucleotide variants (not indels) so it can be used with different annotation tools.

The vcf needs to have all possible alternate bases in ALT column for each position (see example below)

#CHROM POS     ID        REF    ALT     QUAL FILTER INFO
1     10001   . T      A,C,G       100   PASS   DP=100 
1     10002   . A      C,G,T       100   PASS   DP=100 
1     10003   . A      C,G,T       100   PASS   DP=100 
1     10004   . C      A,G,T       100   PASS   DP=100

Any ideas?

EDIT: Some benchmarks for the proposed solutions using hg19:chr20

 User      Code_version       Language       Total_chars         Time (m:s)
 Pierre         1             C                  1490               1:04
 Pierre         2             C                  1559               1:07
 matted         2             Python              543               1:45
 Istvan         2             Python              369               2:37
 Istvan         1             Python              506               3:03
 JC             3             Perl                393               3:16
 JC             1             Perl                735               7:31
 JC             2             Perl                552               8:03
 Rm             1             Awk                 434               9:22
 matted         1             Python              404              10:38

All tests were done in a Mac OS X 10.6.8, Intel Core 2 Duo 2.4 GHz, 2 GB 667 MHz DDR2 SDRAM

gcc => i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5666) (dot 3)

perl => v5.10.0 built for darwin-thread-multi-2level

python => v2.6.1 (r261:67515, GCC 4.2.1 Apple Inc. build 5646) on darwin

awk => v20070501

vcf python perl awk • 6.2k views
ADD COMMENT
1
Entering edit mode

36 cumulative votes and 289 views in less than a day :-) this really tickled our fancy

ADD REPLY
0
Entering edit mode

I think we need one code-golf challenge per week

ADD REPLY
0
Entering edit mode

You guys are amazing. I didn't think the question would trigger this much activity.

ADD REPLY
0
Entering edit mode

what have you tried? Using the alignIO in biopython you can extract all the SNP information for each column of the alignment (you can ignore columns with a dash). At that point you can "easily" write the file...

ADD REPLY
9
Entering edit mode
12.0 years ago

Updated C version (only a switch/case):

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

#define SEQNAME_MAX BUFSIZ

#define SWITCH(letter1,letter2,code) case letter1: case letter2:\
                fputs(name,stdout);\
                fputc('\t',stdout);\
                printf("%d",(pos+1));\
                fputs("\t.\t",stdout);\
                fputc(c,stdout);\
                fputc('\t',stdout);\
                fputs(code,stdout);\
                fputs("\t100\tPASS\tDP=100\n",stdout);\
                ++pos;\
                break


int main(int argc,char** argv)
    {
    int c;
    int pos=0;
    char name[SEQNAME_MAX];
    name[0]=0;
    fputs("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",stdout);
    for(;;)
            {
            switch((c=fgetc(stdin)))
             {
         case EOF:  return EXIT_SUCCESS;
        case '>':
            {
            int space=0;
            int name_length=0;
            name[0]=0;
            pos=0;
            while((c=fgetc(stdin))!=EOF && c!='\n')
                {    
                if(space) continue;
                if(isspace(c)) { space=1; continue;}
                name[name_length++]=c;
                }
            name[name_length]=0;
            break;
            }
                SWITCH('c','C',"A,T,G");
                SWITCH('a','A',"C,T,G");
                SWITCH('t','T',"C,A,G");
                SWITCH('g','G',"C,A,T");
            case ' ':case '\n' : case '\r': break;/* space, ignore */
                default:++pos;break;
                }
        }
     return EXIT_SUCCESS;
    }

compile:

gcc -o biostar52398 -Wall -O3 file.c

test:

$ curl -s ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | gunzip  -c | biostar52398  | head -n 20
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO
1    10167    .    T    C,A,G    100    PASS    DP=100
1    10168    .    A    C,T,G    100    PASS    DP=100
1    10169    .    A    C,T,G    100    PASS    DP=100
1    10170    .    C    A,T,G    100    PASS    DP=100
1    10171    .    C    A,T,G    100    PASS    DP=100
1    10172    .    C    A,T,G    100    PASS    DP=100
1    10173    .    T    C,A,G    100    PASS    DP=100
1    10174    .    A    C,T,G    100    PASS    DP=100
1    10175    .    A    C,T,G    100    PASS    DP=100
1    10176    .    C    A,T,G    100    PASS    DP=100
1    10177    .    C    A,T,G    100    PASS    DP=100
1    10178    .    C    A,T,G    100    PASS    DP=100
1    10179    .    T    C,A,G    100    PASS    DP=100
1    10180    .    A    C,T,G    100    PASS    DP=100
1    10181    .    A    C,T,G    100    PASS    DP=100
1    10182    .    C    A,T,G    100    PASS    DP=100
1    10183    .    C    A,T,G    100    PASS    DP=100
1    10184    .    C    A,T,G    100    PASS    DP=100
1    10185    .    T    C,A,G    100    PASS    DP=100
ADD COMMENT
2
Entering edit mode

Small bug: ++pos will count line breaks (\n characters) and so the POS field is wrong.

ADD REPLY
0
Entering edit mode

fixed. Thanks :-)

ADD REPLY
0
Entering edit mode

I'm sure it could be faster by removing the 'if' tests and moving them into the switch-case statement. Anyway, there is not much difference between 1 minute and 10 minutes: you just need the correct answer to get all the possible predictions don't you ?

ADD REPLY
0
Entering edit mode

Indeed. Thanks everyone!

ADD REPLY
0
Entering edit mode

I couldn't resist. I've updated my code; it's only a switch-case now.

ADD REPLY
0
Entering edit mode

you better! we are closing in fast on the C solution

ADD REPLY
0
Entering edit mode

running again the benchmark

ADD REPLY
0
Entering edit mode

umm the new code is 3 seconds slower than the original version

ADD REPLY
8
Entering edit mode
12.0 years ago
matted 7.8k

I'm late to the party, but here's a Python version:

import sys, Bio.SeqIO

print "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"])

for s in Bio.SeqIO.parse(sys.stdin, "fasta"):
    for i,c in enumerate(s.seq.tostring()):
        if c != "N": print "\t".join([s.id, str(i+1), ".", c,
                                      ",".join(sorted(set("ACGT").difference(c))),
                                      "100", "PASS", "DP=100"])

It won't be the fastest, but the code is short...

ADD COMMENT
7
Entering edit mode
12.0 years ago
JC 13k

I'm no sure the purpose of this big file, but here is my solution:

#!/usr/bin/perl

# genome2vcf.pl

use strict;
use warnings;

$/="\n>";

print join "\t", '#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO';
print "\n";
while (<>) {
    s/>//g;
    my ($chr, @seq) = split (/\n/, $_);
    $chr =~ s/chr//;
    my $seq = join "", @seq;
    my $len = length $seq;
    for (my $i = 0; $i <= $len; $i++) {
        my $b = uc(substr($seq, $i, 1));
        next if ($b =~ m/[^ACGT]/);
        my $alt = 'N';
        if    ($b eq 'A') { $alt = 'C,G,T'; }
        elsif ($b eq 'C') { $alt = 'A,G,T'; }
        elsif ($b eq 'G') { $alt = 'A,C,T'; }
        elsif ($b eq 'T') { $alt = 'A,C,G'; }
        print join "\t", $chr, $i + 1, '.', $b, $alt, 100, 'PASS', 'DP=100';
        print "\n";
    }
}

then you can run:

perl genome2vcf.pl < hg19.fa > hg19.vcf
ADD COMMENT
2
Entering edit mode

after seeing that the "code-golf" tag (why Pierre's code has more up-votes?), I create a shorter version:

#!/usr/bin/perl
print join "\t",'#CHROM','POS','ID','REF','ALT','QUAL','FILTER','INFO';
print "\n";
while (<>) {
    chomp;
    if (m/>(.+)/) { $chr = $1; $i = 0; next; }
    if (m/[ACGT]/i) {
        for (my $j = 0; $j <= length $_; $j++) {
            $b = uc(substr($_, $j, 1));
            next if ($b =~ m/[^ACGT]/);
            $alt = 'A,C,G,T,'; $alt =~ s/$b,//; $alt =~ s/,$//;
            print join "\t", $chr,$i+$j+1,'.',$b,$alt,100,'PASS','DP=100';
            print "\n";
        }
    }   
    $i += length $_;
}

*edit: some tricks for alternative alleles and a bug fixed thanks to matted

ADD REPLY
1
Entering edit mode

I think it's just a reward for daring to enter a code-golf competition while wielding the C programming language

ADD REPLY
0
Entering edit mode

well at least isn't Java or C# ...

ADD REPLY
0
Entering edit mode

Since I already started checking for bugs: this won't handle output POS correctly either. Lines of all N won't correctly increase $i. If you run on Pierre's example, you'll get:

#CHROM POS ID REF ALT QUAL FILTER INFO
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 41 . T A,C,G 100 PASS DP=100
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 42 . A C,G,T 100 PASS DP=100
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 43 . A C,G,T 100 PASS DP=100
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 44 . C A,G,T 100 PASS DP=100
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 45 . C A,G,T 100 PASS DP=100

Pierre's positions are wrong too, though... the correct ones start at 10001. Verify with:

$ samtools faidx temp.fa 1:10001-10021
>1:10001-10019
TAACCCTAACCCTAACCCT

$ samtools faidx temp.fa 1:9991-10009
>1:9991-10009
NNNNNNNNNNTAACCCTAA
ADD REPLY
0
Entering edit mode

oooppps, I've fixed the bug :-)

ADD REPLY
0
Entering edit mode

thanks for spotting the bug, I corrected it

ADD REPLY
1
Entering edit mode

I ran some bechmarks with hg19:chr20 for all the solutions:

 User      Code_version       Language       Total_chars         Time (m:s)
 Pierre         1             C                  1490               1:04
 Pierre         2             C                  1559               1:07
 matted         2             Python              543               1:45
 Istvan         2             Python              369               2:37
 Istvan         1             Python              506               3:03
 JC             3             Perl                393               3:16
 JC             1             Perl                735               7:31
 JC             2             Perl                552               8:03
 Rm             1             Awk                 434               9:22
 matted         1             Python              404              10:38
ADD REPLY
0
Entering edit mode

could you please add this benchmark to the top post as well and update it there, it is getting harder to find it

ADD REPLY
0
Entering edit mode

sure, first I will run matted's new python code and I will copy/paste to the top post.

ADD REPLY
0
Entering edit mode

third Perl version after copying Istvar's optimizations:

 print join "\t",'#CHROM','POS','ID','REF','ALT','QUAL','FILTER','INFO';
 print "\n";
 %a = ('A'=>'C,G,T', 'C'=>'A,G,T', 'G'=>'A,C,T', 'T'=>'A,C,G');
 while (<>) {
     chomp;
     if (m/>(.+)/) { $chr = $1; $i = 0; }
     else {
         @a = split(//, uc $_);
         foreach $b (@a) {
             $i++;
             if ($a{$b}) {
                 print join "\t", $chr, $i, '.', $b, $a{$b}, 100, 'PASS', 'DP=100';
                 print "\n";
             }
         }
     }
 }

really close to Istvar's Python, but still far from Pierre's C

ADD REPLY
4
Entering edit mode
12.0 years ago

Ok, I can't let the Python version be the slowest. This should be end up being around the second fastest after the C (at least in my calculations it is about three times faster than matted's).

import sys
from itertools import count, imap

print "\t".join("#CHROM POS ID REF ALT QUAL FILTER INFO".split())

patt = "\t".join("%s %s . %s %s 100 PASS DP=100".split())

change = dict(A="C,G,T", C="A,G,T", T="A,C,G", G="A,C,T")
for line in sys.stdin:
    if line[0]=='>': 
        chrom = line[1:-1].split()[0]
        pos = count(1)
    else:
        for base in line.strip().upper():
            index = pos.next()
            if base == "N": continue
            print patt % (chrom, index, base, change[base])
ADD COMMENT
0
Entering edit mode

Great job! I ran the same test, now you're second in speed and third in size. I had to upper-case all the sequence, you aren't considering soft-masking bases.

ADD REPLY
0
Entering edit mode

ah yes, I had a test file with all uppercased bases - made the change in the code above as well

ADD REPLY
3
Entering edit mode
12.0 years ago
Rm 8.3k

tooooo late?: Awk one (long) liner solution: Genome sequence to VCF

cat genome.fasta | awk -v dna="ATGC" '{if (NR==1){printf "%s\n", "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"}; split(dna, atgc, "") ; if (/^>/){sub(">","") ;chr=$1 } else { split($0, chars, "") ; for (i=1; i <= length(chars); i++) { count++ ; if (chars[i] ~/[ATGC]/){ printf("%s\t%s\t%s\t%s\t", chr, count, ".", chars[i]) ; for(j in atgc){if (chars[i]!=atgc[j]){printf("%s,", atgc[j])}} printf ("%s\t%s\t%s\t%s\n", "", 100,"PASS","DP=100") } } } }' > genome.vcf

Sample Output:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       10001   .       T       C,A,G,  100     PASS    DP=100
1       10002   .       A       C,T,G,  100     PASS    DP=100
1       10003   .       A       C,T,G,  100     PASS    DP=100
1       10004   .       C       A,T,G,  100     PASS    DP=100
1       10005   .       C       A,T,G,  100     PASS    DP=100
1       10006   .       C       A,T,G,  100     PASS    DP=100
1       10007   .       T       C,A,G,  100     PASS    DP=100
1       10008   .       A       C,T,G,  100     PASS    DP=100
1       10009   .       A       C,T,G,  100     PASS    DP=100
ADD COMMENT
0
Entering edit mode

This is incorrect, fine for chr1, fails for the others:

2 249260622 . C A,T,G, 100 PASS DP=100

2 249260623 . G C,A,T, 100 PASS DP=100

2 249260624 . T C,A,G, 100 PASS DP=100

ADD REPLY
3
Entering edit mode
12.0 years ago

I will respectfully resubmit a version that I don't condone but it will become the shortest at 368 characters, and probably the 2nd fastest (it seems to be faster than my previous python version):

import sys

print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
p ="%s\t%s\t.\t%s\t%s\t100\tPASS\tDP=100"

m = dict(A="C,G,T", C="A,G,T", T="A,C,G", G="A,C,T")
for s in sys.stdin:
    if s[0]=='>': 
        i, c = 0, s[1:-1].split()[0]
    else:
        for b in s.strip().upper():
            i += 1
            if b in m: 
                print p % (c, i, b, m[b])
ADD COMMENT
2
Entering edit mode

I'm almost embarrassed to post this bad code, but I wanted to join you in your mission of fast(er) Python. I based it on your implementation, but made it much less readable. The upside is that it's a fair bit faster. On my machine, running on hg19 chr20, actually writing to disk, I get:

Pierre  C++  33 sec (1.00 normalized to C++)
matted  python  53 sec (1.58 normalized to C++)
Istvan python  95 sec (2.88 normalized to C++)

Code:

import sys

print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
p ="%s\t%s\t.\t%s\t%s\t100\tPASS\tDP=100\n"

v = ["C,G,T","A,G,T","A,C,G","A,C,T"]
m = dict(zip("ACTGactg",v+v))

def go(lines):
    i, c = 0, ""
    for s in lines:
        try:
            yield "".join([p % (c, i+j, b, m[b]) for j,b in enumerate(s[:-1])])
            i += j+1
        except KeyError:
            try:
                yield "".join([p % (c, i+j, b, m[b]) for j,b in enumerate(s[:-1]) if b != 'N'])
                i += j+1
            except KeyError:
                i, c = 1, s[1:-1].split()[0]

sys.stdout.writelines(go(sys.stdin.readlines()))
ADD REPLY
1
Entering edit mode

Wow, that is pretty nifty it is a about 30% faster than my version! I can't believe how close we are getting to pure C. Goes to show that the slowest parts of python are making a function call.

an improvement would be to replace the sys.stdin.readlines() function call with just sys.stdin, that will avoid reading the file into memory and should not impact the speed at all.

ADD REPLY
0
Entering edit mode

yes, it's faster and compact. Benchmark added.

ADD REPLY
1
Entering edit mode
12.0 years ago
Rm 8.3k

My improved Awk oneliner:

awk -v s="100\tPASS\tDP=100" '{OFS="\t"; if (NR==1)print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"; if (/^>/){sub(">",""); chr=$1} else {split($0, chars, ""); for(i=1; i <= length(chars); i++) {count++ ; if(chars[i] ~/[ATGC]/){alt = "A,C,G,T"; sub(chars[i], "", alt); sub(/^,|,$/, "", alt); sub(/,,/, ",", alt); print chr, count, ".", chars[i], alt, s} } } }' hs_ref_GRCh37.p9_chr20.fa >chr20.vcf
ADD COMMENT
0
Entering edit mode

I ran the test, it's slower than your previous version (28m 18s)

ADD REPLY
0
Entering edit mode

surprising: my previous one: real 8m0.254s this one: real 4m11.273s

ADD REPLY
0
Entering edit mode

yes, sounds weird, I'm gonna run again and check if there are some hidden process in my mac

ADD REPLY
0
Entering edit mode

I found the problem, AWK is using all my memory (2GB) and the system is swapping

ADD REPLY

Login before adding your answer.

Traffic: 1240 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6