Question: Code Golf: Digesting Fasta Sequences Into A Set Of Smaller Sequences
15
gravatar for Eric Normandeau
7.0 years ago by
Eric Normandeau9.4k
Quebec, Canada
Eric Normandeau9.4k wrote:

Hi,

I had to come up with a code last week to solve this small problem and thought it may be another nice idea for a code golf.

Here are the requirements:

Given a fasta file (potentially huge), return a fasta file containing smaller sequences of length 'n' that represent all non-overlaping subsequences of the original sequences. Rename the new sequences according to the scheme 'oldSeqName_00001" and up, where 'oldSeqName' is the original name for the sequence.

Test file: please use http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr1.fa.gz

example run : digest_fasta(fasta_in, 60, fasta_out)

inputs: - a number of nucleotides for each sub-sequence (n) - a fasta file

EXAMPLE:

>XYZ_gene
CGTGAAGACGCAGTACCAGCAGAGATGGCTGGCCATCGACGGCAACGCTCGCCGCGAGAT
CAAGAACTACGTTTTACAGACTCTGGGCACGGAGACGTACCGGCCCAGCTCGGCGTCGCA
GTGCGTCGCCGGCATCGCCTGCGCTGAGATCCCCGTTAACCAGTGGCCCGAGCTGATCCC
ACAGCTGGTGGCCAACGTCACGGACCCGTCCAGCACCGAACACATGAAGGAGTCCACGTT
GGAGGCCATCGGGTACATCTGCCAGGACATCGACCCGGAGCAGCTGCAGGAGAACGCCAA
CCAGATCCTGACGGCCATCATCCAGGGCATGAGGAAGGAGGAGCCCAGTAACAACGTGAA
GCTGGCCGCGACTAACGCTCTGCTCAACTCGCTGGAGTTCACTAAAGCCAACTTTGACAA
GGAGACGGAGAGACACTTCATCATGCAGG

outputs: - a fasta file

EXAMPLE (after digesting in 60pb long fragments):

>XYZ_gene_00001
CGTGAAGACGCAGTACCAGCAGAGATGGCTGGCCATCGACGGCAACGCTCGCCGCGAGAT
>XYZ_gene_00002
CAAGAACTACGTTTTACAGACTCTGGGCACGGAGACGTACCGGCCCAGCTCGGCGTCGCA
>XYZ_gene_00003
GTGCGTCGCCGGCATCGCCTGCGCTGAGATCCCCGTTAACCAGTGGCCCGAGCTGATCCC
>XYZ_gene_00004
ACAGCTGGTGGCCAACGTCACGGACCCGTCCAGCACCGAACACATGAAGGAGTCCACGTT
>XYZ_gene_00005
GGAGGCCATCGGGTACATCTGCCAGGACATCGACCCGGAGCAGCTGCAGGAGAACGCCAA
>XYZ_gene_00006
CCAGATCCTGACGGCCATCATCCAGGGCATGAGGAAGGAGGAGCCCAGTAACAACGTGAA
>XYZ_gene_00007
GCTGGCCGCGACTAACGCTCTGCTCAACTCGCTGGAGTTCACTAAAGCCAACTTTGACAA
>XYZ_gene_00008
GGAGACGGAGAGACACTTCATCATGCAGG

And so on for all the sequences in the 'fasta_in' file!

For this week's requirement, I'll try to be clearer: The program should run under linux.

That's it!

Use whatever language, approach, library you favor! Submit multiple answers if they are significantly different!

I'll accept the most voted answer on thursday around 16h Eastern Canada. Do vote for any answer you find interesting, even if they come in late!

Just like last week, diversity will be very appreciated!

Can't wait to see you code ;)

CONCLUSION: Thanks to all who participated with their answers or discussions! It's a pleasure to see all the different approaches and learn new tricks :)

Cheers

fasta sequence code • 6.4k views
ADD COMMENTlink modified 7.0 years ago by Brad Chapman9.1k • written 7.0 years ago by Eric Normandeau9.4k
2

Dear whoever's downvoting without saying why -- it'd be much more constructive to find out why you didn't like a particular solution

ADD REPLYlink written 7.0 years ago by Andrew Clegg310
1

Yes keep deleting comments that are addressed or obsolete.

ADD REPLYlink written 7.0 years ago by Istvan Albert ♦♦ 73k

An example (and canonical) input file would probably help with this one.

ADD REPLYlink written 7.0 years ago by Simon Cockell7.2k

An example would indeed be appreciated. I'm not 100% sure what we are supposed to do this time.

ADD REPLYlink written 7.0 years ago by Lars Juhl Jensen11k

ok, on the way :)

ADD REPLYlink written 7.0 years ago by Eric Normandeau9.4k

I think it would be great to attach a bounty to these questions - a little more incentive to work hard for the best answer!

ADD REPLYlink written 7.0 years ago by Daniel Swan13k

Can we specify a common big fasta source ? ( ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/rs_fasta/rs_ch1.fas.gz ? )

ADD REPLYlink written 7.0 years ago by Pierre Lindenbaum96k

That's a nice idea :)

ADD REPLYlink written 7.0 years ago by Eric Normandeau9.4k

Can we send the code on http://gist.github.com/ ?

ADD REPLYlink written 7.0 years ago by Pierre Lindenbaum96k

I suggest http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr1.fa.gz as the "potentially huge fasta file"

ADD REPLYlink written 7.0 years ago by Pierre Lindenbaum96k

Not sure to follow you @Pierre. You mean INSTEAD of here?

ADD REPLYlink written 7.0 years ago by Eric Normandeau9.4k

@Daniel Swan, I think the question has to be unanswered for a few days before you put a bounty...

ADD REPLYlink written 7.0 years ago by Eric Normandeau9.4k

@eric, I first glance I thought the code could be large. I was wrong, please ignore my comment about gist.

ADD REPLYlink written 7.0 years ago by Pierre Lindenbaum96k

@eric I also removed dbsnp as the test file

ADD REPLYlink written 7.0 years ago by Pierre Lindenbaum96k

@Pierre. Please edit the question to put your suggested test file :)

ADD REPLYlink written 7.0 years ago by Eric Normandeau9.4k

done. @Eric so ''stdin'' shouldn't be handled isn't it ? :-)

ADD REPLYlink written 7.0 years ago by Pierre Lindenbaum96k

Note: in the fasta id, should the length parameter be adjusted? As here:

XYZ_gene (length = 60)_00001

or should the id be unchanged? I would opt for the latter.

ADD REPLYlink written 7.0 years ago by Michael Dondrup42k

nope, that was only for example purpose, I'll delete it...

ADD REPLYlink written 7.0 years ago by Eric Normandeau9.4k

Just a random thought: Do you people mind if I delete some comments as issues get cleared and the comments become less usefull just to keep the list readable?

ADD REPLYlink written 7.0 years ago by Eric Normandeau9.4k

out of curiosity, do you have Matlab ands its associated bioinformatics toolbox?

ADD REPLYlink written 7.0 years ago by Will4.4k

@Will negative for Matlab. I haven't toyed with it since my 'engineer' in training days, about 10 years ago :) It seems to me that with so many viable open and free solutions out there, Matlab is not a first choice for many people doing research (thinking mainly of the grad students here). What is there that makes Matlab an interesting tool in you opinion?

ADD REPLYlink written 7.0 years ago by Eric Normandeau9.4k

Just out of interest, what are you using the n-mers in the output file for?

ADD REPLYlink written 7.0 years ago by Cassj1.3k

@CassJ Just one example. I could have long contigs (10 t0 50 kpb) representing assembly of 454 data sequenced from a 100+ kbp insert in which I know that there is a gene of interest, maybe in multible copies, but also possibly other genes. I can use 500 pb sub-sequences and blast them against a db (nr, swissprot, a custom DNA db) and try to identify the position of that gene.

ADD REPLYlink written 7.0 years ago by Eric Normandeau9.4k

Code golf: it's learning, and advocacy, and showing off all rolled into one! What programmer wouldn't love that?

Last week you suggested discussing problems on a google group. I went to google groups and searched for "project mendel" and saw the existing thread on the biostar-central group. I thought maybe it wouldn't be neighborly (to the people participating in the current thread) to start a separate thread elsewhere, but on the other hand I don't know if biostar-central really wants a big question thread. Of course, that's assuming that we'd get enough people to make a big thread!

ADD REPLYlink written 7.0 years ago by Mitch Skinner650

Plus, I've been thinking about my own work and trying to imagine good questions, but so far I haven't thought of many problems that are the right size. Maybe one so far.

ADD REPLYlink written 7.0 years ago by Mitch Skinner650
15
gravatar for Andrew Clegg
7.0 years ago by
Andrew Clegg310
London
Andrew Clegg310 wrote:

GNU Coreutils version, for single-sequence file only:

grep -v '^>' chr1.fa | tr -d '\n' | fold -w 60 | nl -n rz -s '
' | sed 's/^/>fragment_/;N' > output

The naming convention's not the same as the original question, but that's true of a lot of these :-)

Speed:

real    0m17.182s
user    0m19.255s
sys     0m1.634s

Memory usage: practically 0, it's all done via streams.

Multi-sequence version later, maybe.

EDITED to change first regexp from > to ^> which shaved about a second off.

ADD COMMENTlink modified 7.0 years ago • written 7.0 years ago by Andrew Clegg310

cool ! I didn't know the fold cmd !

ADD REPLYlink written 7.0 years ago by Pierre Lindenbaum96k

Neither did I, but I knew ONE of the coreutils must be able to do that ;-)

ADD REPLYlink written 7.0 years ago by Andrew Clegg310
11
gravatar for brentp
7.0 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

if you have pyfasta installed:

$ pyfasta split -n 1 -k 60 chr1.fa

does what you want, with headers like

>chr1_1
>chr1_61

you can also split into multiple files and with overlapping sequence like:

$ pyfasta split -n 3 -k 60 -o 20 chr1.fa

which will make 60 (-k) mers where each overlaps the previous by 20 (-o) bp and they will be split as evenly as possible among 3 (-n) files.

ADD COMMENTlink written 7.0 years ago by brentp22k
1

Cool! The best code is always the one that is already written!

ADD REPLYlink written 7.0 years ago by Istvan Albert ♦♦ 73k

I think this demonstrates the power of your brainchild :) I'll spend some time learning it and digging through the code. Thanks!

ADD REPLYlink written 7.0 years ago by Eric Normandeau9.4k

@eric, the code responsible for that functionality could use a lot of cleanup. if you look through it feel free to fix up and send a pull request via github: http://github.com/brentp/pyfasta :)

ADD REPLYlink written 7.0 years ago by brentp22k

@brentp, I guess you are refering only to the code in the split_fasta.py file? (this file's only import is Fasta from pyfasta.py, which I guess should be to your liking).

ADD REPLYlink written 7.0 years ago by Eric Normandeau9.4k

@eric yes, that's it.

ADD REPLYlink written 7.0 years ago by brentp22k
10
gravatar for Lars Juhl Jensen
7.0 years ago by
Copenhagen, Denmark
Lars Juhl Jensen11k wrote:

Take 2 for a Perl command line script:

perl -lne '$s.=$_ unless /^>/; if(/^>/||eof){$i=0;$s=~s/(.{1,60})/printf("%s_%05d\n%s\n",$n,++$i,$1);/ge; $n=$_;}' in.fa > out.fa

In contrast to take 1, this one actually works. Runtime is 13 seconds for human chromosome 1. The number 60 inside the script is the length of subsequences to be extracted.

Edit: It just occurred to me that it can be done slightly less ugly. Take 3:

perl -ne 'BEGIN{$/=">"}{s/(.*)//;$n=$1;s/\n//g;$i=0;s/(.{1,60})/printf(">%s_%05d\n%s\n",$n,++$i,$1);/ge;}' in.fa > out.fa
ADD COMMENTlink modified 7.0 years ago • written 7.0 years ago by Lars Juhl Jensen11k

@Lars Thanks for the answer (even if I don't understand it yet... I only started Perl this week end ;) Please consider editing your posts rather than deleting them and posting a new one. Cheers!

ADD REPLYlink written 7.0 years ago by Eric Normandeau9.4k

Thanks - I thought that deleting would delete immediately. And when it didn't, I assumed that undelete could undelete it ... but no, I could only vote for undeleting. Not exactly intuitive if you ask me ;)

ADD REPLYlink written 7.0 years ago by Lars Juhl Jensen11k

Thanks - I thought that deleting would delete immediately. And when it didn't, I assumed that undelete could undelete it ... but no, I could only vote for undeleting. Not exactly intuitive if you ask me ;) Had I know that, I would obviously have edited instead.

ADD REPLYlink written 7.0 years ago by Lars Juhl Jensen11k
8
gravatar for Neilfws
7.0 years ago by
Neilfws47k
Sydney, Australia
Neilfws47k wrote:

My last solution: the EMBOSS program named splitter:

splitter -sequence myfile.fa -size 60 -outseq outfile.fa

Again, header lines look like ">XYZ_gene_1-60 (length = 449)".

ADD COMMENTlink written 7.0 years ago by Neilfws47k

Nice. Never knew about that one, thanks.

ADD REPLYlink written 7.0 years ago by Aaronquinlan9.8k

that is nice! i'm going to check out all the EMBOSS tools now.

ADD REPLYlink written 7.0 years ago by brentp22k
5
gravatar for Mitch Skinner
7.0 years ago by
Mitch Skinner650
Emeryville, CA
Mitch Skinner650 wrote:

Haskell

I'm still pretty much a newbie to Haskell, although I have been learning a lot these past two weeks. The first version of this that I wrote was crazy slow, and used a lot of memory to boot. But this version runs in constant space--I ended up re-writing it to use explicit tail recursion, and I'm once again using lazy I/O to keep the core code purely functional.

Speed wise, since all of the running time measurements on different machines can't really be compared, we could instead compare the running times of our solutions to the speed of a program that we've all got installed. If I send the output of my program to /dev/null, then it has similar I/O characteristics to 'wc', so I'd like to suggest that as a measuring stick.

At a digest size of 60 on chr1.fa, this code (writing to /dev/null) runs in a little over 15 seconds on my machine, which is about 1.6 times the time it takes to run 'wc' on the chr1.fa file. It supports multiple-sequence files and appears to handle most of the corner cases right.

edit: There's something I still don't get about lazy IO with file handles. It works just fine if I use the built-in "interact" (convenience function for taking input from stdin and writing to stdout), though. My original posted version was losing part of the end of the output, but this one doesn't. This version's run-time is almost 1.7 times the wc time.

[?]

Example:

[?]

Thanks for the problem, Eric!

ADD COMMENTlink modified 7.0 years ago • written 7.0 years ago by Mitch Skinner650

I just realized I was forgetting to close the file handles! Ha, I'm a h4sk3ll n00b.

ADD REPLYlink written 7.0 years ago by Mitch Skinner650
3
gravatar for Rajarshi Guha
7.0 years ago by
Rajarshi Guha850
United States
Rajarshi Guha850 wrote:

BioPython based code - not particularly memory efficient (for that I'd process the fasta file directly, rather than go via a toolkit)

from Bio import SeqIO, Seq
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import sys

def chunk(s, n):
    for i in range(0, len(s), n): yield s[i:i+n]

def process(filename, chunkSize):
    seqs = SeqIO.parse(open(filename), 'fasta')
    ohandle = open('output.fasta', 'w')
    for seq in seqs:
        chunks = []
        n = 1
        for substr in chunk(seq.seq.tostring(), chunkSize):
            chunks.append(SeqRecord( Seq(substr), id='%s_%05d' % seq.id, n), description=seq.description ))
            n += 1
    SeqIO.write(chunks, ohandle, 'fasta')
    ohandle.close()

if __name__ == '__main__':
    process(sys.argv[1], int(sys.argv[2]))
ADD COMMENTlink modified 7.0 years ago • written 7.0 years ago by Rajarshi Guha850

+1 However, please make the lenght (here 60) sys.argv[2]) ;)

ADD REPLYlink written 7.0 years ago by Eric Normandeau9.4k

+1 However, please make the lenght (here 60) sys.argv[2] and ohandle use sys.argv[3] ;)

ADD REPLYlink written 7.0 years ago by Eric Normandeau9.4k
3
gravatar for Neilfws
7.0 years ago by
Neilfws47k
Sydney, Australia
Neilfws47k wrote:

If you install the auxiliary scripts when you install Bioperl, you'll find a script for this purpose named "bp_split_seq". Run it like this (assuming chunk size = 60):

bp_split_seq -c 60 -f myfile.fa

You can also specify overlaps and there's an option to index the output sequences.

It does not quite do what's asked in the question: split sequences are stored separately in a directory named split, with names such as "XYZ_gene_1-61.faa" and headers like ">XYZ_gene_121-181 (length = 449)", but it's easy enough to concatenate them if required.

ADD COMMENTlink written 7.0 years ago by Neilfws47k

NOTE: Please consider leaving a comment if you downvote an answer, cheers!

ADD REPLYlink written 7.0 years ago by Eric Normandeau9.4k

Someone is going after several of my answers in rapid succession; guess I offended them :-)

ADD REPLYlink written 7.0 years ago by Neilfws47k

Still very interesting and the output is almost similar to what pyfasta does in terms of the sub-sequences naming convention. Thanks!

ADD REPLYlink written 7.0 years ago by Eric Normandeau9.4k
3
gravatar for Will
7.0 years ago by
Will4.4k
United States
Will4.4k wrote:

A python itertools solution capable of dealing with virtually infinite sized fasta file.

import sys
from itertools import chain, groupby, imap, islice

def fasta_iter(handle):
    """Yields headers and sequences as generators."""
    line_iter = imap(lambda x: x.strip(), handle)
    for is_header, group in groupby(line_iter, lambda x: x.startswith('>')):
        if is_header:
            header = group.next()
        else:
            seq = chain.from_iterable(group)
            yield header, seq

def take(n, iterable):
    return list(islice(iterable, n))

def digest(in_file, width, out_file):

    for header, seq in fasta_iter(in_file):
        siter = iter(seq)       
        block = ''.join(take(width, siter))
        c = 0
        while len(block):
            c += 1
            print c
            out_file.write('%s_%06d\n%s\n' % (header, c, block))    
            block = ''.join(take(width, siter))

if __name__ == '__main__':
    args = sys.argv[1:]
    ifile = args[0]
    width = int(args[1])
    ofile = args[2]
    with open(ifile) as ihandle:
        with open(ofile, 'w') as ohandle:
            digest(ihandle, width, ohandle)
ADD COMMENTlink written 7.0 years ago by Will4.4k
3
gravatar for Pierre Lindenbaum
7.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum96k wrote:

Re-inventing the wheel is FUN :-) My solution using GNU/Flex + C:

    %{
    #include <stdio.h>
    #include <stdlib.h>
    #include <errno.h>
    #include <string.h>
    #include <assert.h>

    #define MIN(a,b) (a<b?a:b)

    static char* title=NULL;
    static char* buffer=NULL;
    static unsigned long fragment_size=0L;
    static unsigned long buffer_size=0;
    static int sequence_index=0;

    static void overflow()
        {
        ++sequence_index;
        fprintf(stdout,"%s_%05d\n",title,sequence_index);
        fwrite(buffer,sizeof(char),buffer_size,stdout);
        fputc('\n',stdout);
        buffer_size=0;  
        }

    %}
    %option never-interactive
    %option noyywrap
    %%
    ^>.*\n   {
         if(buffer_size>0) overflow();
         title=realloc(title,sizeof(char)*yyleng);
         if(title==NULL)
        {
        fprintf(stderr,"Out of memory\n");
        exit(EXIT_FAILURE);
        }
         strncpy(title,yytext,yyleng-1);
         sequence_index=0;
         }

    ^[A-Za-z]+  {
        int nCopy=0;
        while(nCopy!=yyleng)
            {
            int len=MIN(fragment_size - buffer_size,yyleng-nCopy);
            memcpy(&buffer[buffer_size],&yytext[nCopy],len);
            buffer_size+=len;
            nCopy+=len;
            if(buffer_size==fragment_size)
                {
                overflow();
                }
            }
        }

    [ \t\n] ;

    .   {
        fprintf(stderr,"Illegal character  \"%s\".", yytext);
        exit(EXIT_FAILURE);
        }

    <<EOF>> {
        if(buffer_size >0) overflow();
        free(title);
        title=NULL;
        buffer_size=0;
        return 0;
        }

    %%

    int main(int argc,char** argv)
        {
        int optind=1;

        while(optind<argc)
            {
            if(strcmp(argv[optind],"-h")==0)
                {
                fprintf(stderr,"Options:");
                fprintf(stderr," -n <int> chunk size");
                return EXIT_SUCCESS;
                }
            else if(strcmp(argv[optind],"-n")==0)
                {
                fragment_size=atol(argv[++optind]);
                }
            else if(strcmp(argv[optind],"--")==0)
                {
                ++optind;
                break;
                }
            else if(argv[optind][0]=='-')
                {
                fprintf(stderr,"bad options \"%s\".\n",argv[optind]);
                return  EXIT_FAILURE;
                }
            else
                {
                break;
                }
            ++optind;
            }
        if(fragment_size<=0UL)
            {
            fprintf(stderr,"bad size: %ld\n",fragment_size);
            exit(EXIT_FAILURE);
            }
        buffer=malloc(sizeof(char)*fragment_size);
        if(buffer==NULL)
            {
            fprintf(stderr,"Out of memory\n");
            exit(EXIT_FAILURE);
            }
        if(optind==argc)
            {
            yylex();
            }
        else
            {
            while(optind< argc)
                {
                char* filename=argv[optind++];
                errno=0;            
                yyin=fopen(filename,"r");
                if(yyin==NULL)
                    {
                    fprintf(stderr,"Cannot open %s. %s\n",filename,strerror(errno));
                    exit(EXIT_FAILURE);
                    }
                yylex();
                fclose(yyin);
                }
            }
        free(buffer);
        return 0;
        }

Compilation:

flex -f golf.l
gcc -Wall -O3 lex.yy.c

Test:

time ./a.out -n 60 chromosomes/chr1.fa > /dev/null 
real    0m2.280s
user    0m2.176s
sys 0m0.108s
ADD COMMENTlink written 7.0 years ago by Pierre Lindenbaum96k
4

It's only re-inventing if it isn't better, or faster, or cooler, or shinier. If I've understood MTV correctly, I believe what you've done there is pimped the wheel ;)

ADD REPLYlink written 7.0 years ago by Cassj1.3k
1

Ah, that's a markdown thing, right? I didn't realize--I've just been sticking [?] tags in my answers and manually escaping the angle brackets. I ran the gist and it takes just over 3 seconds on my machine, or a bit more than five times faster than my haskell version. Nice! I don't get why 'wc' takes three times longer than your flex code; I've been using it as a yarstick of something minimally-cpu-intensive that has to read the whole input but maybe it's not a good example. Maybe it's doing some kind of unicode or locale-based processing? I don't know.

ADD REPLYlink written 7.0 years ago by Mitch Skinner650

Crikey, speedy. Memory usage?

ADD REPLYlink written 7.0 years ago by Andrew Clegg310

@andrew memory usage = ~(buffered) size of the segment ("-n") + the size of the fasta title.

ADD REPLYlink written 7.0 years ago by Pierre Lindenbaum96k

Is this the whole thing? flex is giving me the error "golf.l:144: premature EOF"

ADD REPLYlink written 7.0 years ago by Mitch Skinner650

Mitch, did you remove the whitespaces on the left margin ?

ADD REPLYlink written 7.0 years ago by Pierre Lindenbaum96k

I put the code in GIST: http://gist.github.com/496309

ADD REPLYlink written 7.0 years ago by Pierre Lindenbaum96k

Ah, that's a markdown thing, right? I didn't realize--I've just been sticking [?]pre[?] tags in my answers and manually escaping the angle brackets. I ran the gist and it takes just over 3 seconds on my machine, or a bit more than five times faster than my haskell version. Nice! I don't get why 'wc' takes three times longer than your flex code; I've been using it as a yarstick of something minimally-cpu-intensive that has to read the whole input but maybe it's not a good example. Maybe it's doing some kind of unicode or locale-based processing? I don't know.

ADD REPLYlink written 7.0 years ago by Mitch Skinner650
3
gravatar for Michael Dondrup
7.0 years ago by
Bergen, Norway
Michael Dondrup42k wrote:

And here the R & Bioconductor solution using the Biostrings package (edited after asking the BioC mailling list and thanks to getting helpful response). The crucial point was actually to get an efficient solution for writing FASTA files working, so there are two solutions.

1.: the "official" solution a bit simplified (only read 1. entry) to make the steps more clear:

require(Biostrings)
digest.fasta <-function(infas, size=60, outfas) {
  dnasts <- read.DNAStringSet(file=infas)
  dnaviews <- Views(dnasts[[1]], 
              successiveIRanges(
              rep(size, ceiling(length(dnasts[[1]]) / size))))

  write.XStringSet(as(dnaviews, "DNAStringSet"), file=outfas))
}

This is pretty much the same as my first attempt, but now I was told it works within R-devel (1.12.0) and the development version of Bioconductor, so this has been improved pretty much. Up to Biostrings_2.16.9 it's very slow though.


2.: For the time being we need a replacement for the writeFASTA (as slow as the write.XStringSet) function. This time, made only from basic language ingredients:

writeFASTA.2 <- function(sequences, desc=names(sequences), width=80, file=stdout(), append=FALSE) {
  ## convert whatever sequence object to character vector
  sequences <- as.character(sequences)
  if(!is.null(desc) && length(sequences) != length(desc))
     stop("wrong length of 'desc'")
  ## if there are no names, make them
  desc <- if (is.null(desc))
    paste(">", seq(along=sequences))
  else
    paste(">", desc, "\n", sep="") 
  ## Adjust line width or add a \n 
  sequences <-  if (! is.null(width))
    gsub( sprintf("(.{1,%d})", width), "\\1\n", sequences )
  else
    paste(sequences, "\n", sep="")
  ## paste is faster than sprintf, print the output
  cat(paste(desc,sequences, sep="", collapse=""), sep="", file=file, append=append )
}

This function can also handle to adjust the lengths of the FASTA lines. However, if this is not needed, then this function is faster:

writeFASTA.4 <- function(dna,  desc=names(dna),  file=stdout()) {
  if (is.null(desc))
    desc <- paste(seq(along=dna))
  fasta = character(2 * length(dna))
  fasta[c(TRUE, FALSE)] <- paste(">", desc, sep="")
  fasta[c(FALSE, TRUE)] <- as.character(dna)
  writeLines(fasta, file)
}

With the whole thing looking like this:

digest.fasta <-function(infas, size=60, outfas) {
  dnasts <- read.DNAStringSet(file=infas)
  dnaviews <- Views(dnasts[[1]], 
               successiveIRanges(
               rep(size, ceiling(length(dnasts[[1]]) / size))))

  writeFASTA.4(dnaviews, file=outfas)
}

> system.time (digest.fasta(infas="chr1.fa", size=60, outfas=tempfile()) )
   user  system elapsed 
 73.345   1.396  74.815
ADD COMMENTlink modified 7.0 years ago • written 7.0 years ago by Michael Dondrup42k
3
gravatar for Haibao Tang
7.0 years ago by
Haibao Tang2.9k
Richmond, CA
Haibao Tang2.9k wrote:

This is in line with brentp's "not-invent-wheels" mentality. In the past I have used faSplit from Kent source, it is quite fast. Usage:

faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.
Files split by sequence will be broken at the nearest fa record boundary.
Files split by base will be broken at any base.
Files broken by size will be broken every count bases.

So for the question, the command might look like this:

faSplit size input.fa 60 outRoot
ADD COMMENTlink modified 7.0 years ago • written 7.0 years ago by Haibao Tang2.9k
2
gravatar for Neilfws
7.0 years ago by
Neilfws47k
Sydney, Australia
Neilfws47k wrote:

A quick attempt in BioRuby just before bed - I may need to edit this in about 9 hours :-)

#!/usr/bin/ruby
require "rubygems"
require "bio"

n = ARGV[1].nil? ? 60 : ARGV[1].to_i

Bio::FlatFile.open(ARGV[0]) do |ff|
  ff.each do |seq|
    1.step(seq.length, n) { |s|
      puts ">#{seq.entry_id}_#{(s / n).to_s.rjust(5, "0")}", seq.seq[s..(s + (n - 1))]
    }
  end
end

Save and run as, e.g. for N = 70:

digest_fasta.rb myfile.fa 70 > outfile

It takes the input filename as first argument. Subsequence length is optionally the second argument or 60 by default. Then we read the fasta file, use step to get the subsequences and print out a new header + sequence, padding the sequence count with 5 zeros. Of course for a huge file, you'd need more than 5 zeros, but I guess you could calculate the required number.

There's a similar example on the BioRuby tutorial page (search for "Divide a genome sequence into sections of 10000bp") - it uses the window_search() method, but has to deal with remainders.

No idea how fast this runs for Chr1 - something for tomorrow. Oh - and as it stands, does not wrap long sequence lines.

ADD COMMENTlink modified 7.0 years ago • written 7.0 years ago by Neilfws47k

I'm sure this had a vote a minute ago :-) like Eric says, comments are appreciated.

ADD REPLYlink written 7.0 years ago by Neilfws47k
2
gravatar for Aaronquinlan
7.0 years ago by
Aaronquinlan9.8k
United States
Aaronquinlan9.8k wrote:

Assuming you have faToTab from kent src installed, you can combine it with awk.

Change the "len" variable in the BEGIN block accordingly.

faToTab chr1.fa stdout | awk 'BEGIN {len=60} { \
                            for (i=0; i<length($2); i+=len) \
                            { \
                              print ">"$1"_"(i/len)+1"\n"substr($2, i, len) \
                            } \
                         }' > out

Not terribly fast (19s), but simple as it can easily be dropped into a one line bash script that takes "len" as an argument.

Aaron

ADD COMMENTlink modified 7.0 years ago • written 7.0 years ago by Aaronquinlan9.8k

Could someone comment on why this was down-voted? Do you believe it to be incorrect?

ADD REPLYlink written 7.0 years ago by Aaronquinlan9.8k

I think someone is on a down-voting spree. I had 2 down here and 2 at another question in a short time span. Perhaps they are offended :-)

ADD REPLYlink written 7.0 years ago by Neilfws47k
2
gravatar for Brad Chapman
7.0 years ago by
Brad Chapman9.1k
Boston, MA
Brad Chapman9.1k wrote:

Here's a Clojure solution to the problem. Like Mitch, I appreciate the ideas; it's a fun chance to practice in a new language.

The tricky part with Clojure was making this fully lazy for memory efficiency while trying to remain general with the implementation. This uses the line based FASTA parser from the University of Tokyo Genome Browser Development Toolkit to handle the underlying IO and then is careful to only grab lines as needed, which scales to both big sequences and files with lots of little sequences.

(import '(org.utgenome.format.fasta FASTAPullParser))

(use '[clojure.java.io]
     '[clojure.contrib.duck-streams :only (with-out-writer)]
     '[clojure.contrib.str-utils2 :only (join)]
     '[clojure.contrib.seq :only (indexed)])

(defn seq-iterator [parser]
  "Lazily retrieve sequence lines one at a time from the FASTA parser."
  (lazy-seq
    (when-let [line (.nextSequenceLine parser)]
      (cons line (seq-iterator parser)))))

(defn partition-seq [parser piece-size]
  "Break up a sequence collection into indexed pieces, pulling lines as needed."
  (->> (seq-iterator parser)
    (mapcat seq)
    (partition-all piece-size)
    indexed))

(defn write-fasta-pieces [in-file piece-size out-file]
  (with-out-writer out-file
    (let [parser (new FASTAPullParser (file in-file))
          piece-size (. Integer parseInt piece-size)]
      (loop [work-parser parser]
        (when-let [cur-id (.nextDescriptionLine work-parser)]
          (doseq [[i s] (partition-seq work-parser piece-size)]
            (println (format ">%s_%s\n%s" cur-id (+ i 1) (join "" s)))
              )
          (recur work-parser))))))

(when *command-line-args*
  (apply write-fasta-pieces *command-line-args*))
ADD COMMENTlink written 7.0 years ago by Brad Chapman9.1k
1
gravatar for Rajarshi Guha
7.0 years ago by
Rajarshi Guha850
United States
Rajarshi Guha850 wrote:

OK, another Python version, written for a fa file with a single seq (but still not memory efficient). Runs in 13s

import sys

def chunk(s, n):
    for i in range(0, len(s), n): yield s[i:i+n]

def process2(filename, chunkSize):
    f = open(filename, 'r')
    f.readline()
    d = f.read().replace("\n", "")
    n = 1
    for substr in chunk(d, chunkSize):
        print '>id_%05d\n%s' % (n, substr)
        n += 1

if __name__ == '__main__':
    process2(sys.argv[1], 60)
ADD COMMENTlink written 7.0 years ago by Rajarshi Guha850

NOTE: Please consider leaving a comment if you downvote an answer, cheers!

ADD REPLYlink written 7.0 years ago by Eric Normandeau9.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 966 users visited in the last hour