Question

How to make fasta manipulation more efficient

0

Entering edit mode

2.6 years ago

c_u ▴ 520

I have a Multi-Sequence FASTA file of the form -

>SDF123.1 blah blah 
ATCTCTGGAAACTCGGTGAAAGAGAGTAT
AGTGATGAGGATGAGTGAG...
>SBF123.1 blah blah 
ATCTCTGGAAACTCGGTGAAAGAGAGTAT
AGTGATGAGGATGAGTGAG....

And I want to extract the individual FASTA files into individual files (like here)

I wrote the following AWK code, but it runs too slow, as compared to when I did not have the close command in it. By slow, I mean it only generates about a dozen files in a minute. I had to incorporate the close command, since without it, I was getting the awk error - too many open files.

Here is the code -

cat big_multi_sequence_file.fasta | awk -F ' ' '{
        if (substr($0, 1, 1)==">") {filename=(substr($1,2) ".fa")}
        print $0 >> filename; close (filename)
}'

How can I make this code more time efficient? I am new to awk.

Thank you!

fasta sequence • 1.7k views

ADD COMMENT • link 2.6 years ago by c_u ▴ 520

0

Entering edit mode

If you are willing to try other solutions then faSplit (LINK for UNIX version) by Jim Kent is probably going to be one of the most efficient options. It is available for Linux/macOS. Be sure to add execute permissions (chmod a+x faSplit).

ADD REPLY • link 2.6 years ago by GenoMax 141k

0

Entering edit mode

2.6 years ago

Congcong • 0

You should use seqkit

https://bioinf.shenwei.me/seqkit/

ADD COMMENT • link updated 2.6 years ago by Istvan Albert 100k • written 2.6 years ago by Congcong • 0

Ram · Accepted Answer · 2021-09-03

3

Entering edit mode

2.6 years ago

Pierre Lindenbaum 161k

csplit input.fasta '%^>%' '/^>/'  "{*}"

ADD COMMENT • link updated 2.6 years ago by Ram 43k • written 2.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I'm on a Mac, so this doesnt work but I found a way that it can be made to work in a Mac. Thank you!

ADD REPLY • link 2.6 years ago by c_u ▴ 520

0

Entering edit mode

If you encountered a problem with the csplit tool and figured out a solution, please share the problem and the solution so others can benefit from your work on this.

ADD REPLY • link 2.6 years ago by Ram 43k

1

Entering edit mode

Sure. So the issue is that {*} doesn't work in Mac, so I found a solution here wherein you can install coreutils on Mac using Homebrew, and then use gcsplit to emulate csplit

ADD REPLY • link 2.6 years ago by c_u ▴ 520

1

Entering edit mode

Ah, I didn't know that csplit was a coreutils utility. macOS's BSD coreutils suck big time, and I've always found it helpful to install GNU-coreutils without prefixes so I can use grep, sed, tar (and now csplit) without adding the g (ggrep, gsed, gtar, gcsplit). BSD's sed especially sucks IMO