How to make fasta manipulation more efficient
2
0
Entering edit mode
7 weeks ago
c_u ▴ 480

I have a Multi-Sequence FASTA file of the form -

>SDF123.1 blah blah 
ATCTCTGGAAACTCGGTGAAAGAGAGTAT
AGTGATGAGGATGAGTGAG...
>SBF123.1 blah blah 
ATCTCTGGAAACTCGGTGAAAGAGAGTAT
AGTGATGAGGATGAGTGAG....

And I want to extract the individual FASTA files into individual files (like here)

I wrote the following AWK code, but it runs too slow, as compared to when I did not have the close command in it. By slow, I mean it only generates about a dozen files in a minute. I had to incorporate the close command, since without it, I was getting the awk error - too many open files.

Here is the code -

cat big_multi_sequence_file.fasta | awk -F ' ' '{
        if (substr($0, 1, 1)==">") {filename=(substr($1,2) ".fa")}
        print $0 >> filename; close (filename)
}'

How can I make this code more time efficient? I am new to awk.

Thank you!

fasta sequence • 664 views
ADD COMMENT
0
Entering edit mode

If you are willing to try other solutions then faSplit (LINK for UNIX version) by Jim Kent is probably going to be one of the most efficient options. It is available for Linux/macOS. Be sure to add execute permissions (chmod a+x faSplit).

ADD REPLY
3
Entering edit mode
7 weeks ago
csplit input.fasta '%^>%' '/^>/'  "{*}"
ADD COMMENT
0
Entering edit mode

I'm on a Mac, so this doesnt work but I found a way that it can be made to work in a Mac. Thank you!

ADD REPLY
0
Entering edit mode

If you encountered a problem with the csplit tool and figured out a solution, please share the problem and the solution so others can benefit from your work on this.

ADD REPLY
1
Entering edit mode

Sure. So the issue is that {*} doesn't work in Mac, so I found a solution here wherein you can install coreutils on Mac using Homebrew, and then use gcsplit to emulate csplit

ADD REPLY
1
Entering edit mode

Ah, I didn't know that csplit was a coreutils utility. macOS's BSD coreutils suck big time, and I've always found it helpful to install GNU-coreutils without prefixes so I can use grep, sed, tar (and now csplit) without adding the g (ggrep, gsed, gtar, gcsplit). BSD's sed especially sucks IMO

ADD REPLY
0
Entering edit mode

Interesting, will try that. Thanks!

ADD REPLY
0
Entering edit mode
7 weeks ago
Congcong • 0

You should use seqkit

https://bioinf.shenwei.me/seqkit/

ADD COMMENT

Login before adding your answer.

Traffic: 2639 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6