Question: Splitting A Fasta File
2
gravatar for andysw90
7.2 years ago by
andysw9020
andysw9020 wrote:

Hi all,

I have a FASTA file which contains protein sequences of a load of genes from D. melanogaster, and need to split the file into multiple FASTAs, one gene per file. What's the best way to go about this? Ideally each file will be named with the name of the gene (UniProt ID).

Example, the sequence below wants to be split into 2 files, the first called O46197.fasta, the second called Q9VUQ5.fasta

Many thanks!

>sp|O46197|A29AB_DROME Accessory gland protein Acp29AB OS=Drosophila melanogaster GN=Acp29AB PE=2 SV=2
MYASNLLYLLALWNLWDLSGGQQDIPNGKATLPSPQTPQNTIDQIGINQNYWFTYNALKQ
NETLAIIDTMEMRIASSLLEFKAQMEIQLQPLKIIMRHHASNIKASNNIKMRRFEKVGSR
HFHIEKNLMQTWFEAYVTCRKMNGHLANIQDEMELDGILALAPNNSYWIDISKLVENGGT
FVSTLTGREPFFVKWKSNQDTKKKNQCVYIYAKEMSYDECFEKKSFVCQADQWA

.
>sp|Q9VUQ5|AGO2_DROME Protein argonaute-2 OS=Drosophila melanogaster GN=AGO2 PE=1 SV=3
MGKKDKNKKGGQDSAAAPQPQQQQKQQQQRQQQPQQLQQPQQLQQPQQLQQPQQQQQQQP
HQQQQQSSRQQPSTSSGGSRASGFQQGGQQQKSQDAEGWTAQKKQGKQQVQGWTKQGQQG
GHQQGRQGQDGGYQQRPPGQQQGGHQQGRQGQEGGYQQRPPGQQQGGHQQGRQGQEGGYQ
QRPSGQQQGGHQQGRQGQEGGYQQRPPGQQQGGHQQGRQGQEGGYQQRPSGQQQGGHQQG
RQGQEGGYQQRPPGQQQGGHQQGRQGQEGGYQQRPPGQQQGGHEQGRQGQEGGYQQRPSG
QQQGGHQQGRQGQEGGYQQRPSGQQQGGHQQGRQGQEGGYQQRPSGQQQGGHQQGRQGQE
GGYQQRPPGQQPNQTQSQGQYQSRGPPQQQQAAPLPLPPQPAGSIKRGTIGKPGQVGINY
LDLDLSKMPSVAYHYDVKIMPERPKKFYRQAFEQFRVDQLGGAVLAYDGKASCYSVDKLP
LNSQNPEVTVTDRNGRTLRYTIEIKETGDSTIDLKSLTTYMNDRIFDKPMRAMQCVEVVL
ASPCHNKAIRVGRSFFKMSDPNNRHELDDGYEALVGLYQAFMLGDRPFLNVDISHKSFPI
SMPMIEYLERFSLKAKINNTTNLDYSRRFLEPFLRGINVVYTPPQSFQSAPRVYRVNGLS
RAPASSETFEHDGKKVTIASYFHSRNYPLKFPQLHCLNVGSSIKSILLPIELCSIEEGQA
LNRKDGATQVANMIKYAATSTNVRKRKIMNLLQYFQHNLDPTISRFGIRIANDFIVVSTR
VLSPPQVEYHSKRFTMVKNGSWRMDGMKFLEPKPKAHKCAVLYCDPRSGRKMNYTQLNDF
GNLIISQGKAVNISLDSDVTYRPFTDDERSLDTIFADLKRSQHDLAIVIIPQFRISYDTI
KQKAELQHGILTQCIKQFTVERKCNNQTIGNILLKINSKLNGINHKIKDDPRLPMMKNTM
YIGADVTHPSPDQREIPSVVGVAASHDPYGASYNMQYRLQRGALEEIEDMFSITLEHLRV
YKEYRNAYPDHIIYYRDGVSDGQFPKIKNEELRCIKQACDKVGCKPKICCVIVVKRHHTR
FFPSGDVTTSNKFNNVDPGTVVDRTIVHPNEMQFFMVSHQAIQGTAKPTRYNVIENTGNL
DIDLLQQLTYNLCHMFPRCNRSVSYPAPAYLAHLVAARGRVYLTGTNRFLDLKKEYAKRT
IVPEFMKKNPMYFV
fasta sequence • 11k views
ADD COMMENTlink modified 7.2 years ago by bhojpuri song download10 • written 7.2 years ago by andysw9020

cross posted on http://seqanswers.com/forums/showthread.php?p=110623

ADD REPLYlink written 7.2 years ago by Pierre Lindenbaum130k

Duplicate of many previous questions; search this site for "split fasta".

ADD REPLYlink written 7.2 years ago by Neilfws48k
9
gravatar for Pierre Lindenbaum
7.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:

try the following awk script:

awk -F '|' '/^>/ {F=sprintf("%s.fasta",$2); print > F;next;} {print >> F;}' < yourfile.fa
ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by Pierre Lindenbaum130k
1

Hi Pierre, your Awk script could be simpler (I like simplicity!): awk -F "|" '/^>/ {F = $2".fasta"} {print > F}' yourfile.fa It reads like this: if the line starts with >, open a new file and write to it until you reach a new line starting with >. Note it is not necessary to use >> with Awk (the output file is only opened once).

ADD REPLYlink written 7.2 years ago by Frédéric Mahé3.0k

Hi Frédéric, both of these scripts work well, however I'm puzzled by how your solution works at all - Pierre's version has the next statement in it, but I'd expect yours to print just the names and not the sequence - but it does :D Do you care to explain where the magic happens please? I'm no expert on awk though, just an advanced copy-paster (meaning I can combine multiple copy-pastes to get stuff working :) ). Thanks

Anyway, this awk solution only gives me 1020 files (Pierre's version 1021) and then throws an error (Too many open files) while I have over 6000 contigs in my fasta. I can use bash split to divide this file into e.g. six smaller files with the default names and then use this to export the individual contigs, but it's less elegant and I want it to be usable by other people. Do you guys know of any simpler workaround?

What I actually want is to split a fasta file from our recent study into contigs with alignment for the 9 individuals and then put the solution on github with the rest of the analysis, for the sake of reproducible research ;)

ADD REPLYlink written 2.8 years ago by cicindel20
1

Hi cicindel,

awk is trying to open a file descriptor for each fasta entry, and you are reaching the maximum number of files that can be opened simultaneously. To avoid that, you just need to close a file as soon as you are done with it. Here is what you need to know about awk's close() command:

close() will silently do nothing if given an argument that does not represent a file, pipe or coprocess that was opened with a redirection.

So, closing the previous file before opening a new one should do the trick. Use >> if you think a file might be opened more than once: awk -F "|" '/^>/ {close(F) ; F = $1".fasta"} {print >> F}' yourfile.fa

ADD REPLYlink written 2.7 years ago by Frédéric Mahé3.0k

Oh cool, thanks - that seems to work exactly as I want it :)

ADD REPLYlink written 2.7 years ago by cicindel20
4
gravatar for SES
7.2 years ago by
SES8.4k
Vancouver, BC
SES8.4k wrote:

There is an EMBOSS program called seqretsplit that was designed exactly for this purpose. Also, there is a script distributed with BioPerl called bp_seqretsplit.pl that performs this same function and does not use EMBOSS. The BioPerl script is only a couple of lines of code and you can modify it easily to write out the type of identifiers you want. For example, this script will produce the desired output:

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::SeqIO;
my $in = Bio::SeqIO->new(-format => 'fasta',
                         -fh   => \*ARGV);
while( my $s = $in->next_seq ) {
    my ($id) = ($s->id =~ /^(?:\w+)\|(\S+)\|/);
    Bio::SeqIO->new(-format => 'fasta',
                    -file   => ">".$id.".fasta")->write_seq($s);
}

Run it like:

$ perl biostars76716.pl < myseqs.fas

Note that there are some issues with the fasta you posted that do not look right, but I'll assume that is just an artifact of formatting your post.

ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by SES8.4k

Top stuff, cheers

ADD REPLYlink written 7.2 years ago by andysw9020
1
gravatar for KCC
7.2 years ago by
KCC4.0k
Cambridge, MA
KCC4.0k wrote:

I thought I would write a python version. I stored the following text in a file called splitfasta.py

import sys
infile = open(sys.argv[1])
outfile = []

for line in infile:
    if line.startswith(">"):
        if (outfile != []): outfile.close()
        genename = line.strip().split('|')[1]
        filename = genename+".txt"
        outfile = open(filename,'w')
        outfile.write(line)
    else:
        outfile.write(line)
outfile.close()

The usage is: python splitfasta.py input.fa

ADD COMMENTlink written 7.2 years ago by KCC4.0k

This is old, but which version of python did you use? With 2.7.2, 2.7.11, and 3.5.2 I get Traceback (most recent call last): File "splitfasta.py", line 13, in <module> outfile.write(line) AttributeError: 'list' object has no attribute 'write'

ADD REPLYlink written 17 months ago by hmg20
0
gravatar for bhojpuri song download
7.2 years ago by
bhojpuri song download10 wrote:

you may like to see couple of solution HERE

ADD COMMENTlink written 7.2 years ago by bhojpuri song download10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1789 users visited in the last hour