Question: Pyfasta Split By Header
arronslacey (United Kingdom) wrote, 5.5 years ago:

Hi, I am trying to split a FASTA file using pyfasta and write each sequence to its own file, but I'm having some trouble understanding the syntax. I am using:

pyfasta split --header afp2test.fasta afp2test.fasta

This runs, but doesn't split the file.

What is the syntax I need?

written 5.5 years ago by arronslacey • modified 15 months ago by al-ash

I guess you need to specify what you want to split the FASTA file by:

Extract sequences from the file. Use the --header flag to make a new FASTA file. The arguments are a list of sequences to extract.

$ pyfasta extract --header --fasta test/data/three_chrs.fasta seqa seqb seqc

Extract sequences from a file, using a file containing the headers not wanted in the new file:

$ pyfasta extract --header --fasta input.fasta --exclude --file seqids_to_exclude.txt

Extract sequences from a FASTA file with complex keys, where we only want to look up based on the part before the space:

$ pyfasta extract --header --fasta input.with.keys.fasta --space --file seqids.txt
Reply written 5.5 years ago by Phil S.
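
For anyone wondering what the extract options above amount to, here is a minimal Python sketch using only the standard library. It is not pyfasta's implementation: the names `read_fasta` and `extract`, and the `exclude` and `space` parameters, are made up for illustration and only mirror the command-line flags.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence_lines) pairs from an open FASTA handle."""
    header, lines = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line[1:], []
        elif header is not None:
            lines.append(line)
    if header is not None:
        yield header, lines

def extract(handle, ids, exclude=False, space=False):
    """Return records whose key is in ids (or not in ids when exclude=True).

    With space=True the key is only the part of the header before the first space.
    """
    wanted = set(ids)
    kept = []
    for header, lines in read_fasta(handle):
        key = header.split(" ", 1)[0] if space else header
        if (key in wanted) != exclude:
            kept.append((header, lines))
    return kept

# Hypothetical in-memory input standing in for a real FASTA file.
demo = io.StringIO(
    ">seqa first record\nACGT\n>seqb second record\nGGCC\nTTAA\n>seqc third\nAAAA\n"
)
for header, lines in extract(demo, ["seqa", "seqc"], space=True):
    print(">" + header)
    print("\n".join(lines))
```

Passing `exclude=True` keeps every record except the listed ones, which corresponds to the second command above.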

But is there a way to do this without having to write down all the headers on the command line? I have a file with hundreds of sequences and I just want to make a new file for each sequence. The "split" command allows you to do this by specifying the number of files, but no matter what I do with it, the order does not seem to be preserved, and one file always contains two sequences. There is an option to split by header, and I thought this would automatically pick up the individual headers and put each sequence in its own file.

Reply written 5.5 years ago by arronslacey
Manu Prestat (Marseille, France) wrote, 5.5 years ago:

GNU csplit is made for this kind of job:

csplit -z -q -n 4 -f sequence_ sequences.fasta '/^>/' '{*}'

+1, this uses the right tool and looks nice!

Reply written 5.5 years ago by Phil S.
Phil S. (Stuttgart, Germany) wrote, 5.5 years ago:

This will do the job:

 awk 'BEGIN {n_seq=0;} /^>/ {if(n_seq%1==0){file=sprintf("myseq%d.fa",n_seq);} print >> file; n_seq++; next;} { print >> file; }'   < sequences.fa

It will generate a file for each of the sequences in your file 'sequences.fa'.
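
The same idea can be sketched in Python for anyone who prefers it to awk. This is a rough equivalent under a few assumptions: `split_numbered`, its `out_dir` and `prefix` parameters, and the temporary demo file are all made up for illustration, and unlike the `>>` redirection in the awk command it overwrites existing output files instead of appending to them.

```python
import os
import tempfile

def split_numbered(fasta_path, out_dir, prefix="myseq"):
    """Write each FASTA record to out_dir/<prefix><n>.fa, numbered in input order."""
    paths = []
    out = None
    try:
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    # A header starts a new record: close the previous file, open the next.
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, "%s%d.fa" % (prefix, len(paths)))
                    out = open(path, "w")
                    paths.append(path)
                if out is not None:  # ignore anything before the first header
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return paths

# Small demonstration in a temporary directory (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "sequences.fa")
    with open(source, "w") as fh:
        fh.write(">seqa\nACGT\n>seqb\nGGCC\nTTAA\n")
    written = split_numbered(source, tmp)
    print([os.path.basename(p) for p in written])
```

Only one output file is open at a time, so the number of sequences is not limited by the open-file limit, and the files are numbered in the order the records appear in the input.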

al-ash (Japan/Okinawa/OIST) wrote, 15 months ago:

See https://pypi.python.org/pypi/pyfasta/

Split the FASTA file into one new file per header, with "%(seqid)s" being filled into each filename:
$ pyfasta split --header "%(seqid)s.fasta" original.fasta

You need to specify that the sequence ID (the name of each record within the multi-FASTA file) will be used as the name of the new files. The seqid placeholder in the filename template is used for that.
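
The "%(seqid)s" part is ordinary Python %-style string formatting with a named key. Here is a small sketch of how such a template turns a header into a filename. The helper `filename_for` is hypothetical, and the character clean-up step is my own assumption; pyfasta may treat headers containing spaces or special characters differently.

```python
import re

def filename_for(header_line, template="%(seqid)s.fasta"):
    """Fill a '%(seqid)s' style template from a FASTA header line."""
    seqid = header_line.lstrip(">").strip()
    # Assumption: replace characters that are awkward in filenames with "_".
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", seqid)
    return template % {"seqid": safe}

for header in [">seqa", ">chr1 Homo sapiens", ">sp|P12345|ABC_HUMAN"]:
    print(filename_for(header))
```

This is also why the template is quoted on the command line: the shell would otherwise try to interpret the parentheses itself.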
