Splitting a multifasta file into smaller multifasta files containing a few sequences each
4.9 years ago

I am trying to split a large multifasta file into several smaller multifasta files. I have seen several examples of tools being used to split multifasta files into single-sequence FASTA files, but none that put more than one sequence in each output file. Specifically, I'm trying to split a FASTA file containing 1300 sequences into 100 small files, each holding 13 consecutive sequences (same accession number). Hoping for good suggestions. Thanks in advance.

multifasta split small multifasta • 3.7k views
4.9 years ago

Your description is not completely clear in terms of what you mean by "same accession number"; does everything have the same accession, or are there 100 different accessions, each with 13 sequences? Regardless, you can use the BBMap package like this:

partition.sh in=file.fasta out=out_%.fasta ways=100


That will give you 100 fastas, each with an equal number of sequences. Alternatively, if you want to split by accession number:

demuxbyname.sh in=file.fasta out=out_%.fasta names=names.txt substringmode


If all of your accessions are listed in "names.txt", this will produce one file per accession, containing all the sequences labelled with that accession. Alternatively, if the accession is (for example) the first 12 characters of the sequence names, and you don't have a list of accessions, you could do this:

demuxbyname.sh in=file.fasta out=out_%.fasta prefixmode length=12
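
If you want to see what prefix-based splitting amounts to, here is a minimal Python sketch of the same idea. It is not how demuxbyname.sh is implemented; the function names (`read_fasta`, `group_by_prefix`, `write_groups`) and the output pattern are made up for illustration, and it assumes the accession really is the first `length` characters of each header.

```python
def read_fasta(lines):
    """Yield (header, sequence_lines) tuples from an iterable of FASTA lines."""
    header, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, body
            header, body = line[1:], []
        elif header is not None and line:
            body.append(line)
    if header is not None:
        yield header, body


def group_by_prefix(lines, length):
    """Map each header prefix of `length` characters to its records, in input order."""
    groups = {}
    for header, body in read_fasta(lines):
        groups.setdefault(header[:length], []).append((header, body))
    return groups


def write_groups(groups, pattern="out_{}.fasta"):
    """Write one file per prefix; `pattern` is a hypothetical naming scheme."""
    for prefix, records in groups.items():
        with open(pattern.format(prefix), "w") as handle:
            for header, body in records:
                handle.write(">" + header + "\n")
                for seq_line in body:
                    handle.write(seq_line + "\n")
```

Something like `write_groups(group_by_prefix(open("file.fasta"), 12))` would then produce one file per 12-character prefix. Everything is held in memory, which is fine for 1300 records but not for very large files.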

4.9 years ago
Vitis ★ 2.5k

Try using the BioPerl or Biopython modules for handling FASTA; with them you can create tools to do all kinds of manipulation of FASTA files.

http://bioperl.org/howtos/SeqIO_HOWTO

http://biopython.org/wiki/Documentation
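
As a rough idea of what such a tool could look like, here is a small sketch that writes consecutive groups of records to numbered files. To keep it dependency-free it uses only the Python standard library rather than Biopython; the names `fasta_records`, `chunked` and `split_fasta` and the output pattern are hypothetical, not part of any library.

```python
from itertools import islice


def fasta_records(handle):
    """Yield each FASTA record as one string (header line plus sequence lines)."""
    record = []
    for line in handle:
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        # Lines before the first header are ignored.
        if line.startswith(">") or record:
            record.append(line)
    if record:
        yield "".join(record)


def chunked(records, size):
    """Yield consecutive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def split_fasta(in_path, out_pattern, size):
    """Write consecutive chunks of `size` records to numbered files; return the paths."""
    paths = []
    with open(in_path) as handle:
        for index, chunk in enumerate(chunked(fasta_records(handle), size)):
            path = out_pattern.format(index)
            with open(path, "w") as out:
                out.writelines(chunk)
            paths.append(path)
    return paths
```

For the case in the question, `split_fasta("input.fasta", "part_{}.fasta", 13)` would put records 1 to 13 in part_0.fasta, records 14 to 26 in part_1.fasta, and so on, regardless of what the headers contain.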

4.9 years ago

Try the UCSC utility faSplit, which has many options for splitting a FASTA file. For your case:

faSplit sequence input.fasta 100 out.fasta


where 100 is the number of files the input file will be split into, and the last argument is the root name for the output files.

4.9 years ago

Actually, I was using "pyfasta split -n 100 input.fasta".

The problem is that, done this way, the sequences are distributed across the 100 files seemingly at random, which is not what I'm looking for. Yes Brian, my large multifasta file has 100 different accessions, each with 13 sequences in a row, for a total of 1300 sequences in input.fasta. That is, accession NC_12345 has 13 consecutive FASTA sequences, then accession NC_12346 has its 13 FASTA sequences, and so on for 100 accessions. Now I just need to separate them by accession, i.e. the first 13 sequences in one file, the next 13 in the next file, and so on.


OK, either of the two demuxbyname commands will work, then. The first if you have a list of accessions in a text file; or the second if you don't, but the headers start with the accession.

Also, theoretically, you could do this:

partition.sh in=file.fasta out=temp_%.fasta ways=13
cat temp_*.fasta > catted.fasta
partition.sh in=catted.fasta out=final_%.fasta ways=100


That would give the first 13 records in final_0.fasta, the second 13 records in final_1.fasta, etc. through final_99.fasta. It would work regardless of the names, if you had exactly 1300 records.
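
To see why the two-pass trick groups consecutive records, here is a small simulation on record indices. It assumes that partition.sh deals records out round-robin (record i goes to output file i mod ways), which is inferred from the result described above rather than taken from the BBMap documentation; `round_robin` and `two_pass` are illustrative names only.

```python
def round_robin(records, ways):
    """Deal records into `ways` bins: record i goes to bin i % ways."""
    bins = [[] for _ in range(ways)]
    for i, record in enumerate(records):
        bins[i % ways].append(record)
    return bins


def two_pass(records, group_size, n_groups):
    """Mimic: partition ways=group_size, cat temp_*, partition ways=n_groups."""
    temp = round_robin(records, group_size)
    # `cat temp_*.fasta` expands the glob in lexicographic order of the
    # file names (temp_0, temp_1, temp_10, ..., temp_2, ...).
    order = sorted(range(group_size), key=str)
    catted = [record for k in order for record in temp[k]]
    return round_robin(catted, n_groups)


# 1300 records numbered 0..1299, 13 per accession, 100 accessions.
final = two_pass(list(range(1300)), 13, 100)

# Every final bin should hold exactly one run of 13 consecutive records.
all_grouped = all(
    sorted(final[k]) == list(range(13 * k, 13 * k + 13)) for k in range(100)
)
```

Under that assumption each final file receives the right 13 records, although because of the glob order they may not appear in their original order within the file. The arithmetic only works when every temp file has exactly 100 records, which is why the record count has to be exactly 1300.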

4.9 years ago

Oh thanks Brian, thank you very much; finally using demuxbyname the problem has been solved...

4.9 years ago
mks002 ▴ 200

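Run the Perl script below, giving it the number of sequences you want in each output file (13 in your case), and you will get the desired result. Output files are named <file_name>.1, <file_name>.2, and so on.

perl fasta_per_line.pl fasta_file_name 13

#!/usr/bin/env perl
# Split a multi-FASTA file into consecutive chunks of <Number> records each.
# Records are streamed, so repeated headers are handled correctly.
use strict;
use warnings;

my ($file, $f_size) = @ARGV;

if (!$file || !$f_size || $f_size !~ /^\d+$/)
{
    die "\nUSAGE: perl fasta_per_line.pl <file_name> <Number>\n\n<file_name>- Multiple fasta file name\n<Number>- Number of fasta sequences to be put in each new file (Must be less than total number of sequences in input file)\n\n";
}

open(my $in, '<', $file) or die "Cannot open $file: $!\n";

my $file_count = 0;    # number of output files opened so far
my $seq_count  = 0;    # records written to the current output file
my $out;

while (my $line = <$in>)
{
    if ($line =~ /^>/)
    {
        # Open a new output file for the first record and whenever the current file is full.
        if (!$out || $seq_count == $f_size)
        {
            close($out) if $out;
            $file_count++;
            $seq_count = 0;
            open($out, '>', "$file.$file_count") or die "Cannot write $file.$file_count: $!\n";
        }
        $seq_count++;
    }
    print {$out} $line if $out;    # lines before the first header are skipped
}
close($out) if $out;
close($in);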

4.9 years ago
zjhzwang ▴ 180

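You can use the simple Unix command split. Note that it splits by line count, so it only keeps records intact if every record has the same number of lines. The usage is: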

split [-n] file [name]


-n is the number of lines each output file contains.
[name] is the prefix of the output file names; for example, with test the output files will be testaa, testab, ...
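
Since a line-based split is only safe when all records span the same number of lines, a quick check like the following sketch can tell you whether that holds and which line count to use. The function names `record_line_counts` and `lines_per_chunk` are made up for illustration.

```python
def record_line_counts(lines):
    """Return the number of lines (header included) in each FASTA record."""
    counts = []
    for line in lines:
        if line.startswith(">"):
            counts.append(1)
        elif counts:
            counts[-1] += 1
    return counts


def lines_per_chunk(lines, records_per_file):
    """Return the line count for a line-based split, or None if records differ in length."""
    counts = record_line_counts(lines)
    if not counts or len(set(counts)) != 1:
        return None
    return counts[0] * records_per_file
```

For example, if every record is one header line plus one sequence line, `lines_per_chunk(open("input.fasta"), 13)` gives 26, so each output file should contain 26 lines. If it gives None, the records have different line counts and a line-based split would cut some of them in half.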

Hope this is useful for you.