Question: Splitting a multifasta file into smaller multifasta files containing a few sequences each
2.3 years ago, kabir.deb0353 (0) wrote:

I am trying to split a large multi-FASTA file into several smaller multi-FASTA files. I have seen several examples of tools that split a multi-FASTA file into single-sequence FASTA files, but none that put more than one sequence in each output file. Specifically, I'm trying to split a FASTA file containing 1300 sequences into 100 smaller files, each holding 13 consecutive sequences (which share the same accession number). Hoping for good suggestions. Thanks in advance.


This post may be helpful.

how to convert a long fasta-file into many separate single fasta sequences

Reply written 2.3 years ago by natasha.sernova (3.4k)
2.3 years ago, Brian Bushnell (16k; Walnut Creek, USA) wrote:

Your description is not completely clear in terms of what you mean by "same accession number"; does everything have the same accession, or are there 100 different accessions, each with 13 sequences? Regardless, you can use the BBMap package like this:

partition.sh in=file.fasta out=out_%.fasta ways=100

That will give you 100 fastas, each with an equal number of sequences. Alternatively, if you want to split by accession number:

demuxbyname.sh in=file.fasta out=out_%.fasta names=names.txt substringmode

If all of your accessions are listed in "names.txt", this will produce one file per accession, containing all the sequences labelled with that accession. Alternatively, if the accession is (for example) the first 12 characters of the sequence names, and you don't have a list of accessions, you could do this:

demuxbyname.sh in=file.fasta out=out_%.fasta prefixmode length=12
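
For readers without BBMap at hand, the prefix-based grouping can be approximated in plain Python. This is only a sketch of the idea and not how demuxbyname.sh is implemented; the function names `read_fasta` and `split_by_prefix`, the `out_<prefix>.fasta` naming, and the default length of 12 are assumptions chosen to mirror the command above. It holds all records in memory and assumes the prefix contains only characters that are valid in a file name.

```python
def read_fasta(path):
    """Yield (header, sequence_lines) for each record in a FASTA file."""
    header, lines = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, lines
                header, lines = line[1:], []
            elif header is not None and line:
                lines.append(line)
    if header is not None:
        yield header, lines


def split_by_prefix(path, length=12, out_template="out_{}.fasta"):
    """Write one file per distinct header prefix; return {prefix: record count}."""
    groups = {}
    for header, lines in read_fasta(path):
        # Hypothetical rule: the accession is the first `length` header characters.
        groups.setdefault(header[:length], []).append((header, lines))
    for prefix, records in groups.items():
        with open(out_template.format(prefix), "w") as out:
            for header, lines in records:
                out.write(">" + header + "\n")
                for seq_line in lines:
                    out.write(seq_line + "\n")
    return {prefix: len(records) for prefix, records in groups.items()}
```

Calling `split_by_prefix("file.fasta", length=12)` would then write one `out_<prefix>.fasta` per distinct 12-character prefix, keeping records in their original order within each file.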
2.3 years ago, Vitis (2.1k; New York) wrote:

Try the BioPerl or Biopython modules for handling FASTA; with them you can write tools to do all kinds of manipulation of FASTA files.

http://bioperl.org/howtos/SeqIO_HOWTO

http://biopython.org/wiki/Documentation
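
As an illustration of the kind of small tool this suggests, here is a minimal sketch that writes consecutive groups of 13 records to numbered files. It deliberately uses only the Python standard library rather than Biopython so that it runs without extra installs; the function name `split_fasta_in_order` and the `chunk_001.fasta` naming scheme are assumptions made for this example.

```python
def split_fasta_in_order(path, per_file=13, out_template="chunk_{:03d}.fasta"):
    """Write consecutive groups of `per_file` records; return the output paths."""
    outputs = []
    out = None
    in_current = 0  # records written to the file that is currently open
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Open a new file before the first record and after every
                # `per_file` records.
                if out is None or in_current == per_file:
                    if out is not None:
                        out.close()
                    name = out_template.format(len(outputs) + 1)
                    out = open(name, "w")
                    outputs.append(name)
                    in_current = 0
                in_current += 1
            if out is not None:  # ignore anything before the first header
                out.write(line)
    if out is not None:
        out.close()
    return outputs
```

For the case in the question, `split_fasta_in_order("input.fasta", per_file=13)` on a 1300-record file should produce 100 files, each with 13 records in their original order. Because the file is streamed line by line, memory use stays small even for large inputs.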

2.3 years ago, Prasad (1.5k; India) wrote:

Try the UCSC utility faSplit, which has many options for splitting a FASTA file. For your case:

faSplit sequence input.fasta 100 out.fasta

where 100 is the number of files the input file will be split into.

2.3 years ago, kabir.deb0353 (0) wrote:

Actually, I was using "pyfasta split -n 100 input.fasta".

The problem is that this way the sequences are distributed among the 100 files randomly, which is not what I'm looking for. Yes Brian, my large multi-FASTA file has 100 different accessions, each with 13 consecutive sequences, for a total of 1300 sequences in input.fasta. That is, accession NC_12345 has 13 FASTA sequences in a row, then the next accession, NC_12346, has 13 FASTA sequences, and so on for 100 accessions. I just need to separate them by accession, i.e. 13 consecutive sequences in each file.


OK, either of the two demuxbyname commands will work, then. The first if you have a list of accessions in a text file; or the second if you don't, but the headers start with the accession.

Also, theoretically, you could do this:

partition.sh in=file.fasta out=temp_%.fasta ways=13
cat temp_*.fasta > catted.fasta
partition.sh in=catted.fasta out=final_%.fasta ways=100

That would give the first 13 records in final_0.fasta, the second 13 records in final_1.fasta, etc. through final_99.fasta. It would work regardless of the names, if you had exactly 1300 records.
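
Why two partitioning passes regroup the records can be checked with a small simulation. The sketch below assumes that partition.sh deals records out round-robin (record i goes to output i mod ways), which is what the trick above relies on; the integers stand in for FASTA records and `round_robin` is a hypothetical helper, not part of BBMap.

```python
def round_robin(records, ways):
    """Deal records into `ways` bins: record i goes to bin i % ways."""
    bins = [[] for _ in range(ways)]
    for i, rec in enumerate(records):
        bins[i % ways].append(rec)
    return bins


records = list(range(1300))        # stand-ins for 1300 FASTA records
temp = round_robin(records, 13)    # first pass: ways=13

# `cat temp_*.fasta` concatenates in shell glob order, which is typically
# lexicographic: temp_0, temp_1, temp_10, temp_11, temp_12, temp_2, ...
glob_order = sorted(range(13), key=str)
catted = [rec for k in glob_order for rec in temp[k]]

final = round_robin(catted, 100)   # second pass: ways=100

# Every final bin should hold exactly one run of 13 consecutive records.
grouped_ok = all(
    sorted(final[j]) == list(range(13 * j, 13 * j + 13)) for j in range(100)
)
print(grouped_ok)  # prints True
```

Under these assumptions each final file contains the right 13 records, but their order inside the file follows the order in which the temporary files were concatenated, so it may not match the original order exactly.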

Reply modified 2.3 years ago, written 2.3 years ago by Brian Bushnell (16k)
2.3 years ago, kabir.deb0353 (0) wrote:

Oh thanks Brian, thank you very much; finally using demuxbyname the problem has been solved...

2.3 years ago, mks002 (160; Bangalore, India) wrote:

Run the Perl script below to get the desired result. The second argument is the number of sequences per output file, so for 13 sequences per file:

perl fasta_per_line.pl fasta_file_name 13

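#!/usr/bin/perl
# Split a multi-FASTA file into files of N sequences each, preserving input order.
# Output files are named <file_name>.1, <file_name>.2, ...
use strict;
use warnings;

my ($file, $f_size) = @ARGV;

if (!$file || !$f_size || $f_size !~ /^\d+$/)
{
    die "\nUSAGE: perl fasta_per_line.pl <file_name> <Number>\n\n<file_name>- Multiple fasta file name\n<Number>- Number of fasta sequences to be put in each new file\n\n";
}

open(my $in, '<', $file) or die "Cannot open $file: $!\n";

my $file_count = 0;   # index of the current output file
my $seq_count  = 0;   # sequences written to the current output file
my $out;

# Records are streamed, so duplicate headers stay separate and
# sequence lines are copied unchanged.
while (my $line = <$in>)
{
    if ($line =~ /^>/)
    {
        # Start a new file before the first record and after every $f_size records.
        if (!defined $out || $seq_count == $f_size)
        {
            close($out) if defined $out;
            $file_count++;
            $seq_count = 0;
            open($out, '>', "$file.$file_count") or die "Cannot write $file.$file_count: $!\n";
        }
        $seq_count++;
    }
    print {$out} $line if defined $out;   # skip any text before the first header
}
close($out) if defined $out;
close($in);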
2.3 years ago, zjhzwang (180) wrote:

You can use the simple Unix command split. The usage is:

split [-n] file [name]

-n is the number of lines each output file contains, so this gives 13 sequences per file only if every record has the same number of lines.
[name] is the prefix of the result files; for example, with test the result files will be testaa, testab, ...
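
Since split counts lines rather than sequences, it can help to check first whether every record spans the same number of lines. The following is a rough sketch of such a check; `record_line_counts` and `lines_for_split` are hypothetical helper names, and 13 records per file is taken from the question.

```python
def record_line_counts(path):
    """Return the set of line counts (header included) seen across all records."""
    counts = set()
    current = 0  # lines seen so far in the current record; 0 before the first header
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if current:
                    counts.add(current)
                current = 1
            elif current:
                current += 1
    if current:
        counts.add(current)
    return counts


def lines_for_split(path, records_per_file=13):
    """Lines per output file for a line-based split, or None if records differ."""
    counts = record_line_counts(path)
    if len(counts) != 1:
        return None  # records differ in length, so a fixed line count cuts mid-record
    return records_per_file * counts.pop()
```

If `lines_for_split("input.fasta")` returns a number, that value is the line count to pass to split; if it returns None, a line-based split would break records apart and one of the record-aware approaches is the safer choice.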

I hope this is useful for you.
