How To Split A Multiple Fasta
12
24
13.7 years ago
Gvj ▴ 470

How can I split a multi-FASTA file into separate files of roughly equal size, with the size specified by me? Is there a tool for that? The tool must not split an individual FASTA entry across files.

Gvj

fasta split • 52k views
17
13.7 years ago

I suggest the fastasplitn command. It works like a charm for me. pyfasta has similar functionality:

# split a fasta file into 6 new files of relatively even size:
pyfasta split -n 6 original.fasta
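
For anyone curious how an even split can be done without cutting records, here is a minimal Python sketch. It is not how pyfasta or fastasplitn work internally (this thread does not show that); it is just one simple heuristic: take the records longest first and always put the next one into the bin that holds the fewest bases so far. The names read_fasta and balance_into_bins are made up for illustration.

```python
import heapq


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def balance_into_bins(records, n_bins):
    """Greedy split: longest record first, always into the lightest bin."""
    bins = [[] for _ in range(n_bins)]
    heap = [(0, i) for i in range(n_bins)]  # (bases so far, bin index)
    heapq.heapify(heap)
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        total, i = heapq.heappop(heap)
        bins[i].append((header, seq))
        heapq.heappush(heap, (total + len(seq), i))
    return bins
```

Each bin can then be written out as its own FASTA file. Records are only ever moved whole, so no entry is split.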
9
12.3 years ago

The most efficient one I know of is a GenomeThreader tool (EDIT: it is actually part of GenomeTools):

If you want to constrain the number of output files (here 60):

gt splitfasta -numfiles 60 seqs.fasta

If you want to constrain the size in MB (here 20) of each output file:

gt splitfasta -targetsize 20 seqs.fasta
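
If GenomeTools is not available, the target-size idea is easy to approximate. The Python sketch below is not gt's implementation, only an illustration under simple assumptions: records are kept in their original order, sizes are counted in bytes rather than MB, and a new chunk is started whenever adding the next record would exceed the target. The function names split_records and chunk_by_size are hypothetical.

```python
def split_records(text):
    """Split FASTA text into whole records, each starting with '>'."""
    records, current = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith(">") and current:
            records.append("".join(current))
            current = []
        current.append(line)
    if current:
        records.append("".join(current))
    return records


def chunk_by_size(records, target_bytes):
    """Group consecutive records; open a new chunk before the target is exceeded."""
    chunks, current, size = [], [], 0
    for rec in records:
        rec_size = len(rec.encode("utf-8"))
        # A record larger than the target still goes into a chunk of its own.
        if current and size + rec_size > target_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(rec)
        size += rec_size
    if current:
        chunks.append(current)
    return chunks
```

Writing each chunk to its own file gives output files close to the target size, and a single oversized record is kept whole rather than cut.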
0

The gt program is actually part of the GenomeTools library. The GenomeThreader tool (gth on the command line) is developed by the same group but is used for spliced alignment.

0

You are right ;-) I have edited the answer.

0

Six years later: thanks Manu, that one is fast!

8
13.7 years ago

An alternative to Brent's pyfasta is faSplit, from Kent's source tree:

faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.
Files split by sequence will be broken at the nearest fa record boundary.
Files split by base will be broken at any base.
Files broken by size will be broken every count bases.

Examples:
   faSplit sequence estAll.fa 100 est
This will break up estAll.fa into 100 files
(numbered est001.fa est002.fa, ... est100.fa
Files will only be broken at fa record boundaries

   faSplit base chr1.fa 10 1_
This will break up chr1.fa into 10 files

   faSplit size input.fa 2000 outRoot
This breaks up input.fa into 2000 base chunks

   faSplit about est.fa 20000 outRoot
This will break up est.fa into files of about 20000 bytes each by record.

   faSplit byname scaffolds.fa outRoot/
This breaks up scaffolds.fa using sequence names as file names.
       Use the terminating / on the outRoot to get it to work correctly.

   faSplit gap chrN.fa 20000 outRoot
This breaks up chrN.fa into files of at most 20000 bases each,
at gap boundaries if possible.  If the sequence ends in N's, the last
piece, if larger than 20000, will be all one piece.
0

How can I get faSplit to run? I have downloaded the jksrc folder and I can see ksrc/src/utils/faSplit, but when I run make, it doesn't produce an executable.

2

If you are on a Linux machine, the precompiled binaries can be found here: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/

0

Make sure the compile actually succeeded; the binaries are then placed in ${HOME}/bin/${MACHTYPE}/. That folder must exist before you compile (see the README in the package).

8
13.7 years ago
Rob ▴ 150

I actually wrote one of these yesterday. Here's the code. It's rough, but hopefully understandable; it uses BioPerl. You specify how many sequences you want per file, e.g. ./splitfasta.pl allseqs.fa splitseqs 100, and it creates splitseqs.1, splitseqs.2, splitseqs.3, etc., each with 100 sequences in it.

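#!/usr/bin/perl

use strict;
use warnings;
use Bio::SeqIO;

# usage: ./splitfasta.pl allseqs.fa splitseqs 100
my $from     = shift;    # input FASTA file
my $toprefix = shift;    # prefix for the output files
my $seqs     = shift;    # number of sequences per output file

my $in = Bio::SeqIO->new(-file => $from, -format => 'fasta');

my $count  = 0;
my $fcount = 1;
my $out = Bio::SeqIO->new(-file => ">$toprefix.$fcount", -format => 'fasta');
while (my $seq = $in->next_seq) {
        # Start a new file every $seqs sequences. The $count > 0 guard keeps
        # the first file from being left empty (see the replies below).
        if ($count > 0 && $count % $seqs == 0) {
                $fcount++;
                $out = Bio::SeqIO->new(-file => ">$toprefix.$fcount", -format => 'fasta');
        }
        $out->write_seq($seq);
        $count++;
}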
0

To get the first file populated, change the condition to:

if ($count > 0 && $count % $seqnum == 0) {
0

It should be:

if ($count > 0 && $count % $seqs == 0) {
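
For those without BioPerl, the same "N sequences per file" idea can be sketched in plain Python. This is not a port of the script above, just an illustration that avoids the empty-first-file problem by only opening an output file once a batch of records exists. The names fasta_records and split_by_count are made up, and the output naming (prefix.1, prefix.2, ...) simply mirrors the Perl version.

```python
from itertools import islice


def fasta_records(handle):
    """Yield each FASTA record as a list of its raw lines."""
    record = []
    for line in handle:
        if line.startswith(">") and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record


def split_by_count(in_path, out_prefix, per_file):
    """Write per_file records to out_prefix.1, out_prefix.2, ...; return the paths."""
    written = []
    with open(in_path) as handle:
        records = fasta_records(handle)
        while True:
            batch = list(islice(records, per_file))
            if not batch:
                break
            out_path = f"{out_prefix}.{len(written) + 1}"
            with open(out_path, "w") as out:
                for record in batch:
                    out.writelines(record)
            written.append(out_path)
    return written
```

Calling split_by_count("allseqs.fa", "splitseqs", 100) would correspond to the command line shown in the answer. The last file may hold fewer than 100 sequences.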
7
9.1 years ago
new ▴ 70

You can also use a dodgy little bash script in a pinch e.g.:

csplit -z myfile.fas '/>/' '{*}'

http://41j.com/blog/2011/01/split-fasta-file-into-files-with-one-contig-per-file/
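
On a machine without csplit, the one-record-per-file split can be imitated in a few lines of Python. This is only a sketch under stated assumptions: sequence IDs (the first word of each header) are unique, and any character outside letters, digits, dot, underscore and hyphen is replaced with an underscore to get a safe file name. The function name records_by_name and the .fa suffix are my own choices, not anything csplit does.

```python
import re


def records_by_name(fasta_text):
    """Map a filesystem-safe file name to each whole FASTA record."""
    files = {}
    name = None
    for line in fasta_text.splitlines(keepends=True):
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else "unnamed"
            # Replace characters that are awkward in file names.
            name = re.sub(r"[^A-Za-z0-9._-]", "_", seq_id) + ".fa"
            files[name] = ""
        if name is not None:
            files[name] += line
    return files
```

Each key/value pair can then be written with open(name, "w"). Unlike csplit's numbered xx00, xx01 output, the files are named after the sequences.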

4
12.3 years ago

You could just use csplit in Linux. You can specify how large the files are and use a simple regex to specify what should be at the start of each new file ('>').

http://www.computerhope.com/unix/usplit.htm

1

I knew about this one, and it can be very convenient when I want to do this on a workstation other than mine, where genometools is not installed. But I never figured out how to get more than one sequence per output file... Can you show an example?

0

I can show an example. These commands worked for me:

csplit panTro5.fa /\>chr.*/ {*}

followed by

for a in x*; do echo $a; mv $a $(head -1 $a).txt; done;

As you said, very convenient on machines other than yours.

3
2

Above link is dead.

3
13.7 years ago
hurfdurf ▴ 490

fastasplit in the exonerate package's utilities (bottom of the page) does exactly this.

3
9.5 years ago
Prakki Rama ★ 2.7k

I use fasta_splitter for this purpose. It's good too!

3
7.4 years ago

Another option, from the BBMap package:

partition.sh in=file.fasta out=part%.fasta ways=5

This is multithreaded and very fast. It works on fastq also.

1
7.4 years ago
dukecomeback ▴ 40
#!/usr/bin/perl -w
my $usage= <<EOF;

Split a FASTA file zigzag-style into parts of similar size.
usage: perl $0 x.fa num
Warning: run this in a new, empty directory; output is appended to hehe.N files.
                                                          Du Kang 2017-1-17

EOF
# read sequences into %seq, keyed by the first word of each header
open SEQ, $ARGV[0] or die $usage;
while (<SEQ>) {
        chomp;
        if (/^>/) {
                s/^>//;
                @_=split;
                $name=$_[0];
        }else{
                $seq{$name}.=$_;
        }
}

# rank sequences by length, shortest first
foreach $name (keys %seq){
        $length{$name}=length $seq{$name};
}

@id=sort {$length{$a} <=> $length{$b}} keys %length;

# deal the sequences out to files 1..num in zigzag order (1,2,...,num,num,...,2,1,...)
$filenum=$ARGV[1] or die $usage;
$n=0;
$flag=0;
foreach $name (@id) {
        if ($n>=$flag) {
                if ($n==$filenum) {
                        $flag=$flag+2;
                }else{
                        $flag=$n;
                        $n++;
                }
        }else{
                if ($n==1) {
                        $flag=$flag-2;
                }else{
                        $flag=$n;
                        $n--;
                }
        }
        open OUT, ">>hehe.$n" or die $!;
        print OUT ">$name\n$seq{$name}\n";
        close OUT;
}

The script ranks the sequences by length, then deals them out to the output files in zigzag order (forward through the files, then backward, and so on) so that the resulting files end up nearly equal in size.
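
The zigzag order is easier to see in a few lines of Python. This is a minimal sketch of the dispatch idea only, not a translation of the Perl script: it takes a list of sequence lengths and returns, for each output file, the indices of the sequences assigned to it. The function name zigzag_dispatch is hypothetical, and files are numbered from 0 here rather than from 1.

```python
def zigzag_dispatch(lengths, n_files):
    """Assign items, ranked by length, to files in the order 0..n-1, n-1..0, ..."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    assignment = [[] for _ in range(n_files)]
    for rank, idx in enumerate(order):
        pos = rank % (2 * n_files)
        # First half of each cycle walks forward, second half walks back.
        target = pos if pos < n_files else 2 * n_files - 1 - pos
        assignment[target].append(idx)
    return assignment
```

With lengths 10, 20, 30, 40, 50, 60 and three files, the files receive (10, 60), (20, 50) and (30, 40), i.e. 70 bases each. Real data will not balance this neatly, but pairing short with long sequences keeps the totals close.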

0
9.6 years ago
/**
 * This tool chops a FASTA file into a given number of parts, each holding roughly the same number of sequences.
 */
package devtools.utilities;

import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.lang3.StringUtils;

//import java.util.List;

/**
 * @author Arpit
 * 
 */
public class FileChopper {

    public void chopFile(String fileName, int numOfFiles) throws IOException {
        String outFileName = StringUtils.substringBefore(fileName, ".fasta");
        System.out.println(outFileName);
        // Let any IOException propagate; the method already declares it.
        byte[] allBytes = Files.readAllBytes(Paths.get(fileName));

        String allLines = new String(allBytes, StandardCharsets.UTF_8);
        // Using a clever cheat with help from stackoverflow: mark each record
        // start with '~' so the text can be split into whole records.
        // Assumes '~' never occurs in the file itself.
        String cheatString = allLines.replace(">", "~>");
        String[] splitLines = StringUtils.split(cheatString, "~");
        int startIndex = 0;
        int stopIndex = 0;

        FileWriter fw = null;
        for (int j = 0; j < numOfFiles; j++) {

            fw = new FileWriter(outFileName.concat("_")
                    .concat(Integer.toString(j)).concat(".fasta"));
            if (j == (numOfFiles - 1)) {
                stopIndex = splitLines.length;
            } else {
                stopIndex = stopIndex + (splitLines.length / numOfFiles);
            }
            for (int i = startIndex; i < stopIndex; i++) {
                fw.write(splitLines[i]);
            }
            if (j < (numOfFiles - 1)) {
                startIndex = stopIndex;
            }
            fw.close();
        }

    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        FileChopper fc = new FileChopper();
        try {
            fc.chopFile("H:\\Projects\\Lactobacillus rhamnosus\\Hypothetical proteins sequence 405 LR24.fasta",5);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }

}