Question

How To Split A Multiple Fasta

24

13.7 years ago

Gvj ▴ 470

How can I split a multi-FASTA file into separate files of roughly equal, user-specified size? Is there a tool for this? The tool must not split an individual FASTA entry across files.

Gvj

fasta split • 52k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 13.7 years ago by Gvj ▴ 470

Ram · Answer 1 · 2010-08-23

17

13.7 years ago

Aaronquinlan 12k

I suggest the fastasplitn command. It works like a charm for me. pyfasta has similar functionality:

# split a fasta file into 6 new files of relatively even size:
pyfasta split -n 6 original.fasta
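
For readers who want to see what "relatively even size" can mean in practice, here is a minimal sketch in Python (standard library only). It is not pyfasta's implementation; it assumes a simple greedy strategy in which the longest sequences are placed first, each into whichever output bin currently holds the fewest bases. The function names and the `prefix.N.fasta` naming are made up for illustration.

```python
import heapq

def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def balance(records, n):
    """Greedily assign records to n bins, longest first, always filling
    the bin that currently holds the fewest bases."""
    bins = [[] for _ in range(n)]
    heap = [(0, i) for i in range(n)]  # (total bases in bin, bin index)
    heapq.heapify(heap)
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        total, i = heapq.heappop(heap)
        bins[i].append((header, seq))
        heapq.heappush(heap, (total + len(seq), i))
    return bins

def write_bins(bins, prefix):
    """Write each bin to prefix.0.fasta, prefix.1.fasta, ... (hypothetical naming)."""
    for i, recs in enumerate(bins):
        with open(f"{prefix}.{i}.fasta", "w") as fh:
            for header, seq in recs:
                fh.write(f">{header}\n{seq}\n")
```

Typical use would be `write_bins(balance(parse_fasta(open("original.fasta").read()), 6), "original")`. The whole file is held in memory, so this only suits inputs that fit comfortably in RAM.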

ADD COMMENT • link updated 5.4 years ago by Ram 43k • written 13.7 years ago by Aaronquinlan 12k

Ram · Answer 2 · 2012-01-18

9

12.3 years ago

Manu Prestat 4.1k

The most efficient tool I know of is splitfasta from GenomeTools (EDIT: I originally wrote GenomeThreader):

If you want to constrain the number of output files (here 60):

gt splitfasta -numfiles 60 seqs.fasta

If you want to constrain the size in MB (here 20) of each output file:

gt splitfasta -targetsize 20 seqs.fasta
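
If GenomeTools is not available, the same idea (a target file size, with breaks only at record boundaries) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how gt splitfasta is implemented: the function name and the `prefix.N` output naming are hypothetical, and the size is counted in characters, which equals bytes for plain ASCII FASTA.

```python
def split_by_size(path, target_bytes, prefix):
    """Copy records from `path` into prefix.1, prefix.2, ... and start a new
    file once the current one has reached target_bytes. Files are only
    broken at '>' header lines, so no record is ever split."""
    part, written, out = 0, 0, None
    names = []
    with open(path) as src:
        for line in src:
            if line.startswith(">") and (out is None or written >= target_bytes):
                if out is not None:
                    out.close()
                part += 1
                names.append(f"{prefix}.{part}")
                out = open(names[-1], "w")
                written = 0
            if out is not None:  # skip anything before the first header
                out.write(line)
                written += len(line)  # characters; same as bytes for ASCII
    if out is not None:
        out.close()
    return names
```

For example, `split_by_size("seqs.fasta", 20 * 1024 * 1024, "seqs.fasta")` would aim for pieces of about 20 MB. Each piece can overshoot the target by up to one record, because a record is never cut.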

ADD COMMENT • link updated 5.4 years ago by Ram 43k • written 12.3 years ago by Manu Prestat 4.1k

0

The gt program is actually part of the GenomeTools library. The GenomeThreader tool (gth on the command line) is developed by the same group but is used for spliced alignment.

ADD REPLY • link 12.3 years ago by Daniel Standage 4.1k

0

You are right ;-) I have edited the answer.

ADD REPLY • link 12.3 years ago by Manu Prestat 4.1k

0

Six years later: thanks Manu, that one is fast!

ADD REPLY • link 6.2 years ago by jnesme • 0

Ram · Answer 3 · 2010-08-23

8

13.7 years ago

Haibao Tang 3.0k

An alternative to Brent's pyfasta is faSplit, from Kent's source tree:

faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.
Files split by sequence will be broken at the nearest fa record boundary.
Files split by base will be broken at any base.
Files broken by size will be broken every count bases.

Examples:
   faSplit sequence estAll.fa 100 est
This will break up estAll.fa into 100 files
(numbered est001.fa est002.fa, ... est100.fa
Files will only be broken at fa record boundaries

   faSplit base chr1.fa 10 1_
This will break up chr1.fa into 10 files

   faSplit size input.fa 2000 outRoot
This breaks up input.fa into 2000 base chunks

   faSplit about est.fa 20000 outRoot
This will break up est.fa into files of about 20000 bytes each by record.

   faSplit byname scaffolds.fa outRoot/
This breaks up scaffolds.fa using sequence names as file names.
       Use the terminating / on the outRoot to get it to work correctly.

   faSplit gap chrN.fa 20000 outRoot
This breaks up chrN.fa into files of at most 20000 bases each,
at gap boundaries if possible.  If the sequence ends in N's, the last
piece, if larger than 20000, will be all one piece.

ADD COMMENT • link updated 5.4 years ago by Ram 43k • written 13.7 years ago by Haibao Tang 3.0k

0

How do I get faSplit to run? I have downloaded the jksrc folder and can see ksrc/src/utils/faSplit, but when I run make there it does not produce an executable.

ADD REPLY • link 13.7 years ago by Gvj ▴ 470

2

If you are on a Linux machine, the precompiled binaries can be found here: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/

ADD REPLY • link 10.0 years ago by Vivek ★ 2.7k

0

Make sure the compilation succeeded; the binaries are then placed in ${HOME}/bin/${MACHTYPE}/. That folder must exist before you compile (see the README in the package).

ADD REPLY • link 13.7 years ago by Haibao Tang 3.0k

Ram · Answer 4 · 2010-08-25

8

13.7 years ago

Rob ▴ 150

I actually wrote one of these yesterday. Here's the code. It's rough, but hopefully understandable - it uses BioPerl. The idea is that you'll specify how many sequences you want per file e.g. ./splitfasta.pl allseqs.fa splitseqs 100 - it'll create splitseqs.1,2,3 etc each with 100 sequences in them.

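#!/usr/bin/perl
# Split a FASTA file into files holding a fixed number of sequences each.
# Usage: ./splitfasta.pl allseqs.fa splitseqs 100

use strict;
use warnings;
use Bio::SeqIO;

my $from     = shift;   # input FASTA file
my $toprefix = shift;   # prefix for the output files
my $seqs     = shift;   # number of sequences per output file

my $in = Bio::SeqIO->new(-file => $from);

my $count  = 0;
my $fcount = 1;
my $out = Bio::SeqIO->new(-file => ">$toprefix.$fcount", -format => 'fasta');
while (my $seq = $in->next_seq) {
        # Start a new output file every $seqs sequences. The $count > 0 guard,
        # suggested in the replies below, keeps the first file from staying empty.
        if ($count > 0 && $count % $seqs == 0) {
                $fcount++;
                $out = Bio::SeqIO->new(-file => ">$toprefix.$fcount", -format => 'fasta');
        }
        $out->write_seq($seq);
        $count++;
}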

ADD COMMENT • link updated 5.4 years ago by Ram 43k • written 13.7 years ago by Rob ▴ 150

0

To get the first file populated, the condition should read:

if ($count > 0 && $count % $seqnum == 0) {

ADD REPLY • link updated 5.4 years ago by Ram 43k • written 6.9 years ago by Stephane Plaisance ▴ 460

0

It should be:

if ($count > 0 && $count % $seqs == 0) {

ADD REPLY • link 3.0 years ago by biomonte ▴ 220

Ram · Answer 5 · 2015-03-04

7

9.1 years ago

new ▴ 70

You can also use a dodgy little bash script in a pinch e.g.:

csplit -z myfile.fas '/>/' '{*}'

http://41j.com/blog/2011/01/split-fasta-file-into-files-with-one-contig-per-file/
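
On systems without csplit, the one-record-per-file split can be sketched in Python as follows. This is a minimal illustration, not a port of csplit: the function name is hypothetical, files are named after the first word of each header (with unusual characters replaced by underscores), and records that share an ID would overwrite one another.

```python
import os
import re

def split_per_record(path, outdir):
    """Write every record in `path` to its own file in `outdir`, named after
    the first word of the header line (sanitised for use as a file name)."""
    os.makedirs(outdir, exist_ok=True)
    out, names = None, []
    with open(path) as src:
        for line in src:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                words = line[1:].split()
                seq_id = words[0] if words else f"record{len(names) + 1}"
                # Replace characters that are awkward in file names.
                safe = re.sub(r"[^A-Za-z0-9._-]", "_", seq_id)
                names.append(os.path.join(outdir, safe + ".fa"))
                out = open(names[-1], "w")
            if out is not None:  # ignore anything before the first header
                out.write(line)
    if out is not None:
        out.close()
    return names
```

A call such as `split_per_record("myfile.fas", "contigs")` returns the list of files it wrote.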

ADD COMMENT • link updated 5.4 years ago by Ram 43k • written 9.1 years ago by new ▴ 70

score 4 · Answer 6 · 2012-01-19

4

12.3 years ago

Kieren Lythgow ▴ 110

You could just use csplit in Linux. You can specify how large the files are and use a simple regex to specify what should be at the start of each new file ('>').

http://www.computerhope.com/unix/usplit.htm

ADD COMMENT • link 12.3 years ago by Kieren Lythgow ▴ 110

1

I knew about this one, and it can be very convenient when I need to do this on a workstation other than mine where GenomeTools is not installed. But I never figured out how to get more than one sequence per output file. Can you show an example?

ADD REPLY • link 12.3 years ago by Manu Prestat 4.1k

0

I can show an example. These commands worked for me:

csplit panTro5.fa /\>chr.*/ {*}

followed by

for a in x*; do echo $a; mv $a $(head -1 $a).txt; done

As you said, very convenient on machines other than your own.

ADD REPLY • link 7.1 years ago by Biomonika (Noolean) 3.2k
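
For the question above about getting more than one sequence per output file without installing extra tools, one possible approach is a short Python function using only the standard library. This is a sketch under assumptions: the function name and the `prefix.N` output naming are invented for illustration, and a new file is started every `per_file` header lines.

```python
def split_by_count(path, per_file, prefix):
    """Write `per_file` records to each of prefix.1, prefix.2, ..."""
    if per_file < 1:
        raise ValueError("per_file must be at least 1")
    count, out, names = 0, None, []
    with open(path) as src:
        for line in src:
            if line.startswith(">"):
                if count % per_file == 0:  # time to start the next file
                    if out is not None:
                        out.close()
                    names.append(f"{prefix}.{len(names) + 1}")
                    out = open(names[-1], "w")
                count += 1
            if out is not None:  # ignore anything before the first header
                out.write(line)
    if out is not None:
        out.close()
    return names
```

For example, `split_by_count("panTro5.fa", 5, "panTro5.part")` would put five sequences in each file, with any remainder in the last one.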

h.mon · Answer 7 · 2010-08-23

3

13.7 years ago

Paulo Nuin ★ 3.7k

Check here

http://python.genedrift.org/2007/10/10/alternative-methods-to-split-a-fasta-file/

ADD COMMENT • link updated 6.3 years ago by h.mon 35k • written 13.7 years ago by Paulo Nuin ★ 3.7k

2

Above link is dead.

ADD REPLY • link 12.4 years ago by boczniak767 ▴ 850

0

archived link: https://web.archive.org/web/20071027112709/http://python.genedrift.org/2007/10/10/alternative-methods-to-split-a-fasta-file/

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by terence.strobaugh • 0

Ram · Answer 8 · 2010-08-24

3

13.7 years ago

hurfdurf ▴ 490

fastasplit in the exonerate package's utilities (bottom of the page) does exactly this.

ADD COMMENT • link updated 5.4 years ago by Ram 43k • written 13.7 years ago by hurfdurf ▴ 490

Ram · Answer 9 · 2014-10-13

3

9.5 years ago

Prakki Rama ★ 2.7k

I use fasta_splitter for this purpose. It's good too!

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Prakki Rama ★ 2.7k

score 3 · Answer 10 · 2016-11-29

3

7.4 years ago

Brian Bushnell 20k

Another option, from the BBMap package:

partition.sh in=file.fasta out=part%.fasta ways=5

This is multithreaded and very fast. It works on fastq also.

ADD COMMENT • link 7.4 years ago by Brian Bushnell 20k

Ram · Answer 11 · 2016-11-29

#!/usr/bin/perl -w
my $usage= <<EOF;

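Split a FASTA file into num files, dealing the sequences out in zigzag order.
usage: perl $0 x.fa num
Warning: the output files (hehe.1, hehe.2, ...) are opened in append mode,
so run this in a new, empty directory.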
                                                          Du Kang 2017-1-17

EOF
#obtain seq
open SEQ, $ARGV[0] or die $usage;
while (<SEQ>) {
        chomp;
        if (/>/) {
                s/>//;
                @_=split;
                $name=$_[0];
                }else{
                        $seq{$name}.=$_;
                     }
}

#rank sequences by length
foreach $name (keys %seq){
        $length{$name}=length $seq{$name};
}

@id=sort {$length{$a} <=> $length{$b}} keys %length;

#zigzag out each seq
$filenum=$ARGV[1] or die $usage;
$n=0;
$flag=0;
foreach $name (@id) {
        if ($n>=$flag) {
                if ($n==$filenum) {
                        $flag=$flag+2;
                }else{
                        $flag=$n;
                        $n++;
                }
        }else{
                if ($n==1) {
                        $flag=$flag-2;
                }else{
                        $flag=$n;
                        $n--;
                }
        }
        open OUT, ">>hehe.$n" or die $!;
        print OUT ">$name\n$seq{$name}\n";
        close OUT;
}

The script ranks the sequences by length, then deals them out to the output files in zigzag order (1, 2, ..., n, n, ..., 2, 1, 1, 2, ...) so that the resulting files end up nearly equal in size.
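
The zigzag dispatch is easier to follow in isolation. The sketch below, in Python, shows only the assignment step: given the sequence lengths and a number of bins, it returns the bin each sequence would go to. It is an illustration of the idea rather than a translation of the Perl script; the function name is hypothetical and bins are numbered from 0 instead of 1.

```python
def zigzag_bins(lengths, n):
    """Return a bin index (0..n-1) for each item, visiting items from
    shortest to longest and sweeping the bins 0..n-1, n-1..0, 0..n-1, ..."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    assignment = [None] * len(lengths)
    for rank, i in enumerate(order):
        sweep, pos = divmod(rank, n)
        # Even sweeps run left to right, odd sweeps right to left.
        assignment[i] = pos if sweep % 2 == 0 else n - 1 - pos
    return assignment

lengths = [100, 200, 300, 400, 500, 600, 700, 800]
bins = zigzag_bins(lengths, 2)
totals = [sum(l for l, b in zip(lengths, bins) if b == k) for k in range(2)]
print(bins)    # [0, 1, 1, 0, 0, 1, 1, 0]
print(totals)  # [1800, 1800]
```

Reversing direction on every sweep is what keeps the totals close: a bin that receives the shorter sequence in one sweep receives the longer one in the next.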

Ram · Answer 12 · 2014-09-14

/**
 * This tool splits a FASTA file into a given number of files, each holding
 * roughly the same number of sequences.
 */
package devtools.utilities;

import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.lang3.StringUtils;

//import java.util.List;

/**
 * @author Arpit
 * 
 */
public class FileChopper {

    /**
     * Splits a FASTA file into numOfFiles files holding an equal number of
     * records each; the last file also receives any remainder.
     */
    public void chopFile(String fileName, int numOfFiles) throws IOException {
        String outFileName = StringUtils.substringBefore(fileName, ".fasta");
        System.out.println(outFileName);

        // Reads the whole file into memory, so this only suits small inputs.
        byte[] allBytes = Files.readAllBytes(Paths.get(fileName));
        String allLines = new String(allBytes, StandardCharsets.UTF_8);

        // Split in front of every '>' that starts a line, so each element is
        // one complete FASTA record (header plus sequence lines).
        String[] records = allLines.split("(?m)(?=^>)");
        int recordsPerFile = records.length / numOfFiles;
        int startIndex = 0;

        for (int j = 0; j < numOfFiles; j++) {
            // The last file takes whatever is left over.
            int stopIndex = (j == numOfFiles - 1) ? records.length
                    : startIndex + recordsPerFile;
            // try-with-resources closes the writer even if a write fails.
            try (FileWriter fw = new FileWriter(outFileName + "_" + j + ".fasta")) {
                for (int i = startIndex; i < stopIndex; i++) {
                    fw.write(records[i]);
                }
            }
            startIndex = stopIndex;
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        FileChopper fc = new FileChopper();
        try {
            fc.chopFile("H:\\Projects\\Lactobacillus rhamnosus\\Hypothetical proteins sequence 405 LR24.fasta",5);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }

}