How To Split One Big Sequence File Into Multiple Files With Less Than 1000 Sequences In A Single File
7
9
Entering edit mode
13.0 years ago
Hamilton ▴ 290

Hi,

I am too native here, sorry first for trivial question,

I am trying to split one big sequence FASTA file into multiple files with less than 1000 sequences in a single file.

how can i generate them?

Any inputs will be very appreciated,

next-gen sequencing fasta rna r perl • 94k views
ADD COMMENT
28
Entering edit mode
13.0 years ago

using awk:

awk 'BEGIN {n_seq=0;} /^>/ {if(n_seq%1000==0){file=sprintf("myseq%d.fa",n_seq);} print >> file; n_seq++; next;} { print >> file; }' < sequences.fa
ADD COMMENT
0
Entering edit mode

Works fine, but if input file is large enough it prints error: awk: cannot open "myseq25525.fa" for output (Too many open files). But, as in my case, one can split larger than 25000 sequence file to few smaller.

ADD REPLY
0
Entering edit mode

It works good, but if I have to split my file (60.000 sequences) to files with 25 sequences I get an error awk: cannot open "myseq25525.fa" for output (Too many open files). But I can just split my large file to smaller parts.

ADD REPLY
0
Entering edit mode

Hi,

How can I change the output format from myseq0000.fa, myseq1000.fa..... to include the name of input fasta sequence.

E.g.,

Input fasta file name: ABC.fa

Output file names: ABC_myseq0000.fa

Your inputs will be much appreciated.

Vijay

ADD REPLY
0
Entering edit mode

Thank you, it is very easy way of splitting big fasta file. :)

ADD REPLY
0
Entering edit mode

hello, is there any awk command for extracting upstream and downstream region fron the seq?

ADD REPLY
0
Entering edit mode

I am sorry to ask a certainly silly question but I am trying to do this with a fasta file containing 300 000 contigs to separate in 6 files of less than 50 000 sequences. I have made a test.fa file containing 15 sequences to try awk described above did not work in R (may be obvious)

I have just started R and would like a prog or extension that I could use in R Anyone could redirect me pleease

ADD REPLY
1
Entering edit mode

Awk is one of bash commands, not R.

ADD REPLY
0
Entering edit mode

I tried this for my sequence file , but not working

awk: cmd. line:1: (FILENAME=- FNR=1) fatal: expression for `>>' redirection has null string value
ADD REPLY
1
Entering edit mode

For me it seems that you haven't provide the output file name (required after the>>).

ADD REPLY
0
Entering edit mode

got it. my mistake. Now worked

ADD REPLY
15
Entering edit mode
13.0 years ago

I think you can do it with PyFasta:

split a fasta file into 6 new files of relatively even size:

 $ pyfasta split -n 6 original.fasta
  

If you know the number of sequences in the file (just grep '>' original_fasta | wc -l), you can easily calculate how to split it in <1000 files.

ADD COMMENT
3
Entering edit mode

You can also do this using pyfaidx, with the command:

faidx --split-files original.fasta

Each sequence will be placed in its own file with a filename derived from the sequence identifier.

ADD REPLY
10
Entering edit mode
13.0 years ago
dli ▴ 250

I would like to recommend the faSplit utility from the UCSC Genome Browser suite. However, I didn't found a downloaded binary from their site, but you can compile it yourself.

$ faSplit 
faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.  
Files split by sequence will be broken at the nearest fa record boundary. 
Files split by base will be broken at any base.  
Files broken by size will be broken every count bases.

Examples:
   faSplit sequence estAll.fa 100 est
This will break up estAll.fa into 100 files
(numbered est001.fa est002.fa, ... est100.fa
Files will only be broken at fa record boundaries

   faSplit base chr1.fa 10 1_
This will break up chr1.fa into 10 files

   faSplit size input.fa 2000 outRoot
This breaks up input.fa into 2000 base chunks

   faSplit about est.fa 20000 outRoot
This will break up est.fa into files of about 20000 bytes each by record.

   faSplit byname scaffolds.fa outRoot/ 
This breaks up scaffolds.fa using sequence names as file names.
       Use the terminating / on the outRoot to get it to work correctly.

   faSplit gap chrN.fa 20000 outRoot
This breaks up chrN.fa into files of at most 20000 bases each, 
at gap boundaries if possible.  If the sequence ends in N's, the last
piece, if larger than 20000, will be all one piece.

Options:
    -verbose=2 - Write names of each file created (=3 more details)
    -maxN=N - Suppress pieces with more than maxN n's.  Only used with size.
              default is size-1 (only suppresses pieces that are all N).
    -oneFile - Put output in one file. Only used with size
    -extra=N - Add N extra bytes at the end to form overlapping pieces.  Only used with size.
    -out=outFile Get masking from outfile.  Only used with size.
    -lift=file.lft Put info on how to reconstruct sequence from
                   pieces in file.lft.  Only used with size and gap.
    -minGapSize=X Consider a block of Ns to be a gap if block size >= X.
                  Default value 1000.  Only used with gap.
    -noGapDrops - include all N's when splitting by gap.
    -outDirDepth=N Create N levels of output directory under current dir.
                   This helps prevent NFS problems with a large number of
                   file in a directory.  Using -outDirDepth=3 would
                   produce ./1/2/3/outRoot123.fa.
    -prefixLength=N - used with byname option. create a separate output
                   file for each group of sequences names with same prefix
                   of length N.
ADD COMMENT
6
Entering edit mode
13.0 years ago
toni ★ 2.2k

Supposing that the sequence information is written on a single line (one header + one single sequence line), you could simply have a look at split function under Linux.

split -l 2000 input.fasta

man split for more customization of output files generated.

Tony

ADD COMMENT
7
Entering edit mode

Suppose it isn't :)

ADD REPLY
1
Entering edit mode
ADD REPLY
3
Entering edit mode
6.3 years ago
alephreish ▴ 70

A shorter and more user-friendly awk version without a limit on the number of input fasta records:

awk -v size=1000 -v pre=prefix -v pad=5 '/^>/{n++;if(n%size==1){close(f);f=sprintf("%s.%0"pad"d",pre,n)}}{print>>f}' input.fasta

Command arguments are: size -- chunk size, pre -- output file prefix, pad -- padding width (the width of the numeric suffix).

Ungolfed version for clarity:

awk -v size=1000 -v pre=prefix -v pad=5 '
   /^>/ { n++; if (n % size == 1) { close(fname); fname = sprintf("%s.%0" pad "d", pre, n) } }
   { print >> fname }
' input.fasta
ADD COMMENT
1
Entering edit mode
10.1 years ago
/**
 * This tool aims to chop the file in various parts based on the number of sequences required in one file.
 */
package devtools.utilities;

import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.lang3.StringUtils;

//import java.util.List;

/**
 * @author Arpit
 * 
 */
public class FileChopper {

    public void chopFile(String fileName, int numOfFiles) throws IOException {
        byte[] allBytes = null;
        String outFileName = StringUtils.substringBefore(fileName, ".fasta");
        System.out.println(outFileName);
        try {
            allBytes = Files.readAllBytes(Paths.get(fileName));
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        String allLines = new String(allBytes, StandardCharsets.UTF_8);
        // Using a clever cheat with help from stackoverflow
        String cheatString = allLines.replace(">", "~>");
        cheatString = cheatString.replace("\\s+", "");
        String[] splitLines = StringUtils.split(cheatString, "~");
        int startIndex = 0;
        int stopIndex = 0;

        FileWriter fw = null;
        for (int j = 0; j < numOfFiles; j++) {

            fw = new FileWriter(outFileName.concat("_")
                    .concat(Integer.toString(j)).concat(".fasta"));
            if (j == (numOfFiles - 1)) {
                stopIndex = splitLines.length;
            } else {
                stopIndex = stopIndex + (splitLines.length / numOfFiles);
            }
            for (int i = startIndex; i < stopIndex; i++) {
                fw.write(splitLines[i]);
            }
            if (j < (numOfFiles - 1)) {
                startIndex = stopIndex;
            }
            fw.close();
        }

    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        FileChopper fc = new FileChopper();
        try {
            fc.chopFile("H:\\Projects\\filename.fasta",5);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }

}
ADD COMMENT
0
Entering edit mode

I couldn't find any java code snippets. So I wrote one. This might not be optimized. Please review this so that I can learn. Thank you.

ADD REPLY

Login before adding your answer.

Traffic: 2115 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6