Question: How To Split One Big Sequence File Into Multiple Files With Less Than 1000 Sequences In A Single File
Hamilton260 wrote:

Hi,

I am quite new to this, so sorry in advance for the trivial question.

I am trying to split one big FASTA sequence file into multiple files with fewer than 1000 sequences in each file.

How can I generate them?

Any input would be much appreciated.

fasta next-gen sequencing perl R rna • 47k views
modified 14 months ago by har-wradim20 • written 7.8 years ago by Hamilton260
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:

using awk:

awk 'BEGIN {n_seq=0;} /^>/ {if(n_seq%1000==0){file=sprintf("myseq%d.fa",n_seq);} print >> file; n_seq++; next;} { print >> file; }' < sequences.fa
written 7.8 years ago by Pierre Lindenbaum122k

Works fine, but if the input file is large enough it prints an error: awk: cannot open "myseq25525.fa" for output (Too many open files). In my case, the workaround was to first split the large file (more than 25,000 sequences) into a few smaller ones.

written 7.7 years ago by boczniak767640

It works well, but when I split my file (60,000 sequences) into files of 25 sequences each, I get the error awk: cannot open "myseq25525.fa" for output (Too many open files). I can work around it by first splitting my large file into smaller parts.

written 7.7 years ago by boczniak767640
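
The "Too many open files" error above comes from awk keeping every file it has written to open until the program ends, so a small chunk size on a large input runs into the per-process limit on open files. A minimal sketch of one way around it, assuming a POSIX awk: call close() on the previous chunk before starting the next one. The function name split_fasta_closing and its three arguments are made up for this illustration and are not part of the original answer.

```shell
# Hypothetical helper: split FASTA file $1 into chunks of $2 sequences,
# written to <$3><index of first sequence>.fa in the current directory.
split_fasta_closing() {
    awk -v size="$2" -v prefix="$3" '
        /^>/ {
            if (n_seq % size == 0) {
                # Release the finished chunk so only one file is open at a time.
                if (file != "") close(file)
                file = sprintf("%s%d.fa", prefix, n_seq)
            }
            n_seq++
        }
        # Lines before the first header are skipped because no file is set yet.
        file != "" { print > file }
    ' "$1"
}
```

For example, split_fasta_closing sequences.fa 25 myseq would write myseq0.fa, myseq25.fa, and so on. Note that this sketch uses > rather than >>, so chunks left over from an earlier run are overwritten instead of appended to.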

Hi,

How can I change the output file names from myseq0000.fa, myseq1000.fa, ... so that they include the name of the input FASTA file?

E.g.,

Input fasta file name: ABC.fa

Output file names: ABC_myseq0000.fa

Your input would be much appreciated.

Vijay

written 4.4 years ago by vijay.bhaaskarla0
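
One possible way to get names like ABC_myseq0.fa is to pass the input file as an argument instead of redirecting it with <, so that awk's built-in FILENAME variable is set, and then strip the directory and extension from it. This is only a sketch under that assumption; the function name split_fasta_named is hypothetical, and the numeric part is not zero-padded (the first chunk is ABC_myseq0.fa, not ABC_myseq0000.fa).

```shell
# Hypothetical helper: split FASTA file $1 into chunks of $2 sequences
# (default 1000), named <input name without extension>_myseq<index>.fa.
split_fasta_named() {
    awk -v size="${2:-1000}" '
        FNR == 1 {
            base = FILENAME
            sub(/^.*\//, "", base)       # drop any leading directory
            sub(/\.[^.]*$/, "", base)    # drop the extension, e.g. ".fa"
        }
        /^>/ {
            if (n_seq % size == 0) {
                if (file != "") close(file)
                file = sprintf("%s_myseq%d.fa", base, n_seq)
            }
            n_seq++
        }
        file != "" { print > file }
    ' "$1"
}
```

With ABC.fa as input, split_fasta_named ABC.fa 1000 would produce ABC_myseq0.fa, ABC_myseq1000.fa, and so on in the current directory.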

Thank you, this is a very easy way of splitting a big FASTA file. :)

written 4.1 years ago by harpreetmanku0410

Hello, is there an awk command for extracting the upstream and downstream regions from a sequence?

written 4.1 years ago by harpreetmanku0410

I am sorry to ask what is certainly a silly question, but I am trying to do this with a FASTA file containing 300,000 contigs, which I want to separate into 6 files of less than 50,000 sequences each. I made a test.fa file containing 15 sequences to try it on, but the awk command described above did not work in R (this may be obvious).

I have just started using R and would like a program or extension that I could use from R. Could anyone point me in the right direction, please?

modified 3.2 years ago • written 3.2 years ago by alexandra.lanot0

awk is a command-line program that you run from a shell such as bash, not from within R.

written 3.2 years ago by boczniak767640

I tried this on my sequence file, but it is not working:

awk: cmd. line:1: (FILENAME=- FNR=1) fatal: expression for `>>' redirection has null string value
written 2.6 years ago by tcf.hcdg60

It seems to me that you haven't provided the output file name (required after the >>).

written 2.6 years ago by boczniak767640

Got it, my mistake. It works now.

written 2.6 years ago by tcf.hcdg60
London, UK
Giovanni M Dall'Olio26k wrote:

I think you can do it with PyFasta:

Split a FASTA file into 6 new files of relatively even size:

$ pyfasta split -n 6 original.fasta

If you know the number of sequences in the file (just run grep -c '>' original.fasta), you can easily calculate how many files you need so that each one holds fewer than 1000 sequences.

modified 7.8 years ago • written 7.8 years ago by Giovanni M Dall'Olio26k
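
The calculation mentioned above is an integer ceiling division of the sequence count by the largest allowed chunk size. A small sketch, assuming a POSIX shell; the function name fasta_chunks_needed and its default of 999 (that is, fewer than 1000) are choices made for this example only.

```shell
# Hypothetical helper: print how many output files are needed so that an
# even split of FASTA file $1 puts at most $2 sequences (default 999) in each.
fasta_chunks_needed() {
    n_seq=$(grep -c '^>' "$1")            # one header line per sequence
    max=${2:-999}
    echo $(( (n_seq + max - 1) / max ))   # ceiling of n_seq / max
}
```

The result could then be passed on, for instance as pyfasta split -n "$(fasta_chunks_needed original.fasta)" original.fasta. Since the answer describes the resulting files as being of relatively even size, the per-file counts may not be exactly equal, so it is worth checking them afterwards with grep -c '^>'.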

You can also do this using pyfaidx, with the command:

faidx --split-files original.fasta

Each sequence will be placed in its own file with a filename derived from the sequence identifier.

written 4.0 years ago by Matt Shirley9.1k
WUSTL
dli220 wrote:

I would like to recommend the faSplit utility from the UCSC Genome Browser suite. I couldn't find a precompiled binary on their download site (http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/), but you can compile it yourself.

$ faSplit 
faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.  
Files split by sequence will be broken at the nearest fa record boundary. 
Files split by base will be broken at any base.  
Files broken by size will be broken every count bases.

Examples:
   faSplit sequence estAll.fa 100 est
This will break up estAll.fa into 100 files
(numbered est001.fa est002.fa, ... est100.fa
Files will only be broken at fa record boundaries

   faSplit base chr1.fa 10 1_
This will break up chr1.fa into 10 files

   faSplit size input.fa 2000 outRoot
This breaks up input.fa into 2000 base chunks

   faSplit about est.fa 20000 outRoot
This will break up est.fa into files of about 20000 bytes each by record.

   faSplit byname scaffolds.fa outRoot/ 
This breaks up scaffolds.fa using sequence names as file names.
       Use the terminating / on the outRoot to get it to work correctly.

   faSplit gap chrN.fa 20000 outRoot
This breaks up chrN.fa into files of at most 20000 bases each, 
at gap boundaries if possible.  If the sequence ends in N's, the last
piece, if larger than 20000, will be all one piece.

Options:
    -verbose=2 - Write names of each file created (=3 more details)
    -maxN=N - Suppress pieces with more than maxN n's.  Only used with size.
              default is size-1 (only suppresses pieces that are all N).
    -oneFile - Put output in one file. Only used with size
    -extra=N - Add N extra bytes at the end to form overlapping pieces.  Only used with size.
    -out=outFile Get masking from outfile.  Only used with size.
    -lift=file.lft Put info on how to reconstruct sequence from
                   pieces in file.lft.  Only used with size and gap.
    -minGapSize=X Consider a block of Ns to be a gap if block size >= X.
                  Default value 1000.  Only used with gap.
    -noGapDrops - include all N's when splitting by gap.
    -outDirDepth=N Create N levels of output directory under current dir.
                   This helps prevent NFS problems with a large number of
                   file in a directory.  Using -outDirDepth=3 would
                   produce ./1/2/3/outRoot123.fa.
    -prefixLength=N - used with byname option. create a separate output
                   file for each group of sequences names with same prefix
                   of length N.
modified 7.3 years ago • written 7.8 years ago by dli220
Lyon
toni2.1k wrote:

Assuming each sequence is written on a single line (one header line plus one sequence line per record), you could simply use the split command on Linux.

split -l 2000 input.fasta

See 'man split' for more ways to customize the generated output files.

Tony

written 7.8 years ago by toni2.1k

Suppose it isn't :)

written 5.1 years ago by Neilfws48k

Multi-line to single line fasta

written 3.0 years ago by st.ph.n2.5k
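
For multi-line FASTA, the records can first be rewritten so that every sequence sits on one line, after which split -l works as described. A minimal sketch, assuming Unix line endings and at least one sequence line per record; the function name fasta_linearize is invented for this example.

```shell
# Hypothetical helper: print FASTA file $1 with each sequence joined
# onto a single line (header line, then one sequence line).
fasta_linearize() {
    awk '
        /^>/ {
            if (seen) printf "\n"    # finish the previous sequence line
            print                    # header line, unchanged
            seen = 1
            next
        }
        { printf "%s", $0 }          # append sequence text without a newline
        END { if (NR > 0) printf "\n" }
    ' "$1"
}
```

It can be piped straight into split, for example fasta_linearize input.fasta | split -l 1998 - chunk_, where 1998 lines correspond to 999 sequences per file (2000 lines would give exactly 1000).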
India
arpit.singh120360 wrote:
/**
 * This tool chops a FASTA file into a given number of files,
 * each holding roughly the same number of sequences.
 */
package devtools.utilities;

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/**
 * @author Arpit
 * 
 */
public class FileChopper {

    public void chopFile(String fileName, int numOfFiles) throws IOException {
        if (numOfFiles < 1) {
            throw new IllegalArgumentException("numOfFiles must be at least 1");
        }
        // Output files are named <input name without ".fasta">_<index>.fasta
        int extIndex = fileName.indexOf(".fasta");
        String outFileName = (extIndex >= 0) ? fileName.substring(0, extIndex) : fileName;
        System.out.println(outFileName);

        // Reads the whole file into memory; a read failure is reported to the
        // caller as an IOException.
        byte[] allBytes = Files.readAllBytes(Paths.get(fileName));
        String allLines = new String(allBytes, StandardCharsets.UTF_8);

        // Split in front of every header line, so that each element is one
        // complete record (header plus sequence lines, line breaks kept).
        // Relies on Java 8+ behaviour: no empty leading element is produced
        // for a zero-width match at the start of the input.
        String[] splitLines = allLines.split("(?m)^(?=>)");
        int seqsPerFile = splitLines.length / numOfFiles;
        int startIndex = 0;

        for (int j = 0; j < numOfFiles; j++) {
            // The last file also takes the remainder of the division.
            int stopIndex = (j == numOfFiles - 1) ? splitLines.length : startIndex + seqsPerFile;
            String chunkName = outFileName + "_" + j + ".fasta";
            // try-with-resources closes the writer even if a write fails.
            try (BufferedWriter fw = Files.newBufferedWriter(Paths.get(chunkName), StandardCharsets.UTF_8)) {
                for (int i = startIndex; i < stopIndex; i++) {
                    fw.write(splitLines[i]);
                }
            }
            startIndex = stopIndex;
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        FileChopper fc = new FileChopper();
        try {
            fc.chopFile("H:\\Projects\\filename.fasta", 5);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}
modified 4.9 years ago • written 4.9 years ago by arpit.singh120360

I couldn't find any Java code snippets, so I wrote one. It might not be optimized. Please review it so that I can learn. Thank you.

written 4.9 years ago by arpit.singh120360
Germany
har-wradim20 wrote:

A shorter and more user-friendly awk version without a limit on the number of input fasta records:

awk -v size=1000 -v pre=prefix -v pad=5 '/^>/{n++;if(n%size==1){close(f);f=sprintf("%s.%0"pad"d",pre,n)}}{print>>f}' input.fasta

Command arguments are: size -- chunk size, pre -- output file prefix, pad -- padding width (the width of the numeric suffix).

Ungolfed version for clarity:

awk -v size=1000 -v pre=prefix -v pad=5 '
   /^>/ { n++; if (n % size == 1) { close(fname); fname = sprintf("%s.%0" pad "d", pre, n) } }
   { print >> fname }
' input.fasta
written 14 months ago by har-wradim20
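
Whichever command produced the chunks, it is cheap to confirm afterwards that none of them exceeds the intended size. A small sketch, assuming the output files of the command above match prefix.*; the function name fasta_chunks_ok is hypothetical.

```shell
# Hypothetical helper: succeed only if every file listed after the limit
# ($1) contains at most that many FASTA records.
fasta_chunks_ok() {
    limit=$1
    shift
    for f in "$@"; do
        n=$(grep -c '^>' "$f")
        if [ "$n" -gt "$limit" ]; then
            echo "$f has $n sequences (limit $limit)" >&2
            return 1
        fi
    done
    return 0
}
```

For the original question this would be run as fasta_chunks_ok 999 prefix.* once the split has finished.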