how to convert a long fasta-file into many separate single fasta sequences
7
6
Entering edit mode
9.8 years ago
natasha.sernova ★ 4.0k

Hello, everybody!

I am looking for the procedure reversed to cat.

I would like to separate a long fasta-file into many single fasta sequences, so each sequence from the large fa-file will be situated in a new small file.fa each. I usually used something like the following, but it not as good as it looks like... Where is the mistake?

Many thanks for your help!

N.

sub get_fasta {
  my $inputName = $_[0];
  my $fh;
# new input record separator for fasta-files
  $/ = "\n>";
open ($fh, $inputName) or die "cannot open file: $!\n";
my @entries = <$fh>;
 # remove last character '>' except for the last entry
  chop(@entries[0..$#entries - 1]);
# add '>' except for the 1st entry
  foreach (@entries[1..$#entries]) {$_ = '>'.$_;}
# print"@entries";
close($fh);
  $/="\n";
  return \@entries;
sequence fasta • 37k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Thank you, I've missed really a lot.

Natasha

ADD REPLY
19
Entering edit mode
9.8 years ago

Another option in bash... Possibly more concise and readable...

while read line
do
    if [[ ${line:0:1} == '>' ]]
    then
        outfile=${line#>}.fa
        echo $line > $outfile
    else
        echo $line >> $outfile
    fi
done < myseq.fa

You input file is myseq.fa. Individual files will have the name of the corresponding sequence, without leading >. If your sequence names contain funny characters (non alphanumeric) it will probably break but could be fixed depending exactly how you want to call the output files.

ADD COMMENT
1
Entering edit mode

Hey all, I just tried to use this script and got an error for ambiguous redirect. Any ideas? Thanks in advance.

ADD REPLY
0
Entering edit mode

Hi, placing quotation marks around the $outfile (ie. "$outfile") should fix it. Hope you have figured it out earlier

ADD REPLY
0
Entering edit mode

nice use of bash substitutions.

ADD REPLY
0
Entering edit mode

Perfect! Now it has given even the sub-sequence names from the large fasta as

each small fasta-file name! Really great!

Thousand thanks!

Natasha

ADD REPLY
0
Entering edit mode

one of the best one-liners to split a fasta file.

ADD REPLY
0
Entering edit mode

How could this be modified to: 1. Instead name the output file(s) the same as the input file and perhaps appended with the corresponding sequence name? 2. Run as a loop so that this could be used to split hundreds of different input files (each with different names corresponding to gene names).

for file in $(ls *.fasta)
do
while read line
do
    if [[ ${line:0:1} == '>' ]]
    then
        outfile=$file.${line#>}.fasta
        echo $line > $outfile
    else
        echo $line >> $outfile
    fi
done
done

I can't seem to figure it out.

ADD REPLY
11
Entering edit mode
9.8 years ago

linearize your fasta sequence (two lines per sequence) and use split -l 2

cat *.fasta |\
awk '/^>/ {if(N>0) printf("\n"); printf("%s\n",$0);++N;next;} { printf("%s",$0);} END {printf("\n");}' |\
split -l 2 - seq_
ADD COMMENT
0
Entering edit mode

Dear Pierre,

It's a very nice idea - to kill "cat " with the same weapon! I will definitely try this!

Thank you very much! It did work as well.

Natasha

ADD REPLY
0
Entering edit mode

It works very nice! plus i found that you can add the option --additional-suffix to add the extension you prefer.

cat *.fasta |\
awk '/^>/ {if(N>0) printf("\n"); printf("%s\n",$0);++N;next;} { printf("%s",$0);} END {printf("\n");}' |\
split -l 2 --additional-suffix=.fasta - seq_
ADD REPLY
0
Entering edit mode
This one could spilt a big file into each, but I would like use the header name as each file name, is that possible?
For example: here is a big file:
>OTU1
ATCGATCGATCGTCGATCG
...
>OTU1000
TCGATCGTCGATCGTCGATCGTCGATCG
The output should be like this:
OTU1.fasta
...
OTU1000.fasta
ADD REPLY
0
Entering edit mode
Here is the answer that I got, I hope to help novices:
1:first step to spilt into each file
cat *.fasta |\
awk '/^>/ {if(N>0) printf("\n"); printf("%s\n",$0);++N;next;} { printf("%s",$0);} END {printf("\n");}' |\
split -l 2  - OTU
2:second step to add extension
for file in OTU*
do
 mv "$file" "$file.fasta"
done
3:third step to change the file name to the header name
for i in *.fasta; do 
mv $i $(head -1 $i | cut -f1 -d ' ' | tr -d '>' ).fasta
done
By doing these, I got the output that what I want. Thanks for people who gave me an idea.
ADD REPLY
5
Entering edit mode
9.8 years ago

Reading the whole genome into memory seems like a rather wasteful way to go about things. Just process things like by line (note, this is a random python script I have sitting around, you'd need to edit it):

#!/usr/bin/env python
f=open("genome.fa","r");
opened = False
for line in f :
    if(line[0] == ">") :
        if(opened) :
            of.close()
        opened = True
        of=open("%s.fa" % (line[1:].rstrip()), "w")
        print(line[1:].rstrip())
    of.write(line)
of.close()

You'd obviously need to change the filename. This would also need to be changed if you have descriptors after a contig/chromosome name, but it's enough to give you the idea.

ADD COMMENT
1
Entering edit mode

Thanks for the script Devon! It works well on linux and unix based systems. I edited it with a bit of annotation.

#  GENERAL USAGE: python ./scripts/Divide_Reference.fa_by_Contig.py ./data/reference/reference.fa

import os
from os import path
import sys

infile=open(sys.argv[1])

os.system("cd /mnt/scratch/steepale/birdman/MDV_project/DNAseq/GATK_Variant_Analysis_Germline_DNA") # Add your project directory in here

path = sys.argv[1].split('/')[0] + "/" + sys.argv[1].split('/')[1] + "/" + sys.argv[1].split('/')[2]

ref_file = path + "/" + sys.argv[1].split('/')[3]
print('\n' + "reference file: " + ref_file + '\n')
num_contigs = 'grep "^>" %s  | wc -l' % (ref_file)
print("The number of contigs in reference fasta:")
os.system(num_contigs)

opened = False # Assume outfile is not open

for line_ref in infile:
    if line_ref[0] == ">": # If line begins with ">"
        if(opened): 
            outfile.close() # Will close the outfile if it is open (see below and follow loop)
        opened = True # Set opened to True to represent an opened outfile
        contig_name = line_ref[1:].rstrip() #Extract contig name: remove ">", extract contig string, remove any spaces or new lins following file
        print("contig: " + contig_name)
        print("Name of outfile: " + path + "/" + str(sys.argv[1].split('/')[3].split(".")[0]) + "_" + str(contig_name) + ".fa") 
        outfile=open(path + "/" + str(sys.argv[1].split('/')[3].split(".")[0]) + "_" + str(contig_name) + ".fa", 'w')
    outfile.write(line_ref) # Write the line to the file. If line started with ">" a new output file was created and ready to be written to.
                            # Subsequent lines (AKA the sequence) will also be written as the script loops through lines.
                            # Not until we reach a new line beginning with ">" is the outfile closed and new outfile generated with new name.
outfile.close()
print "Fin"
ADD REPLY
0
Entering edit mode

Looks good, if you really want to go whole hog then add argparse in there :)

ADD REPLY
0
Entering edit mode

Hello!

Unfortunately, I'm very bad in Python. I will try to edit your script, but I am not quite sure of the result.

Natasha

ADD REPLY
0
Entering edit mode

Dear Devon,

I've run your script, it gave me just names of the sequences

in the fasta-file, no sequences at all.

ADD REPLY
0
Entering edit mode

Perhaps you have Windows line endings, in which case open with "rU" rather than just "r". Other than that, what you're describing wouldn't be a possible outcome of what I posted unless the original file also contains only contig/chromosome names.

ADD REPLY
0
Entering edit mode

Dear Devon, I'm really sorry!

It did work in the first time, and I now have a nice list of sepated fasta-files!

Many-many thanks!

Natasha

ADD REPLY
3
Entering edit mode
9.8 years ago

Easy Python script using a package I wrote:

ADD COMMENT
0
Entering edit mode

Great, thank you!

Natasha

ADD REPLY
1
Entering edit mode
9.8 years ago
hpmcwill ★ 1.2k

A couple of options that come to mind:

  • EMBOSS seqretsplit
  • csplit, for example: csplit -f outfile. input.fa '%^>%' '/^>/' '{*}'
ADD COMMENT
1
Entering edit mode
6.9 years ago
iuli.151293 ▴ 10

If you want to separate a multi-fasta file, you can use the above script but you have to delete the fasta header. if you want a faster way, you can use the following script. It allows you to separate each sequence in an individual fasta file and the name of that file will be the first 11 characters after the ">" without deleting the header.

while read line
do
    if [[ ${line:0:1} == '>' ]]
    then
        outfile=${line:1:11}.fa
        echo $line > $outfile
    else
        echo $line >> $outfile
    fi
done < myseq.fa
ADD COMMENT
0
Entering edit mode
7.2 years ago
BioApps ▴ 790

This tool I built will split a multiFasta, SFF, FastQ file in individual OR smaller-block fasta files. The tool works also on Linux via WinE without any drawbacks. It can also convert between different formats and do sequence processing as FastQC.

split mutiple fasta to single/indiviual fasta

ADD COMMENT

Login before adding your answer.

Traffic: 2493 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6