Question: how to convert a long fasta-file into many separate single fasta sequences
5
gravatar for natasha.sernova
4.3 years ago by
natasha.sernova3.1k
natasha.sernova3.1k wrote:

Hello, everybody!

I am looking for the procedure reversed to cat.

I would like to separate a long fasta-file

into many single fasta sequences, so

each sequence from the large fa-file will be situated in a new

small file.fa each. I usually used something like the following,

but it not as good as it looks like... Where is the mistake?

Many thanks for your help!

N.

 

sub get_fasta {

  my $inputName = $_[0];

  my $fh;

# new input record separator for fasta-files

  $/ = "\n>";

open ($fh, $inputName) or die "cannot open file: $!\n";

my @entries = <$fh>;

 # remove last character '>' except for the last entry

  chop(@entries[0..$#entries - 1]);

# add '>' except for the 1st entry

  foreach (@entries[1..$#entries]) {$_ = '>'.$_;}

# print"@entries";

close($fh);

  $/="\n";

  return \@entries;

sequence fasta • 16k views
ADD COMMENTlink modified 17 months ago by iuli.1512930 • written 4.3 years ago by natasha.sernova3.1k

Since you were trying for a Perl solution, this answer will help: Splitting A Fasta File

Here are some more solutions for you in other posts: How To Split A Multiple FastaHow To Split One Big Sequence File Into Multiple Files With Less Than 1000 Sequences In A Single FileSplit Large Fasta Into Mulitple Files, Can'T Name Them With Gi Number

ADD REPLYlink written 4.3 years ago by SES8.1k

Thank you, I've missed really a lot.

Natasha

ADD REPLYlink written 4.3 years ago by natasha.sernova3.1k
16
gravatar for dariober
4.3 years ago by
dariober9.5k
Glasgow - UK
dariober9.5k wrote:

Another option in bash... Possibly more concise and readable...

while read line
do
    if [[ ${line:0:1} == '>' ]]
    then
        outfile=${line#>}.fa
        echo $line > $outfile
    else
        echo $line >> $outfile
    fi
done < myseq.fa

You input file is myseq.fa. Individual files will have the name of the corresponding sequence, without leading >. If your sequence names contain funny characters (non alphanumeric) it will probably break but could be fixed depending exactly how you want to call the output files.

 

 

ADD COMMENTlink written 4.3 years ago by dariober9.5k

nice use of bash substitutions.

ADD REPLYlink written 4.3 years ago by Pierre Lindenbaum113k

Perfect! Now it has given even the sub-sequence names from the large fasta as

each small fasta-file name! Really great!

Thousand thanks!

Natasha

ADD REPLYlink written 4.3 years ago by natasha.sernova3.1k

one of the best one-liners to split a fasta file.

ADD REPLYlink written 8 months ago by Prakki Rama2.2k

How could this be modified to: 1. Instead name the output file(s) the same as the input file and perhaps appended with the corresponding sequence name? 2. Run as a loop so that this could be used to split hundreds of different input files (each with different names corresponding to gene names).

for file in $(ls *.fasta)
do
while read line
do
    if [[ ${line:0:1} == '>' ]]
    then
        outfile=$file.${line#>}.fasta
        echo $line > $outfile
    else
        echo $line >> $outfile
    fi
done
done

I can't seem to figure it out.

ADD REPLYlink modified 6 months ago • written 6 months ago by nmkn0
11
gravatar for Pierre Lindenbaum
4.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum113k wrote:

linearize your fasta sequence (two lines per sequence) and use split -l 2

cat *.fasta |\
awk '/^>/ {if(N>0) printf("\n"); printf("%s\n",$0);++N;next;} { printf("%s",$0);} END {printf("\n");}' |\
split -l 2 - seq_ 
ADD COMMENTlink written 4.3 years ago by Pierre Lindenbaum113k

Dear Pierre,

It's a very nice idea - to kill "cat " with the same weapon! I will definitely try this!

Thank you very much! It did work as well.

Natasha

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by natasha.sernova3.1k

It works very nice! plus i found that you can add the option --additional-suffix to add the extension you prefer.

cat *.fasta |\ awk '/^>/ {if(N>0) printf("\n"); printf("%s\n",$0);++N;next;} { printf("%s",$0);} END {printf("\n");}' |\ split -l 2 --additional-suffix=.fasta - seq_

ADD REPLYlink written 8 months ago by jcsanchez20
4
gravatar for Devon Ryan
4.3 years ago by
Devon Ryan85k
Freiburg, Germany
Devon Ryan85k wrote:

Reading the whole genome into memory seems like a rather wasteful way to go about things. Just process things like by line (note, this is a random python script I have sitting around, you'd need to edit it):

#!/usr/bin/env python
f=open("genome.fa","r");
opened = False
for line in f :
    if(line[0] == ">") :
        if(opened) :
            of.close()
        opened = True
        of=open("%s.fa" % (line[1:].rstrip()), "w")
        print(line[1:].rstrip())
    of.write(line)
of.close()

You'd obviously need to change the filename. This would also need to be changed if you have descriptors after a contig/chromosome name, but it's enough to give you the idea.

ADD COMMENTlink written 4.3 years ago by Devon Ryan85k
1

Thanks for the script Devon! It works well on linux and unix based systems. I edited it with a bit of annotation.

#  GENERAL USAGE: python ./scripts/Divide_Reference.fa_by_Contig.py ./data/reference/reference.fa

import os
from os import path
import sys

infile=open(sys.argv[1])

os.system("cd /mnt/scratch/steepale/birdman/MDV_project/DNAseq/GATK_Variant_Analysis_Germline_DNA") # Add your project directory in here

path = sys.argv[1].split('/')[0] + "/" + sys.argv[1].split('/')[1] + "/" + sys.argv[1].split('/')[2]

ref_file = path + "/" + sys.argv[1].split('/')[3]
print('\n' + "reference file: " + ref_file + '\n')
num_contigs = 'grep "^>" %s  | wc -l' % (ref_file)
print("The number of contigs in reference fasta:")
os.system(num_contigs)

opened = False # Assume outfile is not open

for line_ref in infile:
    if line_ref[0] == ">": # If line begins with ">"
        if(opened): 
            outfile.close() # Will close the outfile if it is open (see below and follow loop)
        opened = True # Set opened to True to represent an opened outfile
        contig_name = line_ref[1:].rstrip() #Extract contig name: remove ">", extract contig string, remove any spaces or new lins following file
        print("contig: " + contig_name)
        print("Name of outfile: " + path + "/" + str(sys.argv[1].split('/')[3].split(".")[0]) + "_" + str(contig_name) + ".fa") 
        outfile=open(path + "/" + str(sys.argv[1].split('/')[3].split(".")[0]) + "_" + str(contig_name) + ".fa", 'w')
    outfile.write(line_ref) # Write the line to the file. If line started with ">" a new output file was created and ready to be written to.
                            # Subsequent lines (AKA the sequence) will also be written as the script loops through lines.
                            # Not until we reach a new line beginning with ">" is the outfile closed and new outfile generated with new name.
outfile.close()
print "Fin"
ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by alec.steep10

Looks good, if you really want to go whole hog then add argparse in there :)

ADD REPLYlink written 2.4 years ago by Devon Ryan85k

Hello! Unfortunately, I'm very bad in Python. I will try to edit your script, but I am not quite sure of the result. Natasha

Thu, 3 Jul 2014 17:03:33 +0000 от "Devon Ryan on Biostar" <notifications@biostars.org>: f=open("genome.fa","r"); opened = False for line in f : if(line[0] == ">") : if(opened) : of.close() opened = True of=open("%s.fa" % (line[1:].rstrip()), "w") print(line[1:].rstrip()) of.write(line) of.close()

ADD REPLYlink written 4.3 years ago by natasha.sernova3.1k

Dear Devon,

I've run your script, it gave me just names of the sequences

in the fasta-file, no sequences at all.

 

ADD REPLYlink written 4.3 years ago by natasha.sernova3.1k

Perhaps you have Windows line endings, in which case open with "rU" rather than just "r". Other than that, what you're describing wouldn't be a possible outcome of what I posted unless the original file also contains only contig/chromosome names.

ADD REPLYlink written 4.3 years ago by Devon Ryan85k

Dear Devon, I'm really sorry!

It did work in the first time, and I now have a nice list of sepated fasta-files!

Many-many thanks!

Natasha

ADD REPLYlink written 4.3 years ago by natasha.sernova3.1k
3
gravatar for Matt Shirley
4.3 years ago by
Matt Shirley8.6k
Cambridge, MA
Matt Shirley8.6k wrote:

Easy Python script using a package I wrote:

ADD COMMENTlink written 4.3 years ago by Matt Shirley8.6k

Great, thank you!

Natasha

ADD REPLYlink written 4.3 years ago by natasha.sernova3.1k
1
gravatar for hpmcwill
4.3 years ago by
hpmcwill1.1k
United Kingdom
hpmcwill1.1k wrote:

A couple of options that come to mind:

  • EMBOSS seqretsplit
  • csplit, for example: csplit -f outfile. input.fa '%^>%' '/^>/' '{*}'
ADD COMMENTlink written 4.3 years ago by hpmcwill1.1k
0
gravatar for BioApps
20 months ago by
BioApps720
Spain
BioApps720 wrote:

This tool I built will split a multiFasta, SFF, FastQ file in individual OR smaller-block fasta files. The tool works also on Linux via WinE without any drawbacks. It can also convert between different formats and do sequence processing as FastQC.

split mutiple fasta to single/indiviual fasta

ADD COMMENTlink modified 20 months ago • written 20 months ago by BioApps720
0
gravatar for iuli.151293
17 months ago by
iuli.1512930 wrote:

If you want to separate a multi-fasta file, you can use the above script but you have to delete the fasta header. if you want a faster way, you can use the following script. It allows you to separate each sequence in an individual fasta file and the name of that file will be the first 11 characters after the ">" without deleting the header.

while read line
do
    if [[ ${line:0:1} == '>' ]]
    then
        outfile=${line:1:11}.fa
        echo $line > $outfile
    else
        echo $line >> $outfile
    fi
done < myseq.fa
ADD COMMENTlink written 17 months ago by iuli.1512930
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1768 users visited in the last hour