how to convert a long fasta-file into many separate single fasta sequences
7
5
Entering edit mode
7.2 years ago
natasha.sernova ★ 3.9k

Hello, everybody!

I am looking for the procedure reversed to cat.

I would like to separate a long fasta-file

into many single fasta sequences, so

each sequence from the large fa-file will be situated in a new

small file.fa each. I usually used something like the following,

but it not as good as it looks like... Where is the mistake?

Many thanks for your help!

N.

sub get_fasta {

  my $inputName =$_[0];

  my $fh; # new input record separator for fasta-files $/ = "\n>";

open ($fh,$inputName) or die "cannot open file: $!\n"; my @entries = <$fh>;

 # remove last character '>' except for the last entry

  chop(@entries[0..$#entries - 1]); # add '>' except for the 1st entry  foreach (@entries[1..$#entries]) {$_ = '>'.$_;}

# print"@entries";

close($fh); $/="\n";

  return \@entries;

sequence fasta • 27k views
0
Entering edit mode

Since you were trying for a Perl solution, this answer will help: Splitting A Fasta File

Here are some more solutions for you in other posts: Split Large Fasta Into Mulitple Files, Can'T Name Them With Gi Number

0
Entering edit mode

Thank you, I've missed really a lot.

Natasha

19
Entering edit mode
7.2 years ago

Another option in bash... Possibly more concise and readable...

while read line
do
if [[ ${line:0:1} == '>' ]] then outfile=${line#>}.fa
echo $line >$outfile
else
echo $line >>$outfile
fi
done < myseq.fa


You input file is myseq.fa. Individual files will have the name of the corresponding sequence, without leading >. If your sequence names contain funny characters (non alphanumeric) it will probably break but could be fixed depending exactly how you want to call the output files.

1
Entering edit mode

Hey all, I just tried to use this script and got an error for ambiguous redirect. Any ideas? Thanks in advance.

0
Entering edit mode

Hi, placing quotation marks around the $outfile (ie. "$outfile") should fix it. Hope you have figured it out earlier

0
Entering edit mode

nice use of bash substitutions.

0
Entering edit mode

Perfect! Now it has given even the sub-sequence names from the large fasta as

each small fasta-file name! Really great!

Thousand thanks!

Natasha

0
Entering edit mode

one of the best one-liners to split a fasta file.

0
Entering edit mode

How could this be modified to: 1. Instead name the output file(s) the same as the input file and perhaps appended with the corresponding sequence name? 2. Run as a loop so that this could be used to split hundreds of different input files (each with different names corresponding to gene names).

for file in $(ls *.fasta) do while read line do if [[${line:0:1} == '>' ]]
then
outfile=$file.${line#>}.fasta
echo $line >$outfile
else
echo $line >>$outfile
fi
done
done


I can't seem to figure it out.

11
Entering edit mode
7.2 years ago

linearize your fasta sequence (two lines per sequence) and use split -l 2

cat *.fasta |\
awk '/^>/ {if(N>0) printf("\n"); printf("%s\n",$0);++N;next;} { printf("%s",$0);} END {printf("\n");}' |\
split -l 2 - seq_

0
Entering edit mode

Dear Pierre,

It's a very nice idea - to kill "cat " with the same weapon! I will definitely try this!

Thank you very much! It did work as well.

Natasha

0
Entering edit mode

It works very nice! plus i found that you can add the option --additional-suffix to add the extension you prefer.

cat *.fasta |\
awk '/^>/ {if(N>0) printf("\n"); printf("%s\n",$0);++N;next;} { printf("%s",$0);} END {printf("\n");}' |\
split -l 2 --additional-suffix=.fasta - seq_

0
Entering edit mode
This one could spilt a big file into each, but I would like use the header name as each file name, is that possible?
For example: here is a big file:
>OTU1
ATCGATCGATCGTCGATCG
...
>OTU1000
TCGATCGTCGATCGTCGATCGTCGATCG
The output should be like this:
OTU1.fasta
...
OTU1000.fasta

0
Entering edit mode
Here is the answer that I got, I hope to help novices:
1:first step to spilt into each file
cat *.fasta |\
awk '/^>/ {if(N>0) printf("\n"); printf("%s\n",$0);++N;next;} { printf("%s",$0);} END {printf("\n");}' |\
split -l 2  - OTU
2:second step to add extension
for file in OTU*
do
mv "$file" "$file.fasta"
done
3:third step to change the file name to the header name
for i in *.fasta; do
mv $i$(head -1 $i | cut -f1 -d ' ' | tr -d '>' ).fasta done By doing these, I got the output that what I want. Thanks for people who gave me an idea.  ADD REPLY 5 Entering edit mode 7.2 years ago Reading the whole genome into memory seems like a rather wasteful way to go about things. Just process things like by line (note, this is a random python script I have sitting around, you'd need to edit it): #!/usr/bin/env python f=open("genome.fa","r"); opened = False for line in f : if(line[0] == ">") : if(opened) : of.close() opened = True of=open("%s.fa" % (line[1:].rstrip()), "w") print(line[1:].rstrip()) of.write(line) of.close()  You'd obviously need to change the filename. This would also need to be changed if you have descriptors after a contig/chromosome name, but it's enough to give you the idea. ADD COMMENT 1 Entering edit mode Thanks for the script Devon! It works well on linux and unix based systems. I edited it with a bit of annotation. # GENERAL USAGE: python ./scripts/Divide_Reference.fa_by_Contig.py ./data/reference/reference.fa import os from os import path import sys infile=open(sys.argv[1]) os.system("cd /mnt/scratch/steepale/birdman/MDV_project/DNAseq/GATK_Variant_Analysis_Germline_DNA") # Add your project directory in here path = sys.argv[1].split('/')[0] + "/" + sys.argv[1].split('/')[1] + "/" + sys.argv[1].split('/')[2] ref_file = path + "/" + sys.argv[1].split('/')[3] print('\n' + "reference file: " + ref_file + '\n') num_contigs = 'grep "^>" %s | wc -l' % (ref_file) print("The number of contigs in reference fasta:") os.system(num_contigs) opened = False # Assume outfile is not open for line_ref in infile: if line_ref[0] == ">": # If line begins with ">" if(opened): outfile.close() # Will close the outfile if it is open (see below and follow loop) opened = True # Set opened to True to represent an opened outfile contig_name = line_ref[1:].rstrip() #Extract contig name: remove ">", extract contig string, remove any spaces or new lins following file print("contig: " + contig_name) print("Name of outfile: " + path + "/" + str(sys.argv[1].split('/')[3].split(".")[0]) + "_" + str(contig_name) + ".fa") outfile=open(path + "/" + str(sys.argv[1].split('/')[3].split(".")[0]) + "_" + str(contig_name) + ".fa", 'w') outfile.write(line_ref) # Write the line to the file. If line started with ">" a new output file was created and ready to be written to. # Subsequent lines (AKA the sequence) will also be written as the script loops through lines. # Not until we reach a new line beginning with ">" is the outfile closed and new outfile generated with new name. outfile.close() print "Fin"  ADD REPLY 0 Entering edit mode Looks good, if you really want to go whole hog then add argparse in there :) ADD REPLY 0 Entering edit mode Hello! Unfortunately, I'm very bad in Python. I will try to edit your script, but I am not quite sure of the result. Natasha ADD REPLY 0 Entering edit mode Dear Devon, I've run your script, it gave me just names of the sequences in the fasta-file, no sequences at all. ADD REPLY 0 Entering edit mode Perhaps you have Windows line endings, in which case open with "rU" rather than just "r". Other than that, what you're describing wouldn't be a possible outcome of what I posted unless the original file also contains only contig/chromosome names. ADD REPLY 0 Entering edit mode Dear Devon, I'm really sorry! It did work in the first time, and I now have a nice list of sepated fasta-files! Many-many thanks! Natasha ADD REPLY 3 Entering edit mode 7.2 years ago Easy Python script using a package I wrote: ADD COMMENT 0 Entering edit mode Great, thank you! Natasha ADD REPLY 1 Entering edit mode 7.2 years ago hpmcwill ★ 1.2k A couple of options that come to mind: • EMBOSS seqretsplit • csplit, for example: csplit -f outfile. input.fa '%^>%' '/^>/' '{*}' ADD COMMENT 1 Entering edit mode 4.4 years ago iuli.151293 ▴ 10 If you want to separate a multi-fasta file, you can use the above script but you have to delete the fasta header. if you want a faster way, you can use the following script. It allows you to separate each sequence in an individual fasta file and the name of that file will be the first 11 characters after the ">" without deleting the header. while read line do if [[${line:0:1} == '>' ]]
then
outfile=${line:1:11}.fa echo$line > $outfile else echo$line >> \$outfile
fi
done < myseq.fa

0
Entering edit mode
4.6 years ago
BioApps ▴ 780

This tool I built will split a multiFasta, SFF, FastQ file in individual OR smaller-block fasta files. The tool works also on Linux via WinE without any drawbacks. It can also convert between different formats and do sequence processing as FastQC.