Split a multi-sequence file into separate files
6.6 years ago
skjobs1234 ▴ 40

I have a file containing multiple gene sequences. I want to split it into separate files, one per gene ID, using a Perl or Python script.

sequence • 9.3k views

Can you show an example?


Search this forum for any regular question. These were answered many times. (You'll get used to this if you are a new user)

Here is one of many solutions on this forum: How To Split A Multiple Fasta

BTW, it is a question, not a tutorial.

  1. Have you searched the forum for similar questions?
  2. Why Perl or Python specifically? Why not awk or other unix tools?
6.6 years ago
Joe 21k

As others have mentioned this is answered a lot on the forum. But I can't help myself when it comes to trying to make bash do this sort of thing (sequences will have to be linearised).

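#!/bin/bash
# Split a linearised multi-FASTA (one header line, then one sequence line
# per record) into seq1.fasta, seq2.fasta, ...

i=1
while IFS= read -r line ; do
  # Every line belongs to the current output file.
  printf '%s\n' "$line" >> "seq${i}.fasta"
  # A non-header line ends the record, so move on to the next file.
  if [[ ${line:0:1} != ">" ]] ; then
    ((i++))
  fi
done < "$1"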

Usage:

$ bash splitfasta.sh multifasta.fasta

Disclaimer:

You should always use a proper parser though (like biopython) as it'll catch many of the special cases. My code just has the bonus of not requiring anything to be installed to run.


A quick awk for linearizing the sequences:

awk '$0~/^>/{if(NR>1){print sequence;sequence=""}print $0}$0!~/^>/{sequence=sequence""$0}END{print sequence}' "$1"
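
For anyone who would rather do the linearising step in Python, here is a rough equivalent of the awk one-liner using only the standard library. The function name `linearise` is my own invention, and the sketch assumes plain FASTA where headers start with ">"; blank lines are skipped and anything before the first header is ignored.

```python
def linearise(lines):
    """Yield each header unchanged, followed by its sequence joined onto one line."""
    chunks = []          # sequence fragments of the current record
    seen_header = False  # nothing is emitted until the first ">" line
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if seen_header:
                # A new header closes the previous record.
                yield "".join(chunks)
            chunks = []
            seen_header = True
            yield line
        elif line:
            chunks.append(line)
    if seen_header:
        # Flush the sequence of the last record.
        yield "".join(chunks)


# Example use on a file (the path is a placeholder):
# with open("multifasta.fasta") as handle:
#     for out_line in linearise(handle):
#         print(out_line)
```

The output of this can then be fed to the bash loop above, which expects exactly one sequence line per header.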
6.6 years ago
Renesh ★ 2.2k

This is a Python script for splitting a multi-FASTA file into individual files, one per sequence.

from Bio import SeqIO
import argparse

parser = argparse.ArgumentParser(description="Split a FASTA file into individual files, one per gene sequence")
parser.add_argument('-f', action='store', dest='fasta_file', required=True, help='Input FASTA file')
result = parser.parse_args()

# SeqIO.parse accepts a filename directly, so the deprecated "rU" mode
# (removed in Python 3.11) is not needed.
for rec in SeqIO.parse(result.fasta_file, "fasta"):
    # Each record goes into a file named after its ID
    # (the header text up to the first whitespace).
    with open(rec.id, "w") as id_file:
        id_file.write(">" + rec.id + "\n" + str(rec.seq) + "\n")

To run the code above, save it as code.py and run:

python code.py -f fasta_file

Note: You need Biopython installed (it provides the SeqIO module) to run this code.
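
One special case to watch for when naming files after IDs: headers such as `>sp|P12345|NAME` contain characters that are awkward or invalid in file names on some systems. Below is a minimal sketch that works without Biopython and replaces such characters with underscores. The helper names `safe_name` and `split_fasta` and the `.fasta` suffix are my own choices, not part of any library, and the sketch assumes IDs are unique (a repeated ID would overwrite the earlier file).

```python
import os
import re


def safe_name(record_id):
    """Replace characters that are risky in file names with underscores."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", record_id)


def split_fasta(path, out_dir="."):
    """Write each record to out_dir/<sanitised id>.fasta and return the paths."""
    written = []
    out = None  # handle of the record currently being written
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                # The ID is the header text up to the first whitespace.
                fields = line[1:].split()
                record_id = fields[0] if fields else "unnamed"
                target = os.path.join(out_dir, safe_name(record_id) + ".fasta")
                out = open(target, "w")
                written.append(target)
                out.write(line)
            elif out is not None:
                # Sequence lines are copied as-is, so multi-line records work.
                out.write(line)
    if out is not None:
        out.close()
    return written
```

Calling `split_fasta("multifasta.fasta", "out")` (placeholder names, and the `out` directory must already exist) would leave one file per record in that directory. For anything beyond plain FASTA, a proper parser like Biopython is still the safer choice.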


You should add a note: this script requires the Biopython module to be installed.


Hello,

I tried to use this code, but Pyzo kept giving me an error message. Any solution? Thanks!

Running script: "C:\Users\14805\Desktop\python test\New folder\Splitthefastafile.py"
C:\Users\14805\Desktop\python test\New folder\Splitthefastafile.py:9: DeprecationWarning: 'U' mode is deprecated
  f_open = open('result.fasta_file', "rU")
Traceback (most recent call last):
  File "C:\Users\14805\Desktop\python test\New folder\Splitthefastafile.py", line 9, in <module>
    f_open = open('result.fasta_file', "rU")
FileNotFoundError: [Errno 2] No such file or directory: 'result.fasta_file'

Either you haven't told it where the target FASTA file is, or you have the path/filename wrong, so it cannot find it.

Also, don't use spaces in file/folder names.
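
Looking at the traceback, the call reads `open('result.fasta_file', "rU")`, with quotes around `result.fasta_file`. That makes it a literal file name instead of the value parsed from `-f`. Here is a small sketch of the difference; `genes.fasta` is a made-up file name, and passing a list to `parse_args` just simulates the command line.

```python
import argparse

parser = argparse.ArgumentParser(description="Show how the -f argument is read")
parser.add_argument("-f", dest="fasta_file", required=True, help="Input FASTA file")

# Simulates running: python code.py -f genes.fasta
# (genes.fasta is a hypothetical file name used only for illustration).
result = parser.parse_args(["-f", "genes.fasta"])

wrong = "result.fasta_file"  # a literal string: Python looks for a file with this exact name
right = result.fasta_file    # the attribute set by argparse: the path given after -f

print(wrong)
print(right)
```

So the script should pass `result.fasta_file` without quotes, and it needs to be started with `-f` followed by the path to the FASTA file. When running from an IDE rather than a terminal, check how it passes command-line arguments to the script.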
