Question: Extracting fasta sequences with identical header
0
gravatar for seta
2.6 years ago by
seta1.2k
Sweden
seta1.2k wrote:

Hi all friends,

I have a large fasta file that most sequences have a identical header (they differ from the length). I usually extracted the sequences of interest with the following python codes:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Extract sequences from a fasta file if their name is in a 'wanted' file.

Wanted file contains one sequence name per line.

Usage:
%program <input_file> <wanted_file> <output_file>"""

import sys
import re

try:
from Bio import SeqIO
except:
print "This program requires the Biopython library"
sys.exit(0)

try:
fasta_file = sys.argv[1]  # Input fasta file
wanted_file = sys.argv[2] # Input wanted file, one gene name per line
result_file = sys.argv[3] # Output fasta file
except:
print __doc__
sys.exit(0)

wanted = set()
with open(wanted_file) as f:
for line in f:
    line = line.strip()
    if line != "":
        wanted.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

with open(result_file, "w") as f:
for seq in fasta_sequences:
    name = seq.id
    if name in wanted and len(seq.seq.tostring()) > 0:
        wanted.remove(name) # Output only the first appearance for a name
        SeqIO.write([seq], f, "fasta")

But the code just extracted one of the sequences with identical header, while I want to all sequences of interest that may have identical headers and adding the sequential number at the beginning or end of identical headers in the extracted fasta file? something like:

header_1
header_2
header_3
etc.

Many thanks

ADD COMMENTlink modified 2.6 years ago by Brian Bushnell16k • written 2.6 years ago by seta1.2k
3
gravatar for a.zielezinski
2.6 years ago by
a.zielezinski8.9k
a.zielezinski8.9k wrote:

I didn't test it:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Extract sequences from a fasta file if their name is in a 'wanted' file.

Wanted file contains one sequence name per line.

Usage:
%program <input_file> <wanted_file> <output_file>"""

import sys
import re

try:
    from Bio import SeqIO
except:
    print("This program requires the Biopython library")
    sys.exit(0)

try:
    fasta_file = sys.argv[1]  # Input fasta file
    wanted_file = sys.argv[2] # Input wanted file, one gene name per line
    result_file = sys.argv[3] # Output fasta file
except:
    print(__doc__)
    sys.exit(0)

wanted = {} # CHANGE
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted[line] = 0 # CHANGE

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted and len(seq.seq.tostring()) > 0:
            wanted[seq.id] += 1     # CHANGE
            seq.id += '_{0}'.format(wanted[seq.id]) # CHANGE
            #wanted.remove(name) # Output only the first appearance for a name
            SeqIO.write([seq], f, "fasta")
ADD COMMENTlink modified 11 weeks ago • written 2.6 years ago by a.zielezinski8.9k

Thanks, I name it py_ext.py, but it retutned the below error:

File "py_ext.py", line 21
    fasta_file = sys.argv[1] # Input fasta file
             ^
IndentationError: expected an indented block

I tried to convert tab to space, but the error still remain. Would you please check the code?

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by seta1.2k
1

I edited my answer. The code you pasted in your question has the problem with indentations (no idents after try: or with:) so it doesn't work at all. I would test my code if you provide some sample lines of your input files.

ADD REPLYlink written 2.6 years ago by a.zielezinski8.9k
1

Thank you very much. It worked well and I accepted it.

ADD REPLYlink written 2.6 years ago by seta1.2k
File "extractentr.py", line 17
    print "This program requires the Biopython library"
                                                      ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("This program requires the Biopython library")?

and after changing it

 File "extractentr.py", line 25
    print __doc__
                ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(__doc__)?
ADD REPLYlink written 11 weeks ago by Shahzad10

I updated the code to Python3.

ADD REPLYlink written 11 weeks ago by a.zielezinski8.9k
1
gravatar for Brian Bushnell
2.6 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

You can use BBMap's FilterByName tool like this:

filterbyname.sh in=x.fa out=y.fa names=z.fa include=t substring

That will send everything to z.fa with headers matching those between x.fa and y.fa. The flag "substring" allows substring matches, which seems to be what you are seeking.

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Brian Bushnell16k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1989 users visited in the last hour