Question

Extracting fasta sequences with identical header

0

Entering edit mode

7.1 years ago

seta ★ 1.9k

Hi all friends,

I have a large fasta file that most sequences have a identical header (they differ from the length). I usually extracted the sequences of interest with the following python codes:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Extract sequences from a fasta file if their name is in a 'wanted' file.

Wanted file contains one sequence name per line.

Usage:
%program <input_file> <wanted_file> <output_file>"""

import sys
import re

try:
from Bio import SeqIO
except:
print "This program requires the Biopython library"
sys.exit(0)

try:
fasta_file = sys.argv[1]  # Input fasta file
wanted_file = sys.argv[2] # Input wanted file, one gene name per line
result_file = sys.argv[3] # Output fasta file
except:
print __doc__
sys.exit(0)

wanted = set()
with open(wanted_file) as f:
for line in f:
    line = line.strip()
    if line != "":
        wanted.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

with open(result_file, "w") as f:
for seq in fasta_sequences:
    name = seq.id
    if name in wanted and len(seq.seq.tostring()) > 0:
        wanted.remove(name) # Output only the first appearance for a name
        SeqIO.write([seq], f, "fasta")

But the code just extracted one of the sequences with identical header, while I want to all sequences of interest that may have identical headers and adding the sequential number at the beginning or end of identical headers in the extracted fasta file? something like:

header_1
header_2
header_3
etc.

Many thanks

fasta extracting identical header • 3.0k views

ADD COMMENT • link updated 7.1 years ago by Brian Bushnell 20k • written 7.1 years ago by seta ★ 1.9k

1

Entering edit mode

7.1 years ago

Brian Bushnell 20k

You can use BBMap's FilterByName tool like this:

filterbyname.sh in=x.fa out=y.fa names=z.fa include=t substring

That will send everything to z.fa with headers matching those between x.fa and y.fa. The flag "substring" allows substring matches, which seems to be what you are seeking.

ADD COMMENT • link 7.1 years ago by Brian Bushnell 20k

score 3 · Accepted Answer · 2017-03-24

3

Entering edit mode

7.1 years ago

Andrzej Zielezinski 11k

I didn't test it:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Extract sequences from a fasta file if their name is in a 'wanted' file.

Wanted file contains one sequence name per line.

Usage:
%program <input_file> <wanted_file> <output_file>"""

import sys
import re

try:
    from Bio import SeqIO
except:
    print("This program requires the Biopython library")
    sys.exit(0)

try:
    fasta_file = sys.argv[1]  # Input fasta file
    wanted_file = sys.argv[2] # Input wanted file, one gene name per line
    result_file = sys.argv[3] # Output fasta file
except:
    print(__doc__)
    sys.exit(0)

wanted = {} # CHANGE
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted[line] = 0 # CHANGE

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted and len(seq.seq.tostring()) > 0:
            wanted[seq.id] += 1     # CHANGE
            seq.id += '_{0}'.format(wanted[seq.id]) # CHANGE
            #wanted.remove(name) # Output only the first appearance for a name
            SeqIO.write([seq], f, "fasta")

ADD COMMENT • link 4.6 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Thanks, I name it py_ext.py, but it retutned the below error:

File "py_ext.py", line 21
    fasta_file = sys.argv[1] # Input fasta file
             ^
IndentationError: expected an indented block

I tried to convert tab to space, but the error still remain. Would you please check the code?

ADD REPLY • link 7.1 years ago by seta ★ 1.9k

1

Entering edit mode

I edited my answer. The code you pasted in your question has the problem with indentations (no idents after try: or with:) so it doesn't work at all. I would test my code if you provide some sample lines of your input files.

ADD REPLY • link 7.1 years ago by Andrzej Zielezinski 11k

1

Entering edit mode

Thank you very much. It worked well and I accepted it.

ADD REPLY • link 7.1 years ago by seta ★ 1.9k

0

Entering edit mode

File "extractentr.py", line 17
    print "This program requires the Biopython library"
                                                      ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("This program requires the Biopython library")?

and after changing it

 File "extractentr.py", line 25
    print __doc__
                ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(__doc__)?