Question

Extracting organism and seq from fasta

0

Entering edit mode

2.3 years ago

joshuazvid • 0

Hi,

I am trying to extract sequences from a fasta file from a database with a specific organism species keyword from a .txt file containing the relevant headers. Do you know how I can do this in python as the biopython guide I've looked at basically said "you're screwed if your files aren't .gb".

The header looks like this:

>VFG000361(gb|WP_000982866) (ybtU) yersiniabactin biosynthetic protein YbtU [Yersiniabactin (VF0136) - Nutritional/Metabolic factor (VFC0272)] [Yersinia pestis CO92]

my current script only splits up the header by word in header but this returns different keywords as opposed to just the organism name each time by which I can filter the fasta.

all_species = []
for seq_record in SeqIO.parse("InputFile.fas", "fasta"):
    all_species.append(seq_record.description.split()[8])
print(all_species)

biopython python fasta • 1.6k views

ADD COMMENT • link updated 3 months ago by Ram 43k • written 2.3 years ago by joshuazvid • 0

0

Entering edit mode

Your post is confusing. Post example record from text file and example fasta record. From OP, I can see fasta header and code to parse fasta file. Post example input and expected output.

from Bio import SeqIO
import re

all_species = [re.split("\[|\]", seq_record.description).pop(-2) for seq_record in SeqIO.parse("test.fa", "fasta")]
print (all_species)

Should give you the species in the fasta file assuming that organism name is present always at the end (Yersinia pestis CO92 within [] brackets)

You do not need to use biopython libraries if you are parsing only headers starting with >. Following should work:

import re

f= open("test.fa", "r")
all_species=[re.split("\[|\]", i).pop(-2) for i in f  if i.startswith(">")]
print(all_species)

ADD REPLY • link 2.3 years ago by cpad0112 21k

0

Entering edit mode

To provide context for readers finding this, the reason the BioPython guide presents the view 'Here Be Dragons' in regards to getting an organism's name from a FASTA-formatted sequence entry is because the description line has no standard in the FASTA format. Thus, there is no guarantee that specific code will work. If your FASTA entries have description lines that follow a pattern, such as the common one where the organism name is in brackets at the end that @cpad0112's solution addresses, then you can use custom Python code to do the parsing task once you have the description line. (Actually, I have a script here called add_source_organism_info_to_FASTA.py that will do this. It uses the GI number to look up the Genbank entry and get the information from there, and so it has to have that or it fails, and so it isn't universal. Plus, as posted here GI numbers are being phased out.)

That being said you may want to build in to the script giving feedback or a warning when encountering ones that don't follow the pattern necessary.

Also, as an alternative to using BioPython when you are just trying to read description lines in FASTA files, there's pyfaidx that is specifically meant for working with FASTA formatted sequence files. You can iterate on the entries and get the description line as text for each using record.name or record.long_name depending if you use genes = Fasta('tests/data/genes.fasta') or genes = Fasta('tests/data/genes.fasta', read_long_names=True), respectively, to bring the entries into memory from the file. Look for the corresponding example under Usage by searching for the text 'record.name' here.

ADD REPLY • link 2.3 years ago by Wayne ★ 2.0k

1

Entering edit mode

pyfaidx fails to read fasta files if they have identical headers where as a generic parser or biopython fasta parser can read files with identical headers. `

ADD REPLY • link 2.3 years ago by cpad0112 21k