How to parse a gene's location using biopython
Entering edit mode
15 months ago
Zoya • 0

Hi, I'm trying to extract gene location information for certain genes across multiple bacteria. Currently I'm using this set up to retrieve the gene information:

from Bio import SeqIO, Entrez = ''
# example is E. coli K-12 reference sequence
handle = Entrez.efetch(db="nuccore", id='NC_000913', rettype='fasta_cds_na', retmode="text")
record = SeqIO.parse(handle, "fasta")
for gene in record:

with parse_location() being a function I wrote myself in order to extract the start and end point of a gene on the sequence.

It is my understanding that with the use of SeqIO.parse each gene in the record comes with features, including features for the start and end point of the gene. Below is the output of a random gene in NC_000913:

ID: lcl|NC_000913.3_cds_NP_418812.1_4306
Name: lcl|NC_000913.3_cds_NP_418812.1_4306
Description: lcl|NC_000913.3_cds_NP_418812.1_4306 [gene=ytjC] [locus_tag=b4395] [db_xref=UniProtKB/Swiss-Prot:P0A7A2] [protein=putative phosphatase] [protein_id=NP_418812.1] [location=4633797..4634444] [gbkey=CDS]
Number of features: 0

My parse_location() function works okay, but I'm worried about edge cases. Is there anyway to have biopython do it for me?

python biopython EUtils • 890 views
Entering edit mode

Within biopython's object model, features have a specific location start and end attribute.

If you take a look at around line 40 in the github gist here: Slicing Genbank File by 'gene_id' range

you can see the syntax at work.

Regarding edge cases, biopython should catch most mangled files or dodgy features, so you probably don't have to worry too much as long as you're using standard formats. Errors in the input files would be the most likely issue I think.

Depending on the feature type (e.g. CDS etc) you may find the locations are 'split' if they happen to span the end of a sequence, so that's a possible edge case to be on the lookout for, but biopython should still handle it fairly gracefully.

Entering edit mode

Thanks, I knew this was done by biopython somewhere somehow, but it turns out I was using the wrong rettype. I tried 'fasta_cds_na' and this does get you all information on the genes on the sequence, but does not get parsed like in your example, but instead gives you what I got: zero features. I should have used rettype='gbwithparts' in order to get all information on for each gene on the requested sequence. Rettype 'gb' only gets you some information about the sequence, but not the genes.


Login before adding your answer.

Traffic: 2402 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6