Question

Get Gene Coding Sequence Using Gene Name/Id In Biopython

4


11.7 years ago

Ash ▴ 40

Maybe this is something really obvious, but what's the best way to get the coding sequence of a gene (main/reference isoform, if that makes a difference) with Biopython when you have just the gene name or gene ID?

You can, obviously, get the coding region's locations, parse that information, and pull the coding sequence from the genome, but there's got to be a better way? Is the full coding sequence not stored somewhere, or accessible through a single call rather than building from scratch based on positional information?

biopython ncbi • 11k views

updated 11.7 years ago by Peter 6.0k • written 11.7 years ago by Ash ▴ 40

score 5 · Answer 1 · 2012-08-21

If you have the gene name or gene ID as used by the NCBI, you could use Bio.Entrez to connect to the NCBI Entrez web API and download the sequence (see the EFetch call).
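As a rough sketch of the Entrez route, something like the following might work. It assumes the gene has a RefSeq mRNA record in the NCBI nucleotide database and that EFetch's `fasta_cds_na` return type gives you the annotated CDS directly. The helper names `build_mrna_query` and `fetch_cds`, the search-term filters, and the e-mail address are illustrative choices, not anything prescribed by Biopython or NCBI.

```python
from Bio import Entrez, SeqIO

# Placeholder address; NCBI asks for a real contact e-mail.
Entrez.email = "you@example.org"


def build_mrna_query(gene_name, organism):
    """Build an Entrez search term for RefSeq mRNA records of one gene.

    The field tags and filters used here are one reasonable choice,
    not the only way to narrow the search.
    """
    return (
        f"{gene_name}[Gene Name] AND {organism}[Organism] "
        "AND refseq[filter] AND biomol_mrna[PROP]"
    )


def fetch_cds(gene_name, organism, retmax=1):
    """Return CDS records for the first matching mRNA hit(s).

    Needs network access; returns an empty list if nothing matches.
    """
    # Step 1: ESearch to turn the gene name into nucleotide record IDs.
    handle = Entrez.esearch(
        db="nucleotide",
        term=build_mrna_query(gene_name, organism),
        retmax=retmax,
    )
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not ids:
        return []

    # Step 2: EFetch the coding sequence(s) as nucleotide FASTA.
    handle = Entrez.efetch(
        db="nucleotide",
        id=",".join(ids),
        rettype="fasta_cds_na",
        retmode="text",
    )
    records = list(SeqIO.parse(handle, "fasta"))
    handle.close()
    return records
```

Calling, say, `fetch_cds("BRCA1", "Homo sapiens")` should then return a list of `SeqRecord` objects. Which isoform comes back first depends on NCBI's result ordering, so check the record ID if you need a specific transcript.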

If you have the gene name or gene ID and a matching GenBank/EMBL format file (e.g. for the genome or chromosome), you should be able to parse that (with Bio.SeqIO), find the feature of interest (a SeqFeature object), and use the feature object's extract method to pull out the sequence (it takes care of the coordinates and strand for you).
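A minimal sketch of the feature-based route is below. To keep it self-contained it builds a toy record in memory instead of parsing a file. With real data you would load the record with something like `SeqIO.read("genome.gb", "genbank")`, where the filename is hypothetical. The helper `find_cds` and the gene name `toyA` are made up for illustration, and matching on the `gene` qualifier is an assumption; your file may identify the gene by `locus_tag` or a `db_xref` instead.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord


def find_cds(record, gene_name):
    """Return the first CDS feature whose /gene qualifier matches, else None."""
    for feature in record.features:
        if feature.type == "CDS" and gene_name in feature.qualifiers.get("gene", []):
            return feature
    return None


# Toy record standing in for a parsed GenBank/EMBL entry.
# The CDS sits on the reverse strand at positions 3..12 (0-based, end-exclusive).
record = SeqRecord(Seq("CCCTTAGGCCATGGG"), id="toy")
record.features.append(
    SeqFeature(
        FeatureLocation(3, 12, strand=-1),
        type="CDS",
        qualifiers={"gene": ["toyA"]},
    )
)

cds = find_cds(record, "toyA")
# extract() slices the parent sequence and reverse-complements it
# because the feature is on the minus strand.
cds_seq = cds.extract(record.seq)
print(cds_seq)  # ATGGCCTAA
```

For a spliced gene the CDS location is a compound (join) location, and `extract` should stitch the exons together in the same way.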

For both those operations, I refer you to the Tutorial - http://biopython.org/DIST/docs/tutorial/Tutorial.html

If neither of those apply, then what kind of gene name/ID do you have?