Question

extracting Gene ID from Genbank accession numbers in blast tabular file

0

Entering edit mode

8.8 years ago

zgayk ▴ 90

Hello,

I have a blast 24 column tabular file constructed after blasting the longest contigs in our genome assembly. The file shows the aligned query and subject sequences, as well as the genbank accession number of the subject sequence. The results for the first two sequence hits are shown below:

Since the gi numbers are not particularly useful by themselves, I am wondering how difficult it would be to develop a python script that would look up each gi number and replace it with the actual gene id. I don't care if this would be in a separate file, as I'm assuming I could merge these back into the main tabular file.

Any help you might have would be greatly appreciated.

Zach Gayk

blast • 4.3k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 8.8 years ago by zgayk ▴ 90

0

Entering edit mode

..

ADD REPLY • link updated 18 months ago by Ram 43k • written 8.8 years ago by zgayk ▴ 90

Ram · Answer 1 · 2015-06-30

0

Entering edit mode

8.8 years ago

michael.ante ★ 3.8k

You can use the command line 'join' for this purpose. You need next to the blast table, a table linking gi number to gene name. Order both tables after their gi number and then join them with it. Please see for instance these examples here.

In python, you can read both files and store them as dictionaries and link them over the key.

Cheers,
Michael

ADD COMMENT • link updated 18 months ago by Ram 43k • written 8.8 years ago by michael.ante ★ 3.8k

1

Entering edit mode

To precise a bit this answer: go to http://www.ensembl.org/biomart/martview/, select EnsemblGene80 and your organism.Go to the Attributes panel, and select Features. In the GENE submenu, select the Ensembl Gene ID, Associated Gene Name. In the EXTERNAL/External References submenu, select EntrezGene ID and EMBL (Genbank) ID.

Download the results (no need for filters, export as TSV for ease of use), and do a join using the EMBL ID column as key. I used Ensembl IDs here, but you get the idea: grab a conversion table, and use it to join your genomic data to a useable ID using the Genbank ID.

ADD REPLY • link updated 16 months ago by Ram 43k • written 8.8 years ago by cyril-cros ▴ 950

0

Entering edit mode

Would this still work of many of my blast results are for non-model organisms for which there is not a complete ensembl gene dataset?

Thanks,
Zach

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.8 years ago by zgayk ▴ 90

0

Entering edit mode

I looked up your example about joining and that seems like I should be able to do that. But what I was originally asking was whether there is a way to automate the process of retrieving the genbank gene ids from the gi numbers listed. For each blast hit there is a gi number of the subject sequence that had the highest percent identity match to a contig in my assembly. Because there are thousands of these blast hits, it would be much more useful to find all of the actual gene ids for each blast match so that I can easily look at which gene matches to a particular contig.

Thanks,
Zach

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.8 years ago by zgayk ▴ 90

0

Entering edit mode

Hi Zach,

you can use awk/python/perl/cut to extract a table containing only the gi number and the query id. E.g. creating an associative array with gi and query as the key value. Subsequently, you can use the join command to combine this table with the gi number to gene name table. Which then should be a three column table with contig id, gi number, and gene name.

Cheers,

Michael

ADD REPLY • link 8.8 years ago by michael.ante ★ 3.8k