Question: extracting Gene ID from Genbank accession numbers in blast tabular file
0
gravatar for zgayk
4.2 years ago by
zgayk90
United States
zgayk90 wrote:

 

 

 

Hello,

I have a blast 24 column tabular file constructed after blasting the longest contigs in our genome assembly. The file shows the aligned query and subject sequences, as well as the genbank accession number of the subject sequence. The results for the first two sequence hits are shown below: 

Since the gi numbers are not particularly useful by themselves, I am wondering how difficult it would be to develop a python script that would look up each gi number and replace it with the actual gene id. I don't care if this would be in a separate file, as I'm assuming I could merge these back into the main tabular file.

Any help you might have would be greatly appreciated.

Zach Gayk

                                   
blast • 2.8k views
ADD COMMENTlink modified 4.2 years ago by michael.ante3.3k • written 4.2 years ago by zgayk90

 

..

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by zgayk90
0
gravatar for michael.ante
4.2 years ago by
michael.ante3.3k
Austria/Vienna
michael.ante3.3k wrote:

You can use the commandline 'join' for this purpose. You need next to the blast table, a table linking gi number to gene name. Order both tables after their gi number and then join them with it. Please see for instance these examples here.

In python, you can read both files and store them as dictionaries and link them over the key.

Cheers,

Michael

ADD COMMENTlink written 4.2 years ago by michael.ante3.3k
1

To precise a bit this answer: go to http://www.ensembl.org/biomart/martview/, select EnsemblGene80 and your organism.Go to the Attributes panel, and select Features. In the GENE submenu, select the Ensembl Gene ID, Associated Gene Name. In the EXTERNAL/External References submenu, select EntrezGene ID and EMBL (Genbank) ID.

Download the results (no need for filters, export as TSV for ease of use), and do a join using the EMBL ID column as key. I used Ensembl IDs here, but you get the idea: grab a conversion table, and use it to join your genomic data to a useable ID using the Genbank ID.

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by cyril-cros890

Would this still work of many of my blast results are for non-model organisms for which there is not a complete ensembl gene dataset?

Thanks,

Zach

ADD REPLYlink written 4.2 years ago by zgayk90

 

I looked up your example about joining and that seems like I should be able to do that. But what I was originally asking was whether there is a way to automate the process of retrieving the genbank gene ids from the gi numbers listed. For each blast hit there is a gi number of the subject sequence that had the highest percent identity match to a contig in my assembly. Because there are thousands of these blast hits, it would be much more useful to find all of the actual gene ids for each blast match so that I can easily look at which gene matches to a particular contig. 

Thanks,

Zach

ADD REPLYlink written 4.2 years ago by zgayk90

Hi Zach,

you can use awk/python/perl/cut to extract a table containing only the gi number and the query id. E.g. creating an associative array with gi and query as the key value. Subsequently, you can use the join command to combine this table with the gi number to gene name table. Which then should be a three column table with contig id, gi number, and gene name.

Cheers,

Michael

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by michael.ante3.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1522 users visited in the last hour