Question: Converting BLAST Alignments (NCBI database) to Gene ID
jeremy.cox.290 (United States) wrote 4.0 years ago:

Hello All,

This is probably a "newbie" question.

I am trying to take some standard BLAST output and map the alignments to Gene IDs, so that I can do enrichment/network analysis.

I am doing something out of the ordinary: I am looking at multiple microorganisms at once. I think this might be a major difficulty in converting, since some databases may not include homologs or hypothetical proteins. However, I am very new to this problem and have no previous knowledge of Gene ID systems.

Here is my output from blasting against an NCBI database. (Obviously, I have thousands of lines; this is just a random example.)

queryNAME  gi|367018053|ref|NC_016508.1|   90.20   51      5       0       1       51      788427  788477  1e-09   67.6

 

I can easily find this in the NCBI nuccore database:

http://www.ncbi.nlm.nih.gov/nuccore/367018053

and from there the corresponding gene record:

http://www.ncbi.nlm.nih.gov/gene?cmd=Retrieve&dopt=full_report&list_uids=11505342

I can easily parse all of this from NCBI using EDirect:

efetch -db nuccore -id "NC_016508" -mode xml

 

So, I now have three names:

GI    367018053
ACCESSION NC_016508
Gene symbol   TDEL0H00120
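
With thousands of lines, the GI and accession can be pulled out of each hit programmatically instead of by hand. The following is only a minimal sketch: it assumes the output is the standard 12-column BLAST tabular format (which the example line above matches), that query names contain no whitespace, and that the subject ID follows the gi|...|ref|...| pattern shown above. The function name parse_hit and the dictionary keys are my own choices, not part of any tool.

```python
# Column names of the standard 12-column BLAST tabular output.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def parse_hit(line):
    """Parse one tabular BLAST line into a dict (hypothetical helper)."""
    values = line.split()  # assumes no whitespace inside the query name
    if len(values) != len(FIELDS):
        raise ValueError("expected %d columns, got %d" % (len(FIELDS), len(values)))
    hit = dict(zip(FIELDS, values))
    for key in INT_FIELDS:
        hit[key] = int(hit[key])
    for key in FLOAT_FIELDS:
        hit[key] = float(hit[key])

    # Subject IDs such as gi|367018053|ref|NC_016508.1| alternate tag and value.
    parts = hit["sseqid"].split("|")
    tags = dict(zip(parts[0::2], parts[1::2]))
    hit["gi"] = tags.get("gi")
    versioned = tags.get("ref")
    hit["accession_version"] = versioned
    # Drop the ".1" version suffix to get the bare accession.
    hit["accession"] = versioned.split(".")[0] if versioned else None
    return hit


line = ("queryNAME  gi|367018053|ref|NC_016508.1|   90.20   51      5       0"
        "       1       51      788427  788477  1e-09   67.6")
hit = parse_hit(line)
print(hit["gi"], hit["accession"], hit["sstart"], hit["send"])
# prints: 367018053 NC_016508 788427 788477
```

The subject start and end (columns 9 and 10) are kept alongside the identifiers because they locate the alignment within the subject record, which the accession alone does not.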

There are many posts about the plentiful Gene ID converters available, e.g. https://www.biostars.org/p/22/. However, I seem to have a "Catch-22": I don't know which database these IDs belong to, and that is exactly what a converter needs in order to map them to another system. (I mean, I generally know what these are, but apparently I need to be very specific in selecting from a big list of possibilities.) On the other hand, maybe I am being unsuccessful because this is a hypothetical gene, so there is nothing to convert it to in other lists.

Can anyone offer some guidance on (1) how to convert these successfully and (2) more generally, are there special issues to consider when not using a single organism?

 


I've used tblastx against RefSeq databases for similar work. Are you using one of these?

written 4.0 years ago by burkhart.joshua30

4.0 years ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche21k wrote:

All these are GenBank identifiers. They are explained here.


Yes. So, for example, I would expect these IDs to convert using the UniProt converter:

http://www.uniprot.org/uploadlists/

However, submitting them as "GI number*" or "EMBL/GenBank/DDBJ" returns no results.

written 4.0 years ago by jeremy.cox.290

It looks like the IDs you get don't correspond/map to proteins. As you point out, your example is a hypothetical gene, so it may not be represented by a protein in UniProt.
If you're trying to identify UniProt proteins, why not blastx your nucleotide sequences directly against a UniProt database?

written 4.0 years ago by Jean-Karim Heriche21k