Question

What is an NCBI Gene ID, where can I find it, and how do I convert it to an Entrez ID?

1

5.1 years ago

mnazir ▴ 10

Hi,

I am new to bioinformatics and have only recently started learning. I have a question about gene IDs, if someone can guide me. I want to upload RNA-Seq data to KeggExp to draw pathways based on a differential expression analysis. The input file requires gene symbols and Gene IDs, and I don't know where to find these for the microorganism I am working with, Clostridium beijerinckii ATCC 35702. From papers and other sources I understand that a Gene ID is a number that uniquely identifies a gene, but I am struggling to find where to look it up.

Thanks a lot

RNA-Seq • 3.5k views

ADD COMMENT • link updated 17 months ago by Pegasus ▴ 110 • written 5.1 years ago by mnazir ▴ 10

0

The genome page for this bacterium is available here. Follow the "GFF" link (search for that word on the page) to find the annotation file for this genome.

GeneIDs are present in this file, but there are no gene symbols. Here is an example snippet from the file.

NZ_CP010086.2   Protein Homology        CDS     226406  228025  .       +       0       ID=cds207;Parent=gene235;Dbxref=Genbank:WP_017209683.1,GeneID:31661091;Name=WP_017209683.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_008426658.1;product=phosphoenolpyruvate--protein phosphotransferase;protein_id=WP_017209683.1;transl_table=11
NZ_CP010086.2   RefSeq  gene    228170  229435  .       +       .       ID=gene236;Dbxref=GeneID:31661092;Name=LF65_RS01175;gbkey=Gene;gene_biotype=protein_coding;locus_tag=LF65_RS01175;old_locus_tag=LF65_00235
NZ_CP010086.2   Protein Homology        CDS     228170  229435  .       +       0       ID=cds208;Parent=gene236;Dbxref=Genbank:WP_041893490.1,GeneID:31661092;Name=WP_041893490.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_017209682.1;product=hypothetical protein;protein_id=WP_041893490.1;transl_table=11
NZ_CP010086.2   RefSeq  gene    229596  230771  .       +       .       ID=gene237;Dbxref=GeneID:31661093;Name=LF65_RS01180;gbkey=Gene;gene_biotype=protein_coding;locus_tag=LF65_RS01180;old_locus_tag=LF65_00236
NZ_CP010086.2   Protein Homology        CDS     229596  230771  .       +       0       ID=cds209;Parent=gene237;Dbxref=Genbank:WP_041893492.1,GeneID:31661093;Name=WP_041893492.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_011967550.1;product=beta-aspartyl-peptidase;protein_id=WP_041893492.1;transl_table=11
NZ_CP010086.2   RefSeq  gene    230842  231447  .       -       .       ID=gene238;Dbxref=GeneID:31661094;Name=LF65_RS01185;gbkey=Gene;gene_biotype=protein_coding;locus_tag=LF65_RS01185;old_locus_tag=LF65_00237
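
For anyone who wants to pull these IDs out programmatically rather than by eye, here is a minimal Python sketch. It assumes the downloaded GFF3 file is tab-separated with nine columns (the snippet above shows the tabs as spaces) and that the GeneID sits in the Dbxref attribute of each gene row, as it does in this file. The function names are made up for illustration. Note that NCBI's GeneID is the identifier used by the Entrez Gene database, so the number after "GeneID:" is what tools usually mean by an Entrez (Gene) ID.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def gene_ids_by_locus_tag(gff_lines):
    """Map locus_tag -> GeneID for every 'gene' row that carries both."""
    mapping = {}
    for line in gff_lines:
        # Skip blank lines and GFF header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != "gene":
            continue
        attrs = parse_attributes(columns[8])
        # Dbxref may hold several comma-separated cross-references.
        for xref in attrs.get("Dbxref", "").split(","):
            if xref.startswith("GeneID:") and "locus_tag" in attrs:
                mapping[attrs["locus_tag"]] = xref.split(":", 1)[1]
    return mapping


# One gene row from the snippet above, with the tabs restored.
example = [
    "NZ_CP010086.2\tRefSeq\tgene\t228170\t229435\t.\t+\t.\t"
    "ID=gene236;Dbxref=GeneID:31661092;Name=LF65_RS01175;gbkey=Gene;"
    "gene_biotype=protein_coding;locus_tag=LF65_RS01175;"
    "old_locus_tag=LF65_00235",
]
print(gene_ids_by_locus_tag(example))  # {'LF65_RS01175': '31661092'}
```

To run it on a real file, pass an open file handle instead of the example list, e.g. `gene_ids_by_locus_tag(open("annotation.gff"))`, where the file name is a placeholder.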

ADD REPLY • link 5.1 years ago by GenoMax 147k

0

Thanks a lot for your comment, I really appreciate it. NZ_CP010086.2 is strain NCIMB14988; however, I am looking for strain ATCC 35702 (accession number CP006777.1), which I am not able to find. Does that mean the Gene ID information for this strain has not been submitted, and can I use the NZ_CP010086.2 strain as a reference instead? Or is there something I am missing?

ADD REPLY • link 5.1 years ago by mnazir ▴ 10

0

Here is the page for ATCC 35702. The GFF file for this strain does not have GeneID or locus information.

NZ_CP006777.1   Protein Homology        CDS     31554   32342   .       +       0       ID=cds19;Parent=gene29;Dbxref=Genbank:WP_011967385.1;Name=WP_011967385.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_011967385.1;product=Cof-type HAD-IIB family hydrolase;protein_id=WP_011967385.1;transl_table=11
NZ_CP006777.1   RefSeq  gene    32490   34085   .       -       .       ID=gene30;Name=CBS_RS00155;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CBS_RS00155;old_locus_tag=Cbs_0021
NZ_CP006777.1   Protein Homology        CDS     32490   34085   .       -       0       ID=cds20;Parent=gene30;Dbxref=Genbank:WP_011967386.1;Name=WP_011967386.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_015390206.1;product=heme ABC transporter ATP-binding protein;protein_id=WP_011967386.1;transl_table=11
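
Even without GeneIDs, the rows above still link each CDS to its gene through the Parent attribute, so the RefSeq protein accession and product can be tabulated per gene. The following is only a rough Python sketch of that idea. It assumes a tab-separated nine-column GFF3 file in which each gene row appears before its CDS rows (as in the snippet), and the function names are hypothetical. Whether a downstream tool accepts these accessions is a separate question that this sketch does not address.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def proteins_by_locus_tag(gff_lines):
    """Map each gene's identifier to (protein_id, product) via CDS Parent."""
    gene_locus = {}  # gene feature ID (e.g. 'gene30') -> locus_tag
    result = {}
    for line in gff_lines:
        # Skip blank lines and GFF header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue
        attrs = parse_attributes(columns[8])
        if columns[2] == "gene" and "ID" in attrs and "locus_tag" in attrs:
            gene_locus[attrs["ID"]] = attrs["locus_tag"]
        elif columns[2] == "CDS" and attrs.get("Parent") in gene_locus:
            tag = gene_locus[attrs["Parent"]]
            result[tag] = (attrs.get("protein_id"), attrs.get("product"))
    return result


# The gene30 row and its CDS row from the snippet above, tabs restored.
example = [
    "NZ_CP006777.1\tRefSeq\tgene\t32490\t34085\t.\t-\t.\t"
    "ID=gene30;Name=CBS_RS00155;gbkey=Gene;gene_biotype=protein_coding;"
    "locus_tag=CBS_RS00155;old_locus_tag=Cbs_0021",
    "NZ_CP006777.1\tProtein Homology\tCDS\t32490\t34085\t.\t-\t0\t"
    "ID=cds20;Parent=gene30;Dbxref=Genbank:WP_011967386.1;"
    "Name=WP_011967386.1;gbkey=CDS;inference=COORDINATES: similar to AA "
    "sequence:RefSeq:WP_015390206.1;product=heme ABC transporter "
    "ATP-binding protein;protein_id=WP_011967386.1;transl_table=11",
]
print(proteins_by_locus_tag(example))
```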

ADD REPLY • link 5.1 years ago by GenoMax 147k

0

Thanks a lot. Can you share the link where you are finding this information? I go to NCBI and select "Genome" from the database drop-down menu, but I don't get the same hit as you. Why is that? Also, can you please guide me on how to convert a Gene ID to the corresponding Entrez ID?

Is it possible that a strain without Gene IDs will also lack the corresponding Entrez ID numbers, or can it still have Entrez IDs?

ADD REPLY • link 5.1 years ago by mnazir ▴ 10

0

I provided direct links in my comments above (the blue highlighted text), which should open the right page when you click on them.

Some of the annotation you see here may have been generated by automated programs. Someone has to manually do the work to verify that annotation. That is one of the reasons it is cheap to sequence things but much more expensive to annotate them properly.

ADD REPLY • link 5.1 years ago by GenoMax 147k

0

Yes, you're right: manual annotation is needed to bring all the pieces together, which is what I am trying to do for our lab strain. I have been adding information manually to the table for our strain. Right now I am struggling with this GeneID to Entrez ID conversion, because I wish to draw pathways of differentially expressed genes in KeggExp and it accepts only Entrez IDs. I hope to be able to add a better-annotated table for at least one strain so that people can find everything in one place in the future. Thanks for your valued comments.

ADD REPLY • link 5.1 years ago by mnazir ▴ 10

1

I am not sure what you are planning to do with this, because KeggExp does not even support Clostridium spp. IDs:

ADD REPLY • link 5.1 years ago by Kevin Blighe 88k

1

I came across this paper, "Specialized activities and expression differences for Clostridium thermocellum biofilm and planktonic cells" (https://www.nature.com/articles/srep43583), where they used KeggExp to show differential expression in pathways. This is what I am trying to learn. The KeggExp site mentions that you can choose any closely related species from the drop-down menu and go ahead with the analysis. Since it asks for the corresponding Entrez ID and Gene ID, I think it will match them to the right organism.

ADD REPLY • link 5.1 years ago by mnazir ▴ 10

0

Hi,

I am facing exactly the same issue. Have you had any luck solving it? If so, please share here.

Thanks

ADD REPLY • link 17 months ago by Pegasus ▴ 110

0

I suggest you create a new question, detailing your problem with approaches you have taken and issues you're having, instead of necro-bumping a 4-year-old question with unclear comments.

ADD REPLY • link 17 months ago by barslmn ★ 2.3k

0

@Pegasus you are not going to be able to get Entrez IDs for the data you have (pretty sure this question is following up on "Converting RefSeq protein accession IDs into entreZ IDs").

ADD REPLY • link 17 months ago by GenoMax 147k

0

Hi GenoMax,

Actually, I asked the same question here because I couldn't find a solution to the main question, which I asked in detail in the post you mentioned. If there are no Entrez IDs, how can I take the RNA-seq analysis forward to GO and KEGG pathway analysis?

ADD REPLY • link 17 months ago by Pegasus ▴ 110