Retrieving gene ID using transcripts ID from Entrez database using CLI or Batch Entrez
4
0
Entering edit mode
2.3 years ago
hans ▴ 20

Hello I have a long (>150K) list of transcript IDs like this (each ID in new line):

XM_044462096.1 XM_044462097.1 XM_044462098.1 XM_044462099.1 XM_044462100.1 XM_044462101.1 XM_044462102.1 XM_044462103.1 XM_044462104.1 XM_044462105.1

How do I retrieve from Entrez database the Gene ID of these transcripts? Gene ID like LOC122345. I can use NCBI tools CLI. I could not find in which database this information is present. Thank you

database ID gene transcript Entrez • 2.4k views
ADD COMMENT
1
Entering edit mode
2.3 years ago
GenoMax 143k

Using EntrezDirect:

$ for i in `cat ./id`; do printf ${i}"\t"; esearch -db gene -query ${i} | esummary | xtract -pattern DocumentSummary -element Name; done
XM_044462096.1  LOC123038202
XM_044462097.1  LOC123038204
XM_044462098.1  LOC123038205
ADD COMMENT
0
Entering edit mode

Thank you, it works fine. However, it is very slow. It took 60 sec for 10 IDs, I do not know where the bottleneck is.

ADD REPLY
0
Entering edit mode

Be sure to sign up for NCBI API Key. Otherwise you could get IP banned.

ADD REPLY
0
Entering edit mode

I tried :

epost -db gene -format acc -input dic.names |esummary | xtract -pattern DocumentSummary -element Name

It worked much faster but it retrieved only the gene accessions and without the without the transcripts accessions.

ADD REPLY
1
Entering edit mode
2.3 years ago
MirianT_NCBI ▴ 730

Hi,

You can use NCBI Datasets to get the information and jq to retrieve your search term and the gene id. Using your sample list as an example, here's how to do it (I'm assuming you would like to save the output in a separate file)

gene.list

XM_044462096.1 
XM_044462097.1 
XM_044462098.1 
XM_044462099.1 
XM_044462100.1 
XM_044462101.1 
XM_044462102.1 
XM_044462103.1 
XM_044462104.1 
XM_044462105.1

Command:

cat gene.list | while read GENE; do 
  datasets summary gene accession "${GENE}" | jq -r '[.genes[] 
  | .query[],.gene.gene_id] 
  | @tsv' >> accession_id.tsv; 
done

Result:

XM_044462096.1  123038202
XM_044462097.1  123038204
XM_044462098.1  123038205
XM_044462099.1  123038203
XM_044462100.1  123038207
XM_044462101.1  123038208
XM_044462102.1  123038211
XM_044462103.1  123038212
XM_044462104.1  123038213
XM_044462105.1  123038215

If you prefer to have only the gene-ids, you can remove this part .query[], from the command.

Feel free to reach out if you have any additional questions. :)

ADD COMMENT
0
Entering edit mode

Instead of a loop, you can use the --inputfile instead. The jq command also changes a bit (no need for .genes[] in the beginning. Like this:

datasets summary gene accession --inputfile gene.list --as-json-lines |  jq -r '[.query[],.gene.gene_id] | @tsv'
XM_044462096.1  123038202
XM_044462099.1  123038203
XM_044462097.1  123038204
XM_044462098.1  123038205
XM_044462100.1  123038207
XM_044462101.1  123038208
XM_044462102.1  123038211
XM_044462103.1  123038212
XM_044462104.1  123038213
XM_044462105.1  123038215
ADD REPLY
0
Entering edit mode

Thank you, MirianT_NCBI , Once I get gene ID or accessions, Is there a way within NCBI databases to retrieve GO terms for the genes?

ADD REPLY
1
Entering edit mode

Hi hans, Unfortunately, we don't have GO term info in our data report at this moment. You can find that information in this file from our FTP site.

I checked the list of taxids, and there are 42 species with GO term information there. Based on your sample list, all the accessions were from wheat, and I didn't find any entries for that species in the gene2go file.

Besides GO terms, is there any other features that would be useful for your research? I'll pass your suggestions to our team.

Thanks!

ADD REPLY
0
Entering edit mode

I tried your suggestion. datasets & jq works, but very slow, maybe narrow bandwidth at my end. It took 16 hours to retrieve 80k IDs. Thank you MirianT_NCBI

ADD REPLY
1
Entering edit mode
2.3 years ago
vkkodali_ncbi ★ 3.7k

I suggest you check out the gene2refseq.gz file located here: https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz It has the following fields:

                                #tax_id [  1]: 4565
                                 GeneID [  2]: 123038202
                                 status [  3]: MODEL
       RNA_nucleotide_accession.version [  4]: XM_044462096.1
                      RNA_nucleotide_gi [  5]: 2123643831
              protein_accession.version [  6]: XP_044318031.1
                             protein_gi [  7]: 2123643832
   genomic_nucleotide_accession.version [  8]: NC_057797.1
                  genomic_nucleotide_gi [  9]: 2097423184
start_position_on_the_genomic_accession [ 10]: 755260944
  end_position_on_the_genomic_accession [ 11]: 755264329
                            orientation [ 12]: +
                               assembly [ 13]: Reference IWGSC CS RefSeq v2.1 Primary Assembly
       mature_peptide_accession.version [ 14]: -
                      mature_peptide_gi [ 15]: -
                                 Symbol [ 16]: LOC123038202

You can use something like grep as follows:

$ cat accs.txt 
XM_044462096.1
XM_044462097.1
...
$ zgrep -f accs.txt -w gene2refseq.gz
4565    123038202       MODEL   XM_044462096.1  2123643831      XP_044318031.1  2123643832      NC_057797.1     2097423184      755260944       755264329       +       Reference IWGSC CS RefSeq v2.1 Primary Assembly -       -       LOC123038202
4565    123038203       MODEL   XM_044462099.1  2123578877      XP_044318034.1  2123578878      NC_057794.1     2097423188      98080986        98086310        -       Reference IWGSC CS RefSeq v2.1 Primary Assembly -       -       LOC123038203
...

You may still end up with a small number of RefSeqs that are not in the gene2refseq file because they are no long live. For those, you can use the EntrezDirect approach mentioned above.

ADD COMMENT
0
Entering edit mode

Suggestion for NCBI staff, vkkodali_ncbi @miriant_ncbi. There is a nice table linking proteins and gene ID for each genome e.g https://www.ncbi.nlm.nih.gov/genome/browse/#!/proteins/11/1613926%7CTriticum%20aestivum/. Adding the transcripts accesions to this table will be of a great help for those that do master CLI.
Thank you

ADD REPLY
0
Entering edit mode
2.3 years ago
hans ▴ 20

The fastest way was to download https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz. However, instead of using:

zgrep -f trancriptsCS.list -w gene2refseq.gz

That took very long time, I used the following in R:

library(data.table)
db<-fread('gene2refseq.gz')
id<-fread('trancriptsCS.list',head=F)
colnames(db)[1]<-'tax'
dw<-db[tax==4565,.(RNA_nucleotide_accession.version,GeneID)]
id2gene<-dw[id,on=c(RNA_nucleotide_accession.version='V1'),nomatch=NULL]`

It took less than 10 minutes to read the file and one minute to join the data.tables

ADD COMMENT

Login before adding your answer.

Traffic: 2668 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6