Does anyone happen to know where I could bulk download all the NCBI gene tables from a site?
I tried looking for them on the NCBI FTP site but couldn't find them. I have been retrieving them via E-utilities, as in the example below, but it has a very strict rate limit and I keep getting restricted.
Thank you so much!
For reference, a gene table looks like this:
Reference GSC_HSeal_1.0 Primary Assembly NW_022589744.1 from: 1024297 to: 1169901
mRNA XM_032429966.1, 41 exons, total annotated spliced exon length: 4699
protein XP_032285857.1, 38 coding exons, annotated AA length: 1448
Exon table for mRNA XM_032429966.1 and protein XP_032285857.1
Genomic Interval Exon  Genomic Interval Coding  Gene Interval Exon  Gene Interval Coding  Exon Length  Coding Length  Intron Length
-----------------------------------------------------------------------------------------------------------------------------------
1024297-1024408                                 1-112                                     112                         1647
1026056-1026143                                 1760-1847                                 88                          2596
1028740-1028805                                 4444-4509                                 66                          1888
1030694-1030809        1030780-1030809          6398-6513           6484-6513             116          30             3578
1034388-1034548        1034388-1034548          10092-10252         10092-10252           161          161            9654
1044203-1044317        1044203-1044317          19907-20021         19907-20021           115          115            2382
1046700-1046830        1046700-1046830          22404-22534         22404-22534           131          131            884
1047715-1047797        1047715-1047797          23419-23501         23419-23501           83           83             1778
1049576-1049719        1049576-1049719          25280-25423         25280-25423           144          144            13887
1063607-1063750        1063607-1063750          39311-39454         39311-39454           144          144            5731
1069482-1069553        1069482-1069553          45186-45257         45186-45257           72           72             7716
1077270-1077764        1077270-1077764          52974-53468         52974-53468           495          495            2040
1079805-1079942        1079805-1079942          55509-55646         55509-55646           138          138            1245
1081188-1081259        1081188-1081259          56892-56963         56892-56963           72           72             4868
1086128-1086271        1086128-1086271          61832-61975         61832-61975           144          144            3422
1089694-1089768        1089694-1089768          65398-65472         65398-65472           75           75             2641
1092410-1092553        1092410-1092553          68114-68257         68114-68257           144          144            2442
1094996-1095067        1094996-1095067          70700-70771         70700-70771           72           72             3131
1098199-1098342        1098199-1098342          73903-74046         73903-74046           144          144            586
1098929-1099000        1098929-1099000          74633-74704         74633-74704           72           72             4876
1103877-1103921        1103877-1103921          79581-79625         79581-79625           45           45             3898
1107820-1107969        1107820-1107969          83524-83673         83524-83673           150          150            2696
1110666-1110737        1110666-1110737          86370-86441         86370-86441           72           72             8609
1119347-1119490        1119347-1119490          95051-95194         95051-95194           144          144            3019
For now, I retrieve the gene tables using this command:
efetch -db gene -id "$(cat blast_files/${first_protein_id}_orthologs_geneids.txt)" -format gene_table > ${first_protein_id}_gene_table.txt
Are you looking up millions of entries? If not, you should not be hitting the limit. As I said in a past answer, you should add a delay (with sleep or something else) so there are pauses between sets of queries.
Yes, I have close to a million entries. I am currently splitting them into chunks of 5 and sleeping for 3 seconds between chunks. I checked, and the rate limit with ncbi_api_key is 10 queries/second, but I still get empty responses and it takes a very long time to run, which is frustrating.
It would be nice if I could just batch download all this information from a database somewhere. Thank you for any suggestions!
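For the chunk-and-sleep approach described above, a rough sketch of a resumable loop might look like the following. It writes each chunk to its own file, retries a chunk when the response comes back empty, and skips chunks that already succeeded, so the script can be re-run over several days. The efetch call is the one from the post; the names fetch_batch and fetch_all, the directory layout, and failed_batches.txt are made up for illustration, and the defaults (5 IDs per chunk, 3 second delay, 5 tries) are only placeholders.

```shell
# Sketch only: resumable batch download with retries on empty output.
# fetch_batch, fetch_all and the file layout are hypothetical names.

# Fetch the gene tables for one batch of gene IDs (same call as in the post).
fetch_batch() {
    efetch -db gene -id "$1" -format gene_table
}

# fetch_all ID_FILE OUT_DIR [BATCH_SIZE] [DELAY_SECONDS] [MAX_TRIES]
# DELAY_SECONDS must be a whole number (it is used in shell arithmetic).
fetch_all() {
    local id_file=$1 out_dir=$2 batch_size=${3:-5} delay=${4:-3} max_tries=${5:-5}
    local batch out try
    mkdir -p "$out_dir/batches"
    # One batch of IDs per file: batches/ids_aaaa, batches/ids_aaab, ...
    split -l "$batch_size" -a 4 "$id_file" "$out_dir/batches/ids_"
    for batch in "$out_dir"/batches/ids_*; do
        out="$out_dir/$(basename "$batch").gene_table.txt"
        # Skip batches that already produced output on an earlier run.
        [ -s "$out" ] && continue
        try=1
        while [ "$try" -le "$max_tries" ]; do
            if fetch_batch "$(cat "$batch")" > "$out.tmp" && [ -s "$out.tmp" ]; then
                mv "$out.tmp" "$out"
                break
            fi
            # Empty or failed response: wait longer before each retry.
            sleep $((delay * try))
            try=$((try + 1))
        done
        rm -f "$out.tmp"
        # Record batches that never returned anything, for a later pass.
        [ -s "$out" ] || echo "$batch" >> "$out_dir/failed_batches.txt"
        sleep "$delay"
    done
}
```

It could be called as fetch_all blast_files/${first_protein_id}_orthologs_geneids.txt gene_tables, assuming the ID file has one gene ID per line. Re-running the same command with the same ID file only fetches the chunks that are still missing.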
While it may be frustrating, what you are trying to do (a million queries) is likely beyond the envisioned usage of this tool. As suggested by vkkodali_ncbi below, you could try to find the GTF/GFF files for the thousands of genomes involved, or reprocess the missing accessions over multiple days until you get them all. Either way, this is going to be painful.
Even with the infrastructure NCBI has it must take time to do these look-ups. You can try increasing the delay to 10 sec and see if that helps.
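If you go the GTF/GFF route mentioned above, the exon and intron lengths in the gene table can be recomputed from the exon coordinates. Below is a minimal sketch, not a drop-in replacement for the gene_table format. It assumes a GFF3 file in which exon rows carry Parent=rna-<accession> in column 9 (I believe NCBI RefSeq annotation files use this convention, but please check your files); the function name exon_table is made up. Exons are listed in ascending genomic order, so for minus-strand genes the order would be reversed relative to the transcript.

```shell
# Sketch only: exon interval, exon length and following intron length
# for one transcript, computed from a GFF3 file.
# exon_table GFF3_FILE TRANSCRIPT_ACCESSION
exon_table() {
    # Step 1: keep exon rows whose Parent is the requested transcript
    # (assumes the "rna-" ID prefix), printing start and end.
    awk -F'\t' -v tx="$2" '
        $3 == "exon" && index(";" $9 ";", ";Parent=rna-" tx ";") { print $4 "\t" $5 }
    ' "$1" | sort -k1,1n | awk -F'\t' '
        # Step 2: exon length = end - start + 1;
        # intron length = next exon start - this exon end - 1.
        NR > 1 { printf "%d-%d\t%d\t%d\n", s, e, e - s + 1, $1 - e - 1 }
        { s = $1; e = $2 }
        # The last exon has no following intron, so that column stays empty.
        END { if (NR > 0) printf "%d-%d\t%d\t\n", s, e, e - s + 1 }
    '
}
```

For example, exon_table genomic.gff XM_032429966.1 (where genomic.gff is a placeholder file name). With the first three exon intervals from the table in the question, this arithmetic gives exon lengths 112, 88, 66 and intron lengths 1647 and 2596, matching the table.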