Question

NCBI Gene Table Bulk Download

7 months ago

fafad046 • 0

Does anyone know where I could bulk download all the NCBI gene tables?

I looked on the NCBI FTP site but couldn't find them. I have been retrieving them via E-utilities, as in the example below, but the rate limit is very strict and I keep getting restricted.

Thank you so much!

For reference, a gene table looks like this:

Reference GSC_HSeal_1.0 Primary Assembly NW_022589744.1  from: 1024297 to: 1169901
mRNA  XM_032429966.1, 41 exons,  total annotated spliced exon length: 4699
protein  XP_032285857.1, 38 coding  exons,  annotated AA length: 1448

Exon table for  mRNA  XM_032429966.1 and protein XP_032285857.1
Genomic Interval Exon   Genomic Interval Coding   Gene Interval Exon   Gene Interval Coding   Exon Length   Coding Length   Intron Length
-----------------------------------------------------------------------------------------------------------------------------------------
1024297-1024408                                   1-112                                       112                           1647
1026056-1026143                                   1760-1847                                   88                            2596
1028740-1028805                                   4444-4509                                   66                            1888
1030694-1030809         1030780-1030809           6398-6513            6484-6513              116           30              3578
1034388-1034548         1034388-1034548           10092-10252          10092-10252            161           161             9654
1044203-1044317         1044203-1044317           19907-20021          19907-20021            115           115             2382
1046700-1046830         1046700-1046830           22404-22534          22404-22534            131           131             884
1047715-1047797         1047715-1047797           23419-23501          23419-23501            83            83              1778
1049576-1049719         1049576-1049719           25280-25423          25280-25423            144           144             13887
1063607-1063750         1063607-1063750           39311-39454          39311-39454            144           144             5731
1069482-1069553         1069482-1069553           45186-45257          45186-45257            72            72              7716
1077270-1077764         1077270-1077764           52974-53468          52974-53468            495           495             2040
1079805-1079942         1079805-1079942           55509-55646          55509-55646            138           138             1245
1081188-1081259         1081188-1081259           56892-56963          56892-56963            72            72              4868
1086128-1086271         1086128-1086271           61832-61975          61832-61975            144           144             3422
1089694-1089768         1089694-1089768           65398-65472          65398-65472            75            75              2641
1092410-1092553         1092410-1092553           68114-68257          68114-68257            144           144             2442
1094996-1095067         1094996-1095067           70700-70771          70700-70771            72            72              3131
1098199-1098342         1098199-1098342           73903-74046          73903-74046            144           144             586
1098929-1099000         1098929-1099000           74633-74704          74633-74704            72            72              4876
1103877-1103921         1103877-1103921           79581-79625          79581-79625            45            45              3898
1107820-1107969         1107820-1107969           83524-83673          83524-83673            150           150             2696
1110666-1110737         1110666-1110737           86370-86441          86370-86441            72            72              8609
1119347-1119490         1119347-1119490           95051-95194          95051-95194            144           144             3019

For now, I retrieve the gene tables using this command:

efetch -db gene -id "$(cat blast_files/${first_protein_id}_orthologs_geneids.txt)" -format gene_table > ${first_protein_id}_gene_table.txt

ncbi gene_table • 995 views

ADD COMMENT • link updated 7 months ago by GenoMax 142k • written 7 months ago by fafad046 • 0

"very strict rate limit and I kept getting restricted..."

Are you looking up millions of entries? If not, you should not be hitting the limit. As I said in a past answer, you should add a delay (with sleep or something else) so that there are pauses between sets of queries.

ADD REPLY • link 7 months ago by GenoMax 142k

Yes, I have close to a million entries. I am currently splitting them into chunks of 5 and sleeping for 3 seconds between chunks. I checked, and the rate limit with an NCBI API key is 10 queries/second, but I still get empty responses and the run takes a very long time, which is frustrating.

It would be nice if I could just batch download all this information from a database somewhere. Thank you for any suggestions!

#!/bin/bash

# Set your email and API key
export NCBI_API_KEY="xxxxxx"
export NCBI_EMAIL="xxx@gmail.com"

chunk_size=5

for gene_id_file in *_orthologs_geneids.txt; do
    # Extract protein ID from filename
    first_protein_id=$(basename "$gene_id_file" "_orthologs_geneids.txt")

    echo "Processing $first_protein_id ..."

    # If you want to reset the gene table for each protein ID, uncomment the next line:
    # > ${first_protein_id}_gene_table.txt

    # Split the input file into smaller chunks
    split -l "$chunk_size" "$gene_id_file" "${first_protein_id}_gene_ids_chunk_"

    # Process each chunk
    for chunk in "${first_protein_id}"_gene_ids_chunk_*; do
        # Join the IDs in the chunk into one comma-separated list for -id
        ids=$(paste -s -d, "$chunk")
        efetch -db gene -id "$ids" -format gene_table -email "$NCBI_EMAIL" >> "${first_protein_id}_gene_table.txt"
        sleep 3  # Wait for 3 seconds before the next request
    done

    # Cleanup the chunks
    rm -- "${first_protein_id}"_gene_ids_chunk_*
done

echo "All files processed."

ADD REPLY • link 7 months ago by fafad046 • 0

While it may be frustrating, what you are trying to do (about a million queries) is likely beyond the envisioned usage of this tool. As suggested by vkkodali_ncbi below, you could try to find the GTF/GFF files for thousands of genomes, or reprocess the missing accessions over multiple days until you get them all. Either way, this is going to be painful.

Even with the infrastructure NCBI has, these look-ups must take time. You can try increasing the delay to 10 sec and see if that helps.
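
One way to combine a longer delay with reprocessing is to retry only the requests that come back empty, waiting longer each time. The sketch below is a hypothetical shell helper (fetch_with_retry is not part of EDirect) that treats a non-zero exit status or empty output as a failure and doubles the pause between attempts. The number of tries, the starting delay, and the failed_chunks.txt file name in the usage comment are all arbitrary assumptions to adjust.

```bash
#!/bin/bash

# fetch_with_retry MAX_TRIES BASE_DELAY CMD [ARGS...]
# Runs CMD and prints its output. A non-zero exit status or empty output
# counts as a failure; the command is then retried after a pause that
# doubles on every attempt (BASE_DELAY, 2x, 4x, ...).
# Hypothetical helper for illustration, not part of EDirect.
fetch_with_retry() {
    local max_tries=$1
    local delay=$2
    shift 2
    local attempt=1
    local output
    while [ "$attempt" -le "$max_tries" ]; do
        if output=$("$@") && [ -n "$output" ]; then
            printf '%s\n' "$output"
            return 0
        fi
        if [ "$attempt" -lt "$max_tries" ]; then
            echo "attempt $attempt/$max_tries returned nothing; retrying in ${delay}s" >&2
            sleep "$delay"
            delay=$(( delay * 2 ))
        fi
        attempt=$(( attempt + 1 ))
    done
    return 1   # every attempt failed; the caller decides what to do
}

# Possible use inside the chunk loop (file names are placeholders):
#   ids=$(paste -s -d, "$chunk")
#   fetch_with_retry 4 10 efetch -db gene -id "$ids" -format gene_table \
#       >> gene_table.txt || cat "$chunk" >> failed_chunks.txt
```

Chunks that still fail after the last attempt end up in a separate list, so they can be rerun on another day without repeating the ones that already succeeded.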

ADD REPLY • link 7 months ago by GenoMax 142k

score 0 · Answer 1 · 2023-09-09

7 months ago

Jiyao Wang ▴ 370

You can try NCBI datasets: https://www.ncbi.nlm.nih.gov/datasets/

ADD COMMENT • link 7 months ago by Jiyao Wang ▴ 370

Can you give an example of a datasets command-line call that generates output like the above? Are datasets lookups not limited to a number of queries per unit time?

ADD REPLY • link 7 months ago by GenoMax 142k

You cannot get the exact table using datasets. But depending on what you are looking for, you may be able to find some of the info by doing the following:

$ datasets summary gene  gene-id 5768 --report product --as-json-lines \
  | dataformat tsv gene-product --fields transcript-accession,transcript-genomic-location-exon-order,transcript-genomic-location-accession,transcript-genomic-location-range-start,transcript-genomic-location-range-stop,transcript-genomic-location-range-orientation \
  | head -n 5 

Transcript Accession   Transcript Genomic Exons Order   Transcript Genomic Accession    Transcript Genomic Start   Transcript Genomic Stop   Transcript Genomic Orientation
--------------------   ------------------------------   -----------------------------   ------------------------   -----------------------   ------------------------------
NM_002826.5            1                                NC_000001.11                    180154869                  180204030                 plus                          
NM_002826.5            2                                NC_000001.11                    180154869                  180204030                 plus                          
NM_002826.5            3                                NC_000001.11                    180154869                  180204030                 plus

There are additional fields that you can provide to dataformat. You can find them here:

dataformat tsv gene-product --help

Finally, you can use EntrezDirect with an API key if you run into throttling. You can read more about it here

ADD REPLY • link 7 months ago by vkkodali_ncbi ★ 3.7k

Thank you for the help, vkkodali_ncbi! I specifically need the exon/coding positions for each protein so that I can deduce the intron positions in the protein sequence.

Any suggestions on where I could batch download every protein's exon/coding positions would be extremely helpful. Thanks again!
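
To make the goal concrete, here is a rough sketch of how intron positions in the protein could be derived once the coding lengths are available. It assumes the input is the "Coding Length" column of one transcript, one value per line, in transcript order (5' to 3'), with the first value starting at the start codon. The function name intron_positions and the output layout are made up for illustration. The phase is taken as the number of bases of a split codon that lie before the intron.

```bash
#!/bin/bash

# intron_positions
# Reads coding segment lengths (one per line, transcript order) from stdin.
# For each intron, prints four tab-separated fields:
#   intron number, coding bases before it, complete codons before it, phase
# Phase 0 means the intron falls between two codons; phase 1 or 2 means it
# interrupts the next codon after that many bases.
intron_positions() {
    awk '
        NF == 0 { next }             # ignore blank lines
        {
            if (seen > 0)            # an intron follows every segment but the last
                printf "%d\t%d\t%d\t%d\n", seen, total, int(total / 3), total % 3
            total += $1
            seen++
        }
    '
}

# With the first three coding lengths from the table in the question:
#   printf '30\n161\n115\n' | intron_positions
# the first intron follows 30 coding bases (10 codons, phase 0) and the
# second follows 191 coding bases (63 codons, phase 2).
```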

ADD REPLY • link 7 months ago by fafad046 • 0

The datasets output does have the exon ranges in genomic coordinates. However, it does not include the same information for CDS ranges.

"Any suggestions on where I could batch download every protein's exon/coding positions would be extremely helpful, thanks again!"

Parsing the annotation GFF3 files or using EntrezDirect are the only two options I can think of. For parsing GFF3s, a program like agat is an excellent choice. Alternatively, you can use something similar to this.
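
As a minimal sketch of the GFF3 route, the awk wrapper below pulls the exon and CDS rows out of an annotation file. It assumes the standard nine tab-separated GFF3 columns and that each exon and CDS row carries a Parent= attribute naming its transcript; the function name extract_exon_cds and the output column order are invented here. It does not handle features with several comma-separated parents, so a dedicated tool such as agat is likely the safer choice for real data.

```bash
#!/bin/bash

# extract_exon_cds GFF3_FILE
# Prints one tab-separated line per exon or CDS feature:
#   parent  type  seqid  start  end  strand
extract_exon_cds() {
    awk '
        BEGIN { FS = OFS = "\t" }
        /^#/ { next }                          # skip directives and comments
        $3 == "exon" || $3 == "CDS" {
            parent = ""
            n = split($9, attrs, ";")          # column 9 holds key=value pairs
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^Parent=/)
                    parent = substr(attrs[i], 8)   # text after "Parent="
            if (parent != "")
                print parent, $3, $1, $4, $5, $7
        }
    ' "$1"
}

# Example (uncompressed GFF3; the file name is a placeholder):
#   extract_exon_cds genomic.gff > exon_cds.tsv
```

Sorting the result by parent and start coordinate gives per-transcript exon and CDS intervals, from which the exon, coding, and intron lengths can be computed.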

ADD REPLY • link 7 months ago by vkkodali_ncbi ★ 3.7k