Retrieve Large Numbers of NCBI Records
2.4 years ago
andorjkiss ▴ 40

I don't understand how to use E-utilities. I'm trying to download the records associated with the REGene DB. Is there a simple-to-use, GUI-based application for this? TIA

NCBI GenBank • 2.3k views
ADD COMMENT
2
Entering edit mode

What kind of records? Do you have IDs or gene names? Please post examples.

If you want a GUI-based alternative, you may want to check out NCBI Datasets. It may or may not give you access to exactly what you want.

ADD REPLY
0
Entering edit mode

Yes, I would like to build a local copy of the database that was used in this paper (https://www.nature.com/articles/srep23167). I have the GeneID, GeneSymb, GI, RefSeq and Organism for each entry; there are 8460 records. Apparently there's a way to download just these records (as FASTA) via NCBI E-utilities, but I can't figure it out (mainly the format of the commands).

My understanding is that I should be able to upload this list of GIs and then iteratively download the NCBI GenBank FASTA records in chunks of 500, using some combination of EPost and EFetch. However, I don't know how to structure the URLs and I'm unfamiliar with Perl.
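
For what it's worth, the URL structure asked about here can be sketched in plain shell with curl, no Perl needed. This is only a sketch under assumptions: the function names `efetch_url` and `fetch_posted_fasta` are made up for illustration, the input is a file with one GI per line, and the IDs are assumed to belong to the nuccore database. EPost uploads the list once and returns a WebEnv and QueryKey; EFetch then pages through the stored set with retstart and retmax.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical; db=nuccore is an assumption.
EUTILS="https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Build the EFetch URL for one window of a posted ID set.
# Args: WebEnv, query_key, retstart, retmax
efetch_url() {
  printf '%s/efetch.fcgi?db=nuccore&query_key=%s&WebEnv=%s&retstart=%s&retmax=%s&rettype=fasta&retmode=text\n' \
    "$EUTILS" "$2" "$1" "$3" "$4"
}

# Post all IDs (one per line in file $1), then fetch FASTA
# in windows of $2 records (default 500).
fetch_posted_fasta() {
  local ids=$1 size=${2:-500} xml webenv key total start=0
  # EPost: upload the comma-separated ID list via HTTP POST.
  xml=$(curl -s "$EUTILS/epost.fcgi" --data "db=nuccore" --data "id=$(paste -sd, "$ids")")
  # Pull WebEnv and QueryKey out of the XML reply.
  webenv=$(printf '%s\n' "$xml" | sed -n 's:.*<WebEnv>\(.*\)</WebEnv>.*:\1:p')
  key=$(printf '%s\n' "$xml" | sed -n 's:.*<QueryKey>\(.*\)</QueryKey>.*:\1:p')
  [ -n "$webenv" ] && [ -n "$key" ] || { echo "EPost failed" >&2; return 1; }
  total=$(grep -c . "$ids")   # number of non-empty lines, i.e. IDs posted
  # EFetch: page through the stored set one window at a time.
  while [ "$start" -lt "$total" ]; do
    curl -s "$(efetch_url "$webenv" "$key" "$start" "$size")"
    start=$((start + size))
    sleep 1   # pause between requests to stay under NCBI's rate limit
  done
}
```

With a hypothetical `gi_list.txt`, usage would be `fetch_posted_fasta gi_list.txt > regene.fasta`. Retired or invalid GIs may simply be missing from the output, so it is worth comparing the number of `>` headers against the number of IDs.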

Example of the first 10 rows of the file:

[Image: first 10 rows of the file]

ADD REPLY
0
Entering edit mode

I'll try this...

ADD REPLY
0
Entering edit mode

It looks as if this (NCBI Datasets) worked: I've requested a zipped dataset download; we'll see if that completes. If not, I'll try your commands below. One would have to install EDirect via the bash script in order to use it in the terminal, correct?

  • Thanks
ADD REPLY
0
Entering edit mode

You can install entrez-direct using conda.

ADD REPLY
0
Entering edit mode

Doesn't work. NCBI DATASETs barfs when you try to download the entire dataset.

ADD REPLY
0
Entering edit mode

Retrieve in smaller batches, then concatenate (cat) the results locally after download.
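
A rough sketch of what batching could look like with the command-line datasets tool. The function names and the `batch_` file prefix are made up for illustration, the datasets flags are the ones posted elsewhere in this thread (they may differ in other versions of the CLI), and the path of the FASTA file inside each zip is an assumption to verify against a real download.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the batch_ prefix are hypothetical.

# Split a one-ID-per-line file ($1) into batches of $2 lines (default 500)
# and request one zip per batch with the datasets CLI.
download_in_batches() {
  local ids=$1 size=${2:-500} part
  split -l "$size" "$ids" batch_
  for part in batch_??; do
    # Flags copied from the datasets command posted in this thread.
    datasets download gene gene-id --inputfile "$part" \
      --exclude-protein --exclude-rna --filename "$part.zip" \
      || echo "failed: $part" >&2
  done
}

# Concatenate the gene FASTA from every batch zip into one file ($1).
# Assumption: each package stores it at ncbi_dataset/data/gene.fna.
merge_batches() {
  local zip
  for zip in batch_??.zip; do
    unzip -p "$zip" ncbi_dataset/data/gene.fna
  done > "$1"
}
```

Usage would be something like `download_in_batches regen.txt 500` followed by `merge_batches regene_genes.fna`, run in an otherwise empty directory so that stale `batch_` files from an earlier run are not picked up. Any batch reported as failed can be retried on its own.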

ADD REPLY
0
Entering edit mode

Can you describe the command and error you're getting? Thanks!

ADD REPLY
1
Entering edit mode
2.4 years ago
GenoMax 141k

Using EntrezDirect:

$ more id2
468509
711526
480169
12558
30291

$ for i in `cat id2`; do esearch -db gene -query "${i}[GeneID] AND ALIVE [PROP]" | elink -db gene -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta >> gene_seq; done

The file gene_seq will contain the sequences.

Note: There may be more than one sequence per gene ID even though your table has only one row.

$ for i in `cat id2`; do esearch -db gene -query "${i}[GeneID] AND ALIVE [PROP]" | elink -db gene -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta | grep ">"; done
>XM_016933484.2 PREDICTED: Pan troglodytes cadherin 2 (CDH2), transcript variant X2, mRNA
>XM_523898.6 PREDICTED: Pan troglodytes cadherin 2 (CDH2), transcript variant X1, mRNA
>XM_028838057.1 PREDICTED: Macaca mulatta cadherin 2 (CDH2), transcript variant X3, mRNA
>XM_028838055.1 PREDICTED: Macaca mulatta cadherin 2 (CDH2), transcript variant X2, mRNA
>XM_015121712.2 PREDICTED: Macaca mulatta cadherin 2 (CDH2), transcript variant X1, mRNA
>NM_001287156.2 Canis lupus familiaris cadherin 2 (CDH2), mRNA
>NM_007664.5 Mus musculus cadherin 2 (Cdh2), mRNA
>XM_006525553.2 PREDICTED: Mus musculus cadherin 2 (Cdh2), transcript variant X1, mRNA
>NM_131081.2 Danio rerio cadherin 2, type 1, N-cadherin (neuronal) (cdh2), mRNA
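
Since one gene ID can return several transcripts, it may help to summarize what came back before using the file. A small sketch follows; the function names are made up for illustration, and it assumes the output is in the gene_seq file from the loop above. In RefSeq, NM_ accessions are curated mRNAs while XM_ accessions are predicted models, as the PREDICTED headers above show.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# Count FASTA records in file $1 by RefSeq accession prefix (NM_, XM_, ...).
summarize_headers() {
  grep '^>' "$1" | sed -E 's/^>([A-Z]+_).*/\1/' | sort | uniq -c
}

# Print only the records of FASTA file $1 whose accession starts with
# prefix $2 (header line plus the sequence lines that follow it).
filter_fasta_by_prefix() {
  awk -v p=">$2" '/^>/ { keep = (index($0, p) == 1) } keep' "$1"
}
```

For example, `summarize_headers gene_seq` gives a quick count per prefix, and `filter_fasta_by_prefix gene_seq NM_ > gene_seq_curated` keeps only the curated transcripts. Whether dropping the XM_ records is appropriate depends on what the downstream analysis needs.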
ADD COMMENT
0
Entering edit mode
$ for i in `cat regen.csv`; do esearch -db gene -query "${i}[GeneID] AND ALIVE [PROP]" | elink -db gene -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta >> gene_seq; done
curl: (3) URL using bad/illegal format or missing URL
 ERROR:  curl command failed ( Tue 23 Nov 2021 11:37:57 AM EST ) with: 3
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?query_key=1&WebEnv=MCID_619d18e5a83b5f73a76527d4&retstart=0&retmax=1&db=gene&rettype=uilist&retmode=text&api_key=ca78f0a08d593f73292dbfbd65c103e96b08&tool=edirect&edirect=16.2&edirect_os=Linux&email=
 WARNING:  FAILURE ( Tue 23 Nov 2021 11:37:56 AM EST )
nquire -get https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ esearch.fcgi -query_key 1 -WebEnv MCID_619d18e5a83b5f73a76527d4 -retstart 0 -retmax 1 -db gene -rettype uilist -retmode text -api_key ca78f0a08d593f73292dbfbd65c103e96b08 -tool edirect -edirect 16.2 -edirect_os Linux -email 
EMPTY RESULT
SECOND ATTEMPT
ADD REPLY
0
Entering edit mode

You need to have one ID per line. It looks like you tried to use a comma-delimited file?
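
If the IDs are sitting in a CSV, something along these lines should get them into a one-per-line file. This is a sketch with assumptions: the function name `csv_to_ids` is made up, the GeneID is assumed to be in the first column unless a column number is given, and the file is assumed to be a simple CSV without quoted fields that contain commas.

```shell
#!/usr/bin/env bash
# Sketch only: csv_to_ids is a hypothetical helper name.

# Print the numeric IDs from column $2 (default 1) of CSV file $1,
# one per line, de-duplicated. Header rows and blanks are dropped
# because only all-digit values pass the grep.
csv_to_ids() {
  tr -d '\r' < "$1" |          # strip Windows line endings
    cut -d, -f"${2:-1}" |      # keep the chosen column
    grep -E '^[0-9]+$' |       # keep purely numeric IDs
    sort -n -u                 # numeric sort, unique
}
```

Then `csv_to_ids regen.csv > id2` produces a file that the `for i in` loop can read directly.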

ADD REPLY
1
Entering edit mode
2.4 years ago
MirianT_NCBI ▴ 720

Based on your post, I got the lists of genes from here: http://regene.bioinfo-minzhao.org/download.cgi
I downloaded both lists and created a single list of gene-ids using this command:

cut -f1 -- *.txt | grep "^[0-9]" | sort -u > regen.txt

From here, you can use this list to retrieve the gene sequences using datasets. You can install datasets using conda:

conda install -c conda-forge ncbi-datasets-cli

To download the genes, you can type:
datasets download gene gene-id --inputfile regen.txt --exclude-protein --exclude-rna --filename regen.zip

This command downloads only the gene FASTA file; the --exclude-protein and --exclude-rna flags drop the protein and RNA sequences that the data package includes by default.

Another option is to use the NCBI Datasets web interface (NCBI Datasets Gene).
You can upload a list of genes (like the one created using the first command) or enter your list manually.

When you get to the gene table, select all genes by clicking the box to the left of Gene ID; you can then download the dataset.

Let me know if you have any questions. :)

ADD COMMENT
0
Entering edit mode

I don't think the problem is with the gene datasets tool itself. The OP is trying to download a large list of IDs, and the online Datasets interface appears to fail with a list that large.

ADD REPLY
