Question

Is there a faster way to download gene sequences from NCBI via E-utils?

3.0 years ago

lmlukoseviciute ▴ 60

Hello,

I am trying to download gene sequences from NCBI via E-utils like this:

esearch -db gene -q "1496[Taxonomy ID] AND proC[Gene Name] AND alive[prop]"  | elink -db gene -target nuccore | efetch -db nuccore -format gene_fasta > proC_1496_all.fasta
./fasta-unfold.sh proC_1496_all.fasta | egrep -A 1 "\[gene=proC\]" > proC_1496.fasta

Here fasta-unfold.sh is my script that reformats the FASTA file so that each record takes exactly two lines: the header on one line and the full sequence on the next. I would like to download the sequence of the proC gene for a particular species. Unfortunately, there are more than 100 records in the Nucleotide database, and the file takes a long time to download.
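For readers who want to reproduce the step, here is a minimal sketch of what a script like fasta-unfold.sh might do, assuming a POSIX shell with awk. The function name `fasta_unfold` is hypothetical, and this is an illustration, not necessarily the exact script used above.

```shell
#!/bin/sh
# Hypothetical stand-in for fasta-unfold.sh (an assumption, not the original script).
# Joins wrapped sequence lines so that every record is exactly two lines:
# the header, then the whole sequence.
fasta_unfold() {
    awk '
        /^>/ {
            # A new header starts: flush the sequence collected so far.
            if (seq != "") print seq
            print
            seq = ""
            next
        }
        # Any other line is part of the current sequence.
        { seq = seq $0 }
        # Flush the last record at end of input.
        END { if (seq != "") print seq }
    ' "$@"
}
```

It reads the files named as arguments, or standard input if none are given, so it could be used as `fasta_unfold proC_1496_all.fasta | grep -A 1 -F '[gene=proC]'`.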

After doing some basic comparisons (as shown below), it turns out that only 8 sequences (out of more than 100) are unique.

cat proC_1496.fasta | egrep -A 1 "\[gene=proC\]" | grep -v '>' | sort -u | wc -l
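As a related sketch, the two-line FASTA file could be reduced to one record per distinct sequence, keeping the first header seen for each. This assumes awk is available and that the input is already unfolded; the function name `fasta_unique` is hypothetical. GNU grep writes `--` between non-adjacent match groups when `-A` is used, so such lines are skipped here instead of being treated as sequences.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of the original commands).
# Input: two-line FASTA (header line, then sequence line).
# Output: the first record for each distinct sequence.
fasta_unique() {
    awk '
        # Skip blank lines and the "--" group separators written by grep -A.
        $0 == "" || $0 == "--" { next }
        # Remember the most recent header.
        /^>/ { header = $0; next }
        # Print a record only the first time its sequence is seen.
        !seen[$0]++ { print header; print $0 }
    ' "$@"
}
```

The number of distinct sequences could then be counted with `fasta_unique proC_1496.fasta | grep -c '^>'`.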

I thought it might be possible to download only the sequence by coordinates, but

esearch -db gene -q "1496[Taxonomy ID] AND proC[Gene Name] AND alive[prop]"  | esummary

does not contain the sequence ID or the start and stop coordinates of the gene in any sequence.

So I wonder whether there is a way to download only the sequence of a particular gene using E-utils, without downloading all related sequences from the Nucleotide database.

Thank you in advance for any suggestions.

NCBI E-utils Unix • 625 views

updated 13 months ago by Ram 43k • written 3.0 years ago by lmlukoseviciute ▴ 60