Is there a faster way to download gene sequences from NCBI via E-utils?
25 days ago

Hello,

I am trying to download gene sequences from NCBI via E-utils like this:

esearch -db gene -query "1496[Taxonomy ID] AND proC[Gene Name] AND alive[prop]" | elink -db gene -target nuccore | efetch -db nuccore -format gene_fasta > proC_1496_all.fasta

./fasta-unfold.sh proC_1496_all.fasta | grep -E -A 1 "\[gene=proC\]" > proC_1496.fasta

Here fasta-unfold.sh is my own script; it simply rewrites the FASTA file so that each record is one header line followed by one sequence line. I would like to download the sequence of the proC gene for a particular species. Unfortunately, there are more than 100 matching records in the nucleotide database, and the download takes a long time.
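
For readers without such a script, here is a minimal sketch of what an unfolding step might look like. The function name fasta_unfold is hypothetical and this is not the actual fasta-unfold.sh; it only assumes plain FASTA input with headers starting with ">".

```shell
# Hypothetical stand-in for fasta-unfold.sh: join wrapped sequence lines
# so each record is exactly one header line followed by one sequence line.
fasta_unfold() {
  awk '
    /^>/ {                      # a header line starts a new record
      if (seq != "") print seq  # flush the previous sequence, if any
      print                     # print the header unchanged
      seq = ""
      next
    }
    { seq = seq $0 }            # append wrapped sequence lines
    END { if (seq != "") print seq }
  ' "$@"
}
```

It reads from a file argument or from standard input, for example: fasta_unfold proC_1496_all.fasta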

After some basic comparisons (such as the one shown below), it turns out that only 8 of the more than 100 sequences are unique.

grep -E -A 1 "\[gene=proC\]" proC_1496.fasta | grep -v '>' | sort -u | wc -l
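
To go from counting to actually keeping one record per distinct sequence, something like the following sketch could be used. The function name fasta_unique is hypothetical; it assumes the input is already unfolded (one header line, then one sequence line) and that sequences are compared as exact, case-sensitive strings.

```shell
# Keep the first record for each distinct sequence in an unfolded FASTA
# (one header line followed by one sequence line per record).
fasta_unique() {
  awk '
    /^>/   { header = $0; next }   # remember the most recent header
    /^--$/ { next }                # skip group separators left by grep -A
    !seen[$0]++ {                  # first time this sequence is seen
      print header
      print $0
    }
  ' "$@"
}
```

For example: fasta_unique proC_1496.fasta > proC_1496_unique.fasta (the output file name is only an illustration).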

I thought it might be possible to download only the sequence at the gene's coordinates, but the output of

esearch -db gene -query "1496[Taxonomy ID] AND proC[Gene Name] AND alive[prop]" | esummary

does not contain the sequence ID or the start and stop coordinates of the gene on any sequence.

Thank you in advance for any suggestions.