Question

From scientific name to sequence

2.2 years ago

LaFra ▴ 10

Hi all,

I'm wondering if there is a way to get the sequence of a gene just from the scientific name of the species. More precisely, I have a list of more than 2000 plants and I need the trnL sequence for all of them in order to build a database. Is there a way to do this, for example by writing a query? I've seen that Entrez might be able to do it, but I'm far from being an expert in bioinformatics and I have no idea how to write the code! I would appreciate it if someone could help me.

Thanks a lot!

sequence Entrez scientific name • 1.3k views

ADD COMMENT • link updated 2.2 years ago by Michael 54k • written 2.2 years ago by LaFra ▴ 10

Yes, it is definitely possible, but the query is a bit more complex than one might think. trnL is a tRNA gene and thus has multiple copies with the same name. Try something like this:

esearch -db gene -query '(trnl[gene]) AND (Arabidopsis thaliana[orgn])' | esummary

ADD REPLY • link 2.2 years ago by Michael 54k

Wow, thanks a lot for your reply! But what about multiple names? I have a list of 2000; is it possible to supply a file with all the names instead of a single name?

ADD REPLY • link 2.2 years ago by LaFra ▴ 10

Yes, that is possible too. Put all the names in a text file, one name per line, and then use a little bash script. I can post code as soon as I figure out how to get the FASTA sequence from a gene entry. It is not as easy as it looks at first :)
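
Before looping over the list, it may be worth cleaning it, since name lists exported from spreadsheets often carry Windows line endings, stray spaces, blank lines, or repeated entries. A minimal sketch, assuming one name per line; the function name clean_names and the file names are made up for illustration:

```shell
#!/bin/sh
# Sketch only: clean_names is a hypothetical helper, not part of EDirect.

# Print the names in file "$1" with carriage returns removed,
# leading/trailing whitespace trimmed, blank lines dropped,
# and repeated names reduced to their first occurrence.
clean_names() {
    tr -d '\r' <"$1" |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' |
        awk 'NF && !seen[$0]++'
}

# Example use (file names are assumptions):
#   clean_names scientificname_list.txt > input.txt
```

Keeping the original order and only dropping repeats makes it easy to compare the cleaned list with the source spreadsheet.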

ADD REPLY • link 2.2 years ago by Michael 54k

Seemingly, the following query works for a single species:

esearch -db nuccore -query '(trnl[gene]) AND (Arabidopsis thaliana[orgn])' | efetch -format fasta

ADD REPLY • link 2.2 years ago by Michael 54k

That would be amazing! Thanks a lot, you are very kind. I'll wait for your reply :)

ADD REPLY • link 2.2 years ago by LaFra ▴ 10

Could it be:

IFS=$'\n'; for next in $(cat scientificname_list.txt); do esearch -db nuccore -query "(trnl[gene]) AND (${next}[orgn])" | efetch -format fasta; done

?

ADD REPLY • link 2.2 years ago by LaFra ▴ 10

score 0 · Answer 1 · 2022-02-25

2.2 years ago

Michael 54k

Try this script:

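#!/bin/sh

# -e: stop on the first error, -u: fail on unset variables, -x: echo each command
set -eux

# input.txt holds one scientific name per line
while IFS= read -r line; do
   # </dev/null keeps esearch from consuming the loop's stdin (the rest of input.txt)
   esearch -db nuccore -query "(trnl[gene]) AND (${line}[orgn])" </dev/null | efetch -format fasta >> output.fa
   sleep 1 # pause so the Entrez server does not throttle us
done <input.txt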
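
With 2000 species, one combined output.fa makes it hard to see which names returned nothing. A possible variation on the loop above writes one FASTA file per species and logs names without hits. This is a sketch under assumptions: the helper names (build_query, safe_name, fetch_species) and the log file no_hits.txt are invented for illustration, and only the esearch/efetch options already used in this thread appear.

```shell
#!/bin/sh
# Sketch only: build_query, safe_name, fetch_species and no_hits.txt
# are hypothetical names, not part of EDirect.

# Build the Entrez query string for one species name.
build_query() {
    printf '(trnl[gene]) AND (%s[orgn])' "$1"
}

# Turn a species name into a file-name-friendly string,
# e.g. spaces become underscores.
safe_name() {
    printf '%s' "$1" | tr -c 'A-Za-z0-9.-' '_'
}

# Fetch trnL records for one species into its own FASTA file.
# If nothing comes back, remove the empty file and log the name.
fetch_species() {
    out="$(safe_name "$1").fa"
    esearch -db nuccore -query "$(build_query "$1")" </dev/null |
        efetch -format fasta >"$out" || true
    if [ ! -s "$out" ]; then
        rm -f "$out"
        printf '%s\n' "$1" >>no_hits.txt
    fi
}

# Example driver (commented out so that sourcing this file has no side effects):
#   while IFS= read -r name; do
#       [ -n "$name" ] && fetch_species "$name"
#       sleep 1
#   done <input.txt
```

The per-species files can be concatenated afterwards with cat if a single database file is needed.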

ADD COMMENT • link 2.2 years ago by Michael 54k

For now it seems it is working, I'll let you know. You saved my life, thanks a lot =)

ADD REPLY • link 2.2 years ago by LaFra ▴ 10

Hi, as you said, I got multiple copies of the gene for the same sample. Is there a way to avoid this? Thank you.

ADD REPLY • link 2.2 years ago by LaFra ▴ 10

There are multiple copies of tRNA genes in most genomes. If you are building a database of tRNAs, it will have to contain all these copies. You could collapse identical sequences after downloading them using CD-HIT. In the end, it depends on your scientific question.
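
If only exact duplicates need to go, plain awk may be enough. The sketch below is not CD-HIT and does no similarity clustering: it keeps the first record for each distinct sequence, ignoring case and line wrapping. The function name collapse_identical and the file names are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: collapse_identical is a hypothetical helper, not CD-HIT.

# Read the FASTA file "$1" and print one record per distinct sequence.
# Wrapped sequence lines are joined before comparing, the comparison is
# case-insensitive, and the first header seen for a sequence is kept.
collapse_identical() {
    awk '
        function emit(    key) {
            key = toupper(seq)
            if (header != "" && !(key in seen)) {
                seen[key] = 1
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { gsub(/[ \t\r]/, ""); seq = seq $0 }
        END { emit() }
    ' "$1"
}

# Example use (file names are assumptions):
#   collapse_identical output.fa > output.unique.fa
```

Note that this collapses identical sequences across species as well, so run it per species if each species should keep its own copy.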

ADD REPLY • link 2.2 years ago by Michael 54k