From scientific name to sequence
2.2 years ago
LaFra ▴ 10

Hi all,

I'm wondering whether there is a way to get the sequence of a gene from just the scientific name of the species. More precisely, I have a list of more than 2000 plants and I need the trnL sequence for all of them in order to build a database. Is there a way to do this, for example by writing a query? I've seen that Entrez might be able to do it, but I'm far from being an expert in bioinformatics and I have no idea how to write the code. I would appreciate it if someone could help me.

Thanks a lot!

sequence Entrez scientific name • 1.3k views

Yes, it is definitely possible, but the query is a bit more complex than one might think. trnL is a tRNA gene and thus has multiple copies with the same name. Try something like this:

esearch -db gene -query '(trnl[gene]) AND (Arabidopsis thaliana[orgn])' | esummary
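
If the same query has to be generated for many species, it may help to build the query string in one place. This is only a minimal sketch: `build_query` is a hypothetical helper name (it is not part of EDirect), and it just prints the string that would be passed to `esearch -query`.

```shell
#!/bin/sh
# build_query is a hypothetical helper, not part of EDirect.
# It prints an Entrez query that restricts a gene name to one organism.
build_query() {
    gene=$1
    species=$2
    printf '(%s[gene]) AND (%s[orgn])\n' "$gene" "$species"
}

# Print the query for one species; the string could be given to esearch -query.
build_query trnl "Arabidopsis thaliana"
```

Keeping the query in a function makes it easier to change the gene name or the field tags later without touching the rest of a script.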

Wow, thanks a lot for your reply! But how do I handle multiple names? I have a list of 2000; is it possible to supply a file with all the names instead of a single name?

Yes, that is possible too. Put all the names in a text file, one name per line, and then use a small bash script. I can post code as soon as I figure out how to get the FASTA sequence from a gene entry. It is not as easy as it looks at first :)
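
A list exported from a spreadsheet often carries Windows line endings, stray spaces, or repeated names, so it may be worth tidying the file first. The following is a rough sketch under that assumption; `clean_names` is a hypothetical helper name and the sample names are made up for the example.

```shell
#!/bin/sh
# clean_names is a hypothetical helper: it reads species names on stdin,
# strips carriage returns and surrounding whitespace, drops empty lines,
# and removes repeated names while keeping the original order.
clean_names() {
    tr -d '\r' |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' |
        awk 'NF && !seen[$0]++'
}

# Example input: one CRLF line ending, one blank line, one padded name, one duplicate.
printf 'Arabidopsis thaliana\r\n\n  Zea mays \nArabidopsis thaliana\n' | clean_names
```

On a real list this could be run as `clean_names <names_raw.txt >input.txt`, where both file names are placeholders.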

The following query seems to work for a single species:

esearch -db nuccore -query '(trnl[gene]) AND (Arabidopsis thaliana[orgn])' | efetch -format fasta

That would be amazing! Thanks a lot, you are very kind. I'll wait for your reply :)

Could it be:

IFS=$'\n'; for next in $(cat scientificname_list.txt); do esearch -db nuccore -query "(trnl[gene]) AND (${next}[orgn])" | efetch -db nucleotide -format fasta; done

?

2.2 years ago
Michael 54k

Try this script:

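#!/bin/sh

set -eux

# Read one species name per line from input.txt
while IFS= read -r line; do
   # </dev/null keeps esearch from reading the loop's input (input.txt)
   esearch -db nuccore -query "(trnl[gene]) AND (${line}[orgn])" </dev/null | efetch -format fasta >> output.fa
   sleep 1 # added to avoid being throttled by the Entrez server
done <input.txt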
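
With 2000 names, some species will probably return nothing, and it can be useful to know which ones. Below is a rough variant of the loop that records those names in a separate file. It is a sketch, not a tested pipeline: `fetch_trnl`, `fetch_all`, `FETCH`, `DELAY`, and `missing.txt` are names invented for this example, and it assumes that a species without a matching record produces empty output.

```shell
#!/bin/sh
# Sketch only: fetch_trnl and fetch_all are hypothetical helper names.

# Fetch FASTA records for one species with the same query as the script above.
fetch_trnl() {
    esearch -db nuccore -query "(trnl[gene]) AND (${1}[orgn])" </dev/null |
        efetch -format fasta
}

# Read names from file $1, append sequences to $2, list names without hits in $3.
# FETCH names the command run per species (default: fetch_trnl).
# DELAY is the pause in seconds between requests (default: 1).
fetch_all() {
    names=$1
    out=$2
    missing=$3
    : >"$out"
    : >"$missing"
    while IFS= read -r species || [ -n "$species" ]; do
        [ -n "$species" ] || continue
        # Assumption: no matching record means empty output.
        seqs=$("${FETCH:-fetch_trnl}" "$species" </dev/null) || seqs=
        if [ -n "$seqs" ]; then
            printf '%s\n' "$seqs" >>"$out"
        else
            printf '%s\n' "$species" >>"$missing"
        fi
        sleep "${DELAY:-1}"
    done <"$names"
}

# Run only when a names file is given as the first argument.
if [ "$#" -ge 1 ]; then
    fetch_all "$1" output.fa missing.txt
fi
```

Saved as a script, it would be called with the names file as its only argument. The `FETCH` variable is there so the loop can be tried with a stand-in command before sending 2000 requests to the server.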

For now it seems to be working; I'll let you know. You saved my life, thanks a lot =)

Hi, as you said, I got multiple copies of the gene for the same sample. Is there a way to avoid this? Thank you.

There are multiple copies of tRNA genes in most genomes. If you are building a database of tRNAs, it will have to contain all these copies. You could collapse identical sequences after downloading them using CD-HIT. In the end, it depends on your scientific question.
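
If only exact duplicates need to be removed, a plain awk filter may be enough. To be clear, this is a swapped-in technique and not CD-HIT: the sketch below keeps the first record for each distinct sequence and does no clustering of similar sequences. `dedupe_fasta` is a hypothetical helper name and the sample records are made up.

```shell
#!/bin/sh
# dedupe_fasta is a hypothetical helper. It reads FASTA on stdin, joins wrapped
# sequence lines, and prints only the first record for each distinct sequence
# (compared case-insensitively). Exact-duplicate removal only, no clustering.
dedupe_fasta() {
    awk '
        # Print the pending record unless its sequence was already seen.
        function emit(    key) {
            key = toupper(seq)
            if (header != "" && !(key in seen)) {
                seen[key] = 1
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    '
}

# Example: records a and b carry the same sequence, so only a and c are printed.
printf '>a\nACGT\nAC\n>b\nacgtac\n>c\nTTTT\n' | dedupe_fasta
```

On the downloaded file this could be run as `dedupe_fasta <output.fa >output_unique.fa`, where the second file name is a placeholder. Note that the header of a dropped duplicate is lost, so the species it came from is no longer visible in the result. For collapsing near-identical sequences as well, CD-HIT (its nucleotide program is `cd-hit-est`) would be the tool to look at.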
