Obtaining sequences of a particular gene under a particular taxonomy (using E-utilities)
9.8 years ago
cjb60 • 0

Hi there,

I'm having a bit of trouble working out how to construct a particular pipeline using the Entrez E-utilities. Specifically, I want to run a search in the Taxonomy database, e.g.

Insecta[ORGN] AND genus[RANK]

Then, for each ID returned by that search, find all of its linked IDs in the Nucleotide database, and filter those Nucleotide IDs with a Nucleotide query:

cox1[gene]

Then get the FASTA sequences. And I would like to preserve the mapping between the IDs, so ideally I would get something like this:

tax_id_1 --> nucleotide_id_1 --> fasta sequence
tax_id_2 --> nucleotide_id_2 --> fasta sequence
...
tax_id_n --> nucleotide_id_n --> fasta sequence

(Where tax_id_1 and tax_id_2 are IDs from the Taxonomy query, nucleotide_id_1 is the COX1 gene sequence for tax_id_1, nucleotide_id_2 is the COX1 gene sequence for tax_id_2, etc.)

I started out doing this in Python, then switched to building the E-utilities URLs in the browser (just to keep things simple). I have used ESearch to handle the Taxonomy query and ELink to map from the Taxonomy IDs to the Nucleotide IDs (preserving the correspondence between each Taxonomy ID and its Nucleotide IDs), but I'm stuck on how to then filter those Nucleotide IDs so that I only get COX1 as the gene. I have tried this previously with varying degrees of success, and even where I managed to pull it off, I probably did it in a very inelegant way!

How would you go about doing something like this?

Cheers

sequence gene • 2.2k views

Thanks, both of you. I did it the other way around, as per your suggestions, starting from the Nucleotide database and working toward Taxonomy, and I think I now have what I wanted (with 1000x less hassle). I'm still curious whether it's possible to use the E-utilities to filter a list of IDs based on a query (see my reply below scapella's response), but at this stage it's not a big deal at all.

9.8 years ago
scapella ▴ 390

Hi there,

Have you considered looking first for all entries annotated as "COX1" (or any other term you are interested in) and only then filtering by taxonomy? When you retrieve data from NCBI, especially with the Biopython package, each record carries information about its taxonomy. My guess is that there are far fewer COX1 gene entries than there are species under a broad name such as Insecta.

Hope this may give you some hints about how to proceed.

S
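The point that each record carries its own taxonomy can be sketched without any third-party package. This is a minimal illustration, assuming the records were downloaded in GenBank flat-file format, where the source feature normally holds a `/db_xref="taxon:..."` qualifier and the VERSION line holds the accession.version. The function name `taxids_by_accession` is hypothetical, not part of Biopython or the E-utilities.

```python
import re

# Matches the VERSION line (accession.version) of a GenBank flat-file record.
_VERSION_RE = re.compile(r"^VERSION\s+(\S+)", re.MULTILINE)
# Matches the taxonomy cross-reference in the record's source feature.
_TAXON_RE = re.compile(r'/db_xref="taxon:(\d+)"')


def taxids_by_accession(genbank_text):
    """Map accession.version -> NCBI taxonomy ID for each record in the text.

    Hypothetical helper: records lacking either field are skipped silently.
    """
    mapping = {}
    # Each record in a multi-record GenBank file ends with a line holding "//".
    for record in re.split(r"^//\s*$", genbank_text, flags=re.MULTILINE):
        version = _VERSION_RE.search(record)
        taxon = _TAXON_RE.search(record)
        if version and taxon:
            mapping[version.group(1)] = int(taxon.group(1))
    return mapping
```

The returned dictionary is keyed by accession.version, which is also the first token of the FASTA header for the same record, so it can be joined to the sequences afterwards.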


Hi there,

I'll try to follow up on your recommendation in a couple of days (I have exams), but the main problem I ran into was working out how to run an ESearch restricted to a given list of UIDs; ESearch doesn't seem to have a parameter for this.

For example, we do the Taxonomy search and get Taxonomy IDs (ESearch), then we use ELink so we get a one-to-one mapping from each Taxonomy ID to its Nucleotide IDs. So say Taxonomy ID "1234567" was mapped to 10,000 Nucleotide IDs, how do we then filter those 10,000 IDs based on a Nucleotide query?

Cheers
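One possible way to restrict a search to a known list of UIDs is the Entrez history server: upload the UIDs with EPost, then pass the returned `query_key` and `WebEnv` to ESearch, which, as far as the E-utilities documentation describes it, intersects the stored set with the query in `term`. The sketch below uses only the standard library. The function names are hypothetical, and `filter_uids` makes live requests to NCBI, so it is untested here and subject to NCBI's usage limits.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def parse_epost(xml_text):
    """Return (query_key, webenv) from an EPost XML response."""
    root = ET.fromstring(xml_text)
    return root.findtext("QueryKey"), root.findtext("WebEnv")


def filtered_esearch_url(db, term, query_key, webenv, retmax=500):
    """Build an ESearch URL that combines `term` with a stored UID set."""
    params = {
        "db": db,
        "term": term,            # e.g. "cox1[gene]"
        "query_key": query_key,  # identifies the uploaded UID set
        "WebEnv": webenv,        # history-server session holding that set
        "retmax": retmax,        # at most this many UIDs are returned
    }
    return EUTILS + "/esearch.fcgi?" + urllib.parse.urlencode(params)


def filter_uids(db, uids, term):
    """Upload `uids` with EPost, then return those that also match `term`.

    Hypothetical helper; performs network requests against NCBI.
    """
    post_data = urllib.parse.urlencode({"db": db, "id": ",".join(uids)}).encode()
    with urllib.request.urlopen(EUTILS + "/epost.fcgi", data=post_data) as resp:
        query_key, webenv = parse_epost(resp.read())
    url = filtered_esearch_url(db, term, query_key, webenv)
    with urllib.request.urlopen(url) as resp:
        root = ET.fromstring(resp.read())
    return [el.text for el in root.findall("./IdList/Id")]
```

To keep the Taxonomy-to-Nucleotide mapping, `filter_uids("nucleotide", ids_for_one_taxon, "cox1[gene]")` would be called once per Taxonomy ID, which is one reason the reverse approach suggested in this thread is likely to need far fewer requests.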

9.8 years ago
Neilfws 49k

Why not search the nucleotide database for COX1 directly, using the ORGN term?

Insecta[ORGN] AND COX1[GENE]

This currently returns 18,186 results.
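With this many hits, one option is to store the result set on the history server and fetch the FASTA records in batches. The following is a rough sketch, assuming the standard ESearch `usehistory=y` and EFetch `rettype=fasta` parameters. The function names and the batch size of 200 are illustrative choices, and only the URL construction and FASTA parsing are shown, not the network calls.

```python
import urllib.parse

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nucleotide"):
    """ESearch URL that stores the hits on the history server."""
    # retmax=0: only the count, QueryKey and WebEnv are needed at this stage.
    params = {"db": db, "term": term, "usehistory": "y", "retmax": 0}
    return EUTILS + "/esearch.fcgi?" + urllib.parse.urlencode(params)


def efetch_fasta_url(query_key, webenv, retstart, retmax=200, db="nucleotide"):
    """EFetch URL for one batch of FASTA records from a stored result set."""
    params = {
        "db": db,
        "query_key": query_key,
        "WebEnv": webenv,
        "rettype": "fasta",
        "retmode": "text",
        "retstart": retstart,  # offset of the first record in this batch
        "retmax": retmax,      # illustrative batch size
    }
    return EUTILS + "/efetch.fcgi?" + urllib.parse.urlencode(params)


def parse_fasta(text):
    """Map the first header token (accession.version) to its sequence."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else None
            if current is not None:
                records[current] = []
        elif line and current is not None:
            records[current].append(line)
    return {acc: "".join(parts) for acc, parts in records.items()}
```

Calling `esearch_url("Insecta[ORGN] AND COX1[GENE]")` gives the search URL; the `QueryKey` and `WebEnv` values in its response are then passed to `efetch_fasta_url` with `retstart` stepping through 0, 200, 400, and so on.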


It's because I need to be able to use the [RANK] query, which I can't seem to do in the nucleotide database.


It's not clear to me what value RANK adds to your query. I can see that it returns insect genera but, as scapella said, I think it's easier to get COX1 from insects first, then worry about genera later.
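"Worrying about genera later" could look something like the sketch below: once the Taxonomy IDs of the COX1 records are known, their Taxonomy records can be fetched as XML and the genus read from each lineage. This assumes the Taxonomy EFetch XML layout of a `TaxaSet` root with `Taxon` children, each carrying `TaxId`, `ScientificName`, `Rank` and a `LineageEx` list of ancestors; it is worth checking that against a real response. The function name `genus_by_taxid` is hypothetical.

```python
import xml.etree.ElementTree as ET


def genus_by_taxid(taxonomy_xml):
    """Map each TaxId -> (genus TaxId, genus name), or None if no genus is found.

    Hypothetical helper; assumes the TaxaSet/Taxon/LineageEx layout.
    """
    result = {}
    for taxon in ET.fromstring(taxonomy_xml).findall("Taxon"):
        taxid = int(taxon.findtext("TaxId"))
        genus = None
        if taxon.findtext("Rank") == "genus":
            # The record itself is already a genus.
            genus = (taxid, taxon.findtext("ScientificName"))
        else:
            # Otherwise look for a genus-rank ancestor in the lineage.
            for ancestor in taxon.findall("./LineageEx/Taxon"):
                if ancestor.findtext("Rank") == "genus":
                    genus = (
                        int(ancestor.findtext("TaxId")),
                        ancestor.findtext("ScientificName"),
                    )
        result[taxid] = genus
    return result
```

Grouping the nucleotide records by the genus found here gives a genus-level mapping comparable to what the original `genus[RANK]` query was aiming for, without needing RANK in the Nucleotide search.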
