Question: Genbank - Finding Unique Taxon Id'S For A Specific Locus
sebbe.kvist wrote, 6.8 years ago:

Does anyone know of a way to retrieve counts of taxa for a specific locus in GenBank? For example, I can easily pull out the number of COI sequences for Annelida, but what if I want to know how many species of Annelida are represented by COI in GenBank? If I search for Annelida AND COI, I get every sequence for (e.g.) Eisenia fetida, but that species should count only once when I ask how many unique species have COI sequences in GenBank. Does that make sense? I appreciate all ideas on this.

id taxonomy genbank • 1.9k views

Neilfws (Sydney, Australia) wrote, 6.8 years ago:

I don't know that there is an easy way to do this within the NCBI website. However, you can get some of the way with a well-defined search and some scripting.

I would start with a taxonomy search for Annelida. Clicking through (just keep clicking on "Annelida") gets to the summary page, showing (at present) 31 606 nucleotide records.

We can then qualify that search further:

txid6340[Organism:exp] AND COI[Gene]

to retrieve 11 208 records.

At this point, you should click on "Send to -> File" and choose an output format. "GenBank" or "XML" are good choices, since they are structured formats and contain a field for the species. However, this may result in a large download.

GenBank format includes the field ORGANISM, e.g.

ORGANISM  Drilonereis longa

So we can count up the terms that come after the word ORGANISM using grep, cut and awk:

grep ORGANISM sequence.gb |
cut -d " " -f 4- |
awk 'BEGIN{OFS="\t"} {n[$0]++} END {for (i in n) {print i, n[i]}}' > wormcount.txt

Result (first 5 lines only):

Nephtys sp. CMC05    1
Nephtys sp. CMC06    1
Amynthas lini    9
Mesenchytraeus solifugus    64
Satchellius mammalis    11

Number of unique ORGANISM values with COI:

wc -l wormcount.txt
# 1995 wormcount.txt
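
The counting step above can be tried on a small file before committing to the full download. The following is only a sketch: sample.gb, its accession numbers and its three records are made up for illustration, and the awk pattern assumes the usual flat-file layout in which ORGANISM is the first word on its line. It folds the grep and cut steps into awk and sorts the output so the result is reproducible.

```shell
# Build a tiny stand-in for sequence.gb.
# The records and accession numbers are hypothetical; only the ORGANISM lines matter.
cat > sample.gb <<'EOF'
LOCUS       AB000001
  ORGANISM  Eisenia fetida
            Eukaryota; Metazoa; Annelida.
LOCUS       AB000002
  ORGANISM  Eisenia fetida
            Eukaryota; Metazoa; Annelida.
LOCUS       AB000003
  ORGANISM  Drilonereis longa
            Eukaryota; Metazoa; Annelida.
EOF

# Keep lines whose first word is ORGANISM, strip the label and its padding,
# then tally how often each organism name occurs.
awk '$1 == "ORGANISM" { sub(/^ *ORGANISM +/, ""); n[$0]++ }
     END { for (name in n) printf "%s\t%d\n", name, n[name] }' sample.gb |
  sort > samplecount.txt

# One line per unique organism name, followed by the number of unique names.
cat samplecount.txt
wc -l < samplecount.txt
```

Here the two Eisenia fetida records collapse into a single line with a count of 2, which is the "count each species once" behaviour asked for in the question. Note that this counts distinct ORGANISM strings, so entries such as "Nephtys sp. CMC05" are each counted as their own name.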

sebbe.kvist wrote, 6.7 years ago:

Thanks a lot, I thought that it might come down to a bit of simple scripting. This seems straightforward; I'll post if I run into trouble. Thanks again!

sebbe.kvist wrote, 6.7 years ago:

Should I ever publish this information, I would like to credit you in the acknowledgements. What name should I use? Thanks a lot again, this saved me a lot of time! Sebastian
