Question: BLAST multiple staxids
sureshhewabi (Sri Lanka) wrote 2.2 years ago:

I am using the following output format to get my blastp output:

-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids"

However, I get multiple values for staxids. What are they, and why do they appear? I was expecting only one taxonomy ID for the subject. I cannot find this in the BLAST documentation.

Look at this example:


What is the difference between those two? How can a subject have multiple taxonomy IDs?

Reply written 2.2 years ago by sureshhewabi
gb wrote 2.2 years ago:

From the help page:

staxid means Subject Taxonomy ID
staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)

So you can use staxid instead of staxids if you want a single taxonomy ID per hit.
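
For anyone post-processing the table, here is a minimal Python sketch of splitting the staxids column on ';'. The column list is copied from the -outfmt string in the question; the function name parse_hit and the example hit line are made up for illustration (only the two taxonomy IDs come from this thread), so treat it as a sketch rather than a reference parser.

```python
# Column names copied from the -outfmt string in the question.
FIELDS = ("qseqid sseqid pident length mismatch gapopen "
          "qstart qend sstart send evalue bitscore staxids").split()


def parse_hit(line):
    """Turn one tab-separated BLAST hit line into a dict keyed by FIELDS."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    hit = dict(zip(FIELDS, values))
    # staxids holds one or more taxonomy IDs separated by ';'.
    # Non-numeric placeholders (for example N/A) are skipped.
    hit["staxids"] = [int(t) for t in hit["staxids"].split(";") if t.isdigit()]
    return hit


# Made-up hit line for illustration; only the two taxonomy IDs are from the thread.
example = "\t".join([
    "query1", "subject1", "100.000", "120", "0", "0",
    "1", "120", "1", "120", "1e-80", "250", "562;1444084",
])
hit = parse_hit(example)
print(hit["sseqid"], hit["staxids"])
```

Each hit then carries a list of taxonomy IDs, so a hit with a single ID and a hit with several can be handled the same way.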


I agree. Thanks! But I am still curious about how we can get multiple taxonomy IDs when we use the staxids option.

Reply written 2.2 years ago by sureshhewabi

I do not think I have the exact explanation for you, but look up the taxon IDs. In your case the protein comes from Escherichia coli (562), which is at species rank, and from Escherichia coli 3-105-05_S1_C2 (1444084), which is the same species but a specific strain. So I think it has to do with taxon IDs that belong to the same species but carry an extra strain number or code.

Reply written 2.2 years ago by gb

Yes, I also looked at exactly those TaxIDs and came to the same understanding as you. But now I think that maybe other strains of E. coli also have the same sequence, and that is why I get multiple taxonomy IDs.

Reply written 2.2 years ago by sureshhewabi

Do you know the taxonomic assignment program MEGAN? Its manual suggests that those multiple IDs are indeed from other organisms with the same sequence: "[An] entry in a reference database may have more than one taxon associated with it. For example, in the NCBI-NR database, an entry may be associated with up to 1000 different taxa. This implies, in particular, that a read that may be assigned to a high level node (even the root node), even though it only has one significant hit, if the corresponding reference sequence is associated with a number of very different species."

So, if a reference sequence has multiple associated IDs, MEGAN assigns it to their "lowest common ancestor" instead of just to the organism that the sequence came from.

The BLAST help suggests that it's not just 'staxids' that can have multiple entries. 'sscinames', 'scomnames', 'sblastnames' and 'sskingdoms' might refer to these associated taxa too. It would be nice if the documentation was clearer!
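
To make the "lowest common ancestor" idea concrete, here is a small Python sketch over a toy taxonomy. It is not MEGAN's implementation. The PARENT map is a hand-written, simplified assumption (in particular, placing 1444084 directly under 562 and omitting intermediate ranks); a real analysis would load the parent links from the NCBI taxonomy data. The function names are made up for illustration.

```python
# Hypothetical, simplified child -> parent map (intermediate ranks omitted).
# Assumption for illustration: 1444084 sits directly under 562.
PARENT = {
    1444084: 562,  # E. coli strain -> E. coli
    562: 561,      # E. coli -> genus Escherichia
    561: 543,      # genus -> family
    28901: 590,    # Salmonella enterica -> genus Salmonella
    590: 543,      # genus -> family
    543: 1,        # family -> root
    1: 1,          # root is its own parent
}


def lineage(taxid):
    """Return the path from taxid up to the root, starting with taxid itself."""
    path = [taxid]
    while PARENT[path[-1]] != path[-1]:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(taxids):
    """Return the deepest node that lies on every taxid's path to the root."""
    taxids = list(taxids)
    if not taxids:
        raise ValueError("need at least one taxonomy ID")
    common = set(lineage(taxids[0]))
    for taxid in taxids[1:]:
        common &= set(lineage(taxid))
    # Walk up from the first taxid; the first shared node is the deepest one.
    for node in lineage(taxids[0]):
        if node in common:
            return node


print(lowest_common_ancestor([562, 1444084]))
print(lowest_common_ancestor([1444084, 28901]))
```

Under these assumptions, a species and one of its strains resolve to the species, while IDs from two different genera resolve to a higher node, which is the effect the manual describes.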

Reply written 17 months ago by cribdonr0

This is not exactly how MEGAN works or how you can use it. Also, you can only determine a certain species if you look at a specific marker gene like 16S or COI. If you BLAST a COI sequence and get a significant, good hit, you can say that it is the right species. If you do not have a good hit, then you can use MEGAN: you BLAST your COI marker and get, say, 5 hits above a certain threshold. With MEGAN you can then find the lowest common ancestor of those 5 hits, and that will be the identification for your gene.
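
The "hits above a certain threshold" step can be sketched in Python as below. Using the bitscore as the threshold metric is an assumption (the comment above does not say which score is meant), and the hit records, scores, and function name are made up for illustration.

```python
def taxids_above_threshold(hits, min_bitscore):
    """Collect the distinct taxonomy IDs of hits with bitscore >= min_bitscore."""
    taxids = set()
    for hit in hits:
        if hit["bitscore"] >= min_bitscore:
            taxids.update(hit["staxids"])
    return sorted(taxids)


# Made-up hits; scores and IDs are illustrative only.
hits = [
    {"sseqid": "s1", "bitscore": 250.0, "staxids": [562, 1444084]},
    {"sseqid": "s2", "bitscore": 240.0, "staxids": [562]},
    {"sseqid": "s3", "bitscore": 40.0, "staxids": [28901]},
]
print(taxids_above_threshold(hits, min_bitscore=200.0))
```

The resulting list of taxonomy IDs is what a lowest-common-ancestor step would then take as input.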

Reply written 17 months ago by gb

staxids doesn't work for diamond unfortunately.

Reply written 10 months ago by O.rka240