BLAST+ Collapsed Results
2
0
Entering edit mode
8.7 years ago
athey.johnc ▴ 60

I have a question about BLAST+ and the "feature" where identical results get collapsed. (I'm using blast locally, version 2.2.30+ on a Ubuntu 12.04 desktop.)

For example, when I do a blastx with the human factor IX sequence against refseq_protein, a few results down, I see this output:

>ref|XP_005594774.1| PREDICTED: coagulation factor IX [Macaca fascicularis]
 ref|XP_011739181.1| PREDICTED: coagulation factor IX [Macaca nemestrina]
Length = 461

# (... stats/alignment here... )

I have two questions. First, why does this collapsing occur? Even if the protein sequence is identical, the fact that the two entries have unique accession numbers indicates to me that the RefSeq data base should be considering them separately (though in this blog post [http://blastedbio.blogspot.com/2012/05/blast-tabular-missing-descriptions.html], he notes that this phenomenon is occurring in his example due to nr, not blast). Is this a "feature" of RefSeq, blast itself, or both?

Second, how can I avoid this problem? For the project I'm working on, I'm actually interested in seeing every species represented in the list, no matter if their protein sequences are identical. Is there some option in blast that I can configure to disable this behavior? If not, I suppose I'll likely have to output the data in another format and then parse it in Python (or similar); is there an easy way to do this? The tabular output format (outfmt 6), with options 'salltitles' and 'sscinames' has gotten me the closest to what I want to see, but the entries for these collapsed results appear on a single line and make it hard to process both by eye and programmatically. I don't think what I'm looking for is particularly complicated, nor do I imagine it's an uncommon desire, but despite considerable time spent Googling, I still can't find an answer for why this happens or how to avoid it.

Thanks!

blast alignment refseq • 2.1k views
ADD COMMENT
2
Entering edit mode
8.7 years ago
athey.johnc ▴ 60

In case anyone finds this in the future, with BLAST+ 2.2.31 that was released in June 2015, a new output format (outfmt 14), XML2, was released. Unlike the original XML output (outfmt 5), XML2 will properly split the hit ID/description tags, and most centrally to this question, each of these "collapsed" hits will have the information associated with each hit parsed into separate <HitDescr> tags, contained under a single parent <Hit> tag (see here for NCBI's documentation on it).

Though the XML2 format is not without bugs, this new format makes it very easy to retrieve the descriptions (and taxonomy information!) for each collapsed result with an XML parser. Though Biopython (my tool of choice) does not yet have a module to parse this file, I was able to build my own quick-and-dirty XML parser with ElementTree to get the information I needed (and if I could figure it out, probably anyone can).

ADD COMMENT
0
Entering edit mode

You may "accept" your own answer - thus indicating to future readers a good solution - as it indeed solves your problem.

ADD REPLY
0
Entering edit mode

Good to know. Thanks!

ADD REPLY
0
Entering edit mode

It should be done cautiously, as it is easy to misuse it, but in this case I think it is warranted.

ADD REPLY
0
Entering edit mode
8.7 years ago
h.mon 35k

From the RefSeq "About" page:

Distinguishing Features

The main features of the RefSeq collection include:

non-redundancy

...

Probably your best approach is to parse the results and split multi-species records.

ADD COMMENT
0
Entering edit mode

RefSeq is not actually that non-redundant. It's better than nr, which is why I'm using it, but there is significant overlap in a lot of sequences that is actually making my job a lot harder than it needs to be. See for example: NP_001171009.1, NP_001171010.1, NP_899245.1, as well as the other examples pointed out in What Is A "Gene"? Redundancies In Refseq Versus Ensembl, as well as this paragraph from RefSeq's chapter in the NCBI handbook (here): "The non-redundant nature of the RefSeq collection facilitates database inquiries based on genomic location, sequence, or text annotation. Be aware, however, that the RefSeq collection does include alternatively spliced transcripts encoding the same protein or distinct protein isoforms, in addition to orthologs, paralogs, and alternative haplotypes for some organisms, which will affect the outcome of a database query." If my BLAST results were being collapsed for entries like those mouse genes, at least then I would understand it, because they're at least from a single species. In my original question, those macaques aren't even the same species, and RefSeq has already given them separate accessions. In either case, I wish there was at least some option to disable this behavior.

ADD REPLY
0
Entering edit mode

I think non-redundant in this case means "only a single instance of a particular sequence", that is, if two species have identical peptide/RNA/DNA sequence, they will be collapsed, which is what happened to these Macaca sp. records. Thanks for the other post link, it is a good discussion and explanation on the innards of the databases.

ADD REPLY

Login before adding your answer.

Traffic: 2773 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6