Can I Have BLAST Return One Hit Per Species
11.4 years ago
qiyunzhu ▴ 430

Hello all, I've long been troubled by redundant BLAST hits from different strains of the same species, for example E. coli. These hits (dozens to hundreds of them) mask hits from other species that I'm more interested in. Is there a way to have BLAST return only one hit per species? Thanks!

PS: I'm quite okay with writing a script to filter BLAST hits down to one hit per species after I get the hits. But my goal is to do this in BLAST itself by adjusting its parameters, so that the hits are already filtered on the server side and I don't have to face a BLAST hit table where all 100 hits say "E. coli".

ncbi blast • 4.9k views

You didn't give us much information here on the goal of what you are trying to do. Is this for phylogenetic analysis or marker development?

I'm sure someone has a script out there to do this, but I have always done this a somewhat archaic way, using my own suite of databases and then piping that into my phylogenetics pipeline. Since I am mainly interested in phylogenetics, I have to have a way of making sure I have the correct ortholog hit, and not a paralog.

Unless you have a single copy gene you are interested in, how do you know you have the correct hit from each species? How does someone go about designing a BLAST script to do this? Is there a reason why this would not matter in your case?


Hello Josh, thanks for your reply! My purpose is phylogenetics. It sounds like your situation is similar to mine, that is, building a phylogenetics pipeline. I plan to get the best hit from each species and label that sequence with the species name (not the strain name) for the subsequent phylogenetic analysis.

I think you got the point that it's hard to be sure of targeting orthologs this way. I plan to compromise by considering the best hit as the ortholog. This strategy is also applied in some previous phylogenetics pipelines such as PhyloGena, though other specialized programs such as OrthoMCL were designed with more sophisticated considerations. Do you have any suggestions?
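As a rough illustration of the "best hit per species, labelled by species name" plan, here is a minimal post-filtering sketch. It assumes tabular BLAST+ output with four tab-separated columns (qseqid, sseqid, bitscore, sscinames), for instance from something like `-outfmt "6 qseqid sseqid bitscore sscinames"`, which needs the BLAST taxonomy database installed locally. The function names are hypothetical, and collapsing a strain name to its first two words is only a naming heuristic, not a real taxonomy lookup. It does not address the ortholog/paralog question at all.

```python
def species_label(sciname):
    """Collapse a strain-level name to 'Genus species'.

    Heuristic only: takes the first name if several are joined by ';'
    and keeps its first two words.
    """
    first = sciname.split(";")[0]
    return " ".join(first.split()[:2])


def best_hit_per_species(tabular_text):
    """Keep the highest-bitscore hit for each (query, species) pair.

    Assumed columns (tab-separated): qseqid, sseqid, bitscore, sscinames.
    Returns {(qseqid, species): (sseqid, bitscore)}.
    """
    best = {}
    for line in tabular_text.splitlines():
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 4:
            continue  # skip comment lines and malformed rows
        qseqid, sseqid, sciname = fields[0], fields[1], fields[3]
        bitscore = float(fields[2])
        key = (qseqid, species_label(sciname))
        if key not in best or bitscore > best[key][1]:
            best[key] = (sseqid, bitscore)
    return best


if __name__ == "__main__":
    # Made-up rows purely for demonstration.
    demo = (
        "q1\tgi1\t500\tEscherichia coli K-12\n"
        "q1\tgi2\t520\tEscherichia coli O157:H7\n"
        "q1\tgi3\t300\tSalmonella enterica subsp. enterica\n"
    )
    for (query, species), (subject, score) in best_hit_per_species(demo).items():
        print(query, species, subject, score)
```

This runs on the client side after the search, so it does not reduce what the server returns; it only hides the redundant strain-level rows.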


My suggestion is to make a tree with all of your BLAST hits. Make sure your alignment is as good as it can be. Select your clade of interest. Purge the other sequences. Wash and repeat from the BLAST step if needed.

There is some consolation in knowing that you're not going to get everything (I'm getting new bacterial genome emails multiple times a day), but it's more important to choose your clade. If you want orthologs and/or paralogs, you can find out by phylogenetic methods on well-aligned sequences.


Thanks! I can eyeball a tree to find clusters of orthologs. However, is it possible to automate this process? For example, could a program tell me whether hits #1-20 are clustered in one clade and hits #21-25 are not? This assumes, of course, that the tree is already made.


I'm not sure what you mean by "let the program tell"? You're referring to automated clustering here? There are ways to do this, but if you're making a phylogenetic tree, you're already doing more than the majority of clustering programs do and then you can also get an assessment of how similar your sequences are and where their homology is the greatest.

...to answer your question, there are lots of automated clustering programs out there, and it's possible to automate a phylogenetics pipeline. The one I have is just a series of bash shell scripts piped into one another; very simple, but it's about what my small bioinformatics skill set can easily handle.


I guess I did mean automated clustering, but I'm not familiar with the term. Let me try to explain my goal like this: given a phylogenetic tree, say one built by RAxML, I want to find a program that can tell me whether taxa #1, 2, 3, 7, 11 form a monophyletic group on this tree. Do you have any specific recommendations?

I do believe that once I have a tree, I should already have all the clustering information from the tree-building process. But I don't know how to extract this information automatically.
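For what it's worth, the monophyly question can be answered directly from the Newick string, since every pair of parentheses defines one clade. Below is a minimal standard-library sketch under some assumptions: plain Newick without quoted labels or bracketed comments, and hypothetical function names. Tree libraries such as Biopython (Bio.Phylo) or ETE have their own monophyly checks and would handle messier input. Because RAxML trees are unrooted, the `unrooted=True` option also accepts a group whose complement forms a clade in the written tree, that is, a group that is one side of a bipartition.

```python
import re


def clade_leaf_sets(newick):
    """Return (set of all leaf names, list of leaf sets, one per internal node)."""
    text = newick.strip().rstrip(";")
    tokens = re.findall(r"[(),]|[^(),]+", text)
    stack = []    # leaf sets of the clades that are currently open
    clades = []
    leaves = set()
    prev = None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            clades.append(frozenset(done))
            if stack:
                stack[-1] |= done  # a child's leaves also belong to its parent
        elif tok != ",":
            name = tok.split(":")[0].strip()  # drop the branch length
            # A label right after ")" is a support value or internal name.
            if prev != ")" and name:
                leaves.add(name)
                if stack:
                    stack[-1].add(name)
        prev = tok
    return leaves, clades


def is_monophyletic(newick, taxa, unrooted=False):
    """True if `taxa` is exactly the leaf set of some clade in the tree."""
    leaves, clades = clade_leaf_sets(newick)
    group = frozenset(taxa)
    missing = group - leaves
    if missing:
        raise ValueError("taxa not in tree: %s" % sorted(missing))
    if len(group) == 1 or group in clades:
        return True
    if unrooted:
        rest = frozenset(leaves - group)
        return len(rest) <= 1 or rest in clades
    return False


if __name__ == "__main__":
    tree = "((t1:0.1,t2:0.2)95:0.05,(t3:0.1,(t4:0.1,t5:0.1)80:0.2)99:0.05);"
    print(is_monophyletic(tree, ["t3", "t4", "t5"]))
    print(is_monophyletic(tree, ["t2", "t3"]))
```

Note that this only says whether the group is a clade in one particular tree; it says nothing about how well supported that clade is.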

So far the best I can do is to compute a distance matrix from the tree and compare how similar these taxa (sequences) are to one another. But I don't know how, or whether, I can translate this into "clustering" information.
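One simple way to turn a pairwise distance matrix into groups is single-linkage clustering with a cutoff: two taxa end up in the same cluster whenever a chain of pairwise distances at or below the cutoff connects them. The sketch below is only an illustration under assumptions: the function name, the `(a, b)` tuple keys, and the cutoff value are all hypothetical, and clusters found this way are not guaranteed to be monophyletic groups on the tree.

```python
def threshold_clusters(names, dist, cutoff):
    """Single-linkage clusters from pairwise distances.

    `dist` maps (a, b) tuples to a distance, with a listed before b in `names`.
    """
    parent = {n: n for n in names}

    def find(n):
        # Walk up to the cluster representative, shortening the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if dist[(a, b)] <= cutoff:
                parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


if __name__ == "__main__":
    # Made-up distances purely for demonstration.
    taxa = ["t1", "t2", "t3", "t4"]
    d = {
        ("t1", "t2"): 0.1, ("t1", "t3"): 0.9, ("t1", "t4"): 0.9,
        ("t2", "t3"): 0.9, ("t2", "t4"): 0.9, ("t3", "t4"): 0.2,
    }
    print(threshold_clusters(taxa, d, 0.3))
```

The result depends heavily on the cutoff, so it is worth trying a few values and comparing the groups against the tree.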

Excuse me for my bad explanation. I am googling the terms you mentioned now...
