Question: retrieving best hit in local blast output
gravatar for ashish
2.5 years ago by
ashish220 wrote:

I have this doubt about retrieving best hit from the blast tabular output. Based on the aim I retrieve top hit by filtering the output in excel on the basis of gaps, mismatch, query coverage, etc. Now I saw that the raw output is already arranged in the order of sequence similarity i.e the 1st entry is the top hit for each query. So what if we just remove duplicates from the tabular output in excel. Like this only the first entry for each query will be left which should be the top hit. I want to know if this is correct or not. Is removing duplicates enough.

*ignore those queries which map at multiple locations on a single subject for a time being.

blast local blast • 4.2k views
ADD COMMENTlink modified 2.5 years ago by JV380 • written 2.5 years ago by ashish220
-max_target_seqs 1
ADD REPLYlink written 2.5 years ago by

I've considered that but there is some issue with max_target_seqs link I am not very confident with it. What do you think about removing duplicate ids?

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by ashish220

In order to appropriately answer your question we need to understand the reason you are asking it. Why do you want ot select only one blast match for each sequence and why it has to be the top hit. From statistical and algorithm perspective there is not much difference between the first and the second best hit in terms of being correct is they both have very low p-value (high similarity score). In statistic we use blast and other algorithm only as a statistic test for which we can estimate the power of the test and that power is predicted by the tool usually at least in partially by estimating p-value of a finding given a random object (ie random sequence but with the same nucleotide composition, etc).

Let's imagine that you are testing with blast if a given short contig from de novo assembly of a chloroplast genome is due to bacterial contamination. You get 3 hits from blast with scores that pass your thresholds. Let's say that the top hit is to bacteria and two other hits are to chloroplasts of related subspecies from the same specie that you study. Does it mean that you realy observe a contig from real chloroplast or from a bacteria. If you will look only at the top hit you will say that this is likely bacterial contamination. If you will look at the whole picture you might say that another type of test is needed. This is why for us to give you the right answer, we have to know well why do you want to select only the top hit and what should be considered the top hit then.

ADD REPLYlink written 2.5 years ago by Petr Ponomarenko2.6k

I was just wondering if we can do it like this for some special cases, I am not doing any actual analysis. Lets say we have some genes and we want to find out their chromosomal coordinates, and we are sure that none of those genes is present as multiple copies. Here if we just remove the duplicates from the tabular output which will keep just the first result for each query, will we get the top hit like this and is this approach correct in special cases like this.

ADD REPLYlink written 2.5 years ago by ashish220

If you know the exact sequence of the desired gene for the subspicie and know it's genome sequence, than you may use blast in order to find this gene location.

In case you are working with gene sequence from one spice or subspicie and with genome sequence from another specie or subspicie, then I usually use contig alignment or at least area that is bigger then just the gene itself, because this is a question about evolution and which gene and how it was inherited together with its function and neighboring genes

ADD REPLYlink written 2.5 years ago by Petr Ponomarenko2.6k

Blast can be used fir short contigs of say 10k nucleotides easily, and genome alignment algorithms work better for longer contigs

ADD REPLYlink written 2.5 years ago by Petr Ponomarenko2.6k

I will be working on wheat genome next month. Its genome has not yet been fully sequenced. I will take care of what you said while assigning chromosomal coordinates to genes.Thanks

ADD REPLYlink written 2.5 years ago by ashish220

Are you waiting for the next IWGSC assembly to come out? Is there any reason that you cannot use TGACv1?

ADD REPLYlink written 2.5 years ago by cschu1811.8k

What would you use for genome alignment?

ADD REPLYlink written 2.5 years ago by cschu1811.8k

i will use TGACv1, it has contigs greater than 500bp which is a good thing. Also you cannot wait for the new version to come, what if it comes after 1 year. Current version is good enough to give you a fair idea about your query sequences.

ADD REPLYlink written 2.5 years ago by ashish220
gravatar for JV
2.5 years ago by
JV380 wrote:

Well basically, since the blast output is already sorted based on score and evalue, filtering in the way you have in mind would be LARGELY similar to using "-max_target_seqs 1" BUT not identical. The reason is that with the local alignment method used by BLAST, you often do not get one alignment per hit but instead get multiple HSPs (high scoring segment pairs). If you filter duplictates from the results based on query- and subject-name, you will only get the highest scoring segment pair of each hit. Depending on what you want this may be enough, though. (e.g. if you just want the name/id of the respective "best" hit). But keep in mind that BLAST does NOT guarantee that it always gives you the BEST alignment in all cases. Often the second and thirst best hits may be equally (or sometimes even more) appropriate. This is why BLAST based taxonomic classifications are often done using a "least common ancestor" approach based multiple blast hits per query (e.g. using MEGAN or blast2lca)

ADD COMMENTlink written 2.5 years ago by JV380

Thanks for your input JV, I will check out the tools you've mentioned. What do think, is removing duplicates a good idea in some special cases like the one mentioned in my reply to Petr's comment above.

ADD REPLYlink written 2.5 years ago by ashish220

Well it sounds as if you only want to find the position of a certain set of genes within a genome of which you KNOW the queries are present (just don't know where exactly). So yeah, I think that is a case were you could safely rely on the first hit of each query filtered from the results-table. You would not get infos on multiple gene copies and COULD get misleading results if you search non-identical genes (so, only similar) or for genes of which you are NOT certain they are in the genome with identical sequence. But this does not seem to be the case here.

ADD REPLYlink written 2.5 years ago by JV380
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2363 users visited in the last hour