retrieving best hit in local blast output
1
0
Entering edit mode
7.6 years ago
ashish ▴ 680

I have this doubt about retrieving best hit from the blast tabular output. Based on the aim I retrieve top hit by filtering the output in excel on the basis of gaps, mismatch, query coverage, etc. Now I saw that the raw output is already arranged in the order of sequence similarity i.e the 1st entry is the top hit for each query. So what if we just remove duplicates from the tabular output in excel. Like this only the first entry for each query will be left which should be the top hit. I want to know if this is correct or not. Is removing duplicates enough.

*ignore those queries which map at multiple locations on a single subject for a time being.

blast local blast • 19k views
ADD COMMENT
2
Entering edit mode

In order to appropriately answer your question we need to understand the reason you are asking it. Why do you want ot select only one blast match for each sequence and why it has to be the top hit. From statistical and algorithm perspective there is not much difference between the first and the second best hit in terms of being correct is they both have very low p-value (high similarity score). In statistic we use blast and other algorithm only as a statistic test for which we can estimate the power of the test and that power is predicted by the tool usually at least in partially by estimating p-value of a finding given a random object (ie random sequence but with the same nucleotide composition, etc).

Let's imagine that you are testing with blast if a given short contig from de novo assembly of a chloroplast genome is due to bacterial contamination. You get 3 hits from blast with scores that pass your thresholds. Let's say that the top hit is to bacteria and two other hits are to chloroplasts of related subspecies from the same specie that you study. Does it mean that you realy observe a contig from real chloroplast or from a bacteria. If you will look only at the top hit you will say that this is likely bacterial contamination. If you will look at the whole picture you might say that another type of test is needed. This is why for us to give you the right answer, we have to know well why do you want to select only the top hit and what should be considered the top hit then.

ADD REPLY
0
Entering edit mode

I was just wondering if we can do it like this for some special cases, I am not doing any actual analysis. Lets say we have some genes and we want to find out their chromosomal coordinates, and we are sure that none of those genes is present as multiple copies. Here if we just remove the duplicates from the tabular output which will keep just the first result for each query, will we get the top hit like this and is this approach correct in special cases like this.

ADD REPLY
1
Entering edit mode

If you know the exact sequence of the desired gene for the subspicie and know it's genome sequence, than you may use blast in order to find this gene location.

In case you are working with gene sequence from one spice or subspicie and with genome sequence from another specie or subspicie, then I usually use contig alignment or at least area that is bigger then just the gene itself, because this is a question about evolution and which gene and how it was inherited together with its function and neighboring genes

ADD REPLY
0
Entering edit mode

Blast can be used fir short contigs of say 10k nucleotides easily, and genome alignment algorithms work better for longer contigs

ADD REPLY
0
Entering edit mode

I will be working on wheat genome next month. Its genome has not yet been fully sequenced. I will take care of what you said while assigning chromosomal coordinates to genes.Thanks

ADD REPLY
0
Entering edit mode

Are you waiting for the next IWGSC assembly to come out? Is there any reason that you cannot use TGACv1?

ADD REPLY
0
Entering edit mode

What would you use for genome alignment?

ADD REPLY
0
Entering edit mode

i will use TGACv1, it has contigs greater than 500bp which is a good thing. Also you cannot wait for the new version to come, what if it comes after 1 year. Current version is good enough to give you a fair idea about your query sequences.

ADD REPLY
1
Entering edit mode
-max_target_seqs 1
ADD REPLY
0
Entering edit mode

I've considered that but there is some issue with max_target_seqs link I am not very confident with it. What do you think about removing duplicate ids?

ADD REPLY
2
Entering edit mode
7.6 years ago
JV ▴ 470

Well basically, since the blast output is already sorted based on score and evalue, filtering in the way you have in mind would be LARGELY similar to using "-max_target_seqs 1" BUT not identical. The reason is that with the local alignment method used by BLAST, you often do not get one alignment per hit but instead get multiple HSPs (high scoring segment pairs). If you filter duplictates from the results based on query- and subject-name, you will only get the highest scoring segment pair of each hit. Depending on what you want this may be enough, though. (e.g. if you just want the name/id of the respective "best" hit). But keep in mind that BLAST does NOT guarantee that it always gives you the BEST alignment in all cases. Often the second and thirst best hits may be equally (or sometimes even more) appropriate. This is why BLAST based taxonomic classifications are often done using a "least common ancestor" approach based multiple blast hits per query (e.g. using MEGAN or blast2lca)

ADD COMMENT
0
Entering edit mode

Thanks for your input JV, I will check out the tools you've mentioned. What do think, is removing duplicates a good idea in some special cases like the one mentioned in my reply to Petr's comment above.

ADD REPLY
1
Entering edit mode

Well it sounds as if you only want to find the position of a certain set of genes within a genome of which you KNOW the queries are present (just don't know where exactly). So yeah, I think that is a case were you could safely rely on the first hit of each query filtered from the results-table. You would not get infos on multiple gene copies and COULD get misleading results if you search non-identical genes (so, only similar) or for genes of which you are NOT certain they are in the genome with identical sequence. But this does not seem to be the case here.

ADD REPLY

Login before adding your answer.

Traffic: 955 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6