I have a doubt about retrieving the best hit from BLAST tabular output. Depending on the aim, I retrieve the top hit by filtering the output in Excel on gaps, mismatches, query coverage, etc. I have now noticed that the raw output is already ordered by sequence similarity, i.e. the first entry for each query is the top hit. So what if we just remove duplicates from the tabular output in Excel? That way only the first entry for each query is left, which should be the top hit. I want to know whether this is correct. Is removing duplicates enough?
*Ignore, for the time being, queries that map to multiple locations on a single subject.
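For anyone who wants to check this outside Excel, here is a minimal sketch of what "remove duplicates" amounts to. It assumes the default 12-column tabular layout (-outfmt 6: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function names and the sample rows are made up for illustration. It keeps the first row per query and, separately, the highest-bitscore row per query, so the ordering assumption can be verified rather than trusted.

```python
import csv
import io

# Column names of the default BLAST tabular layout (-outfmt 6); assumed here.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def read_hits(handle):
    """Yield one dict per tab-separated hit line, skipping blank and comment lines."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        yield dict(zip(COLUMNS, row))

def first_hit_per_query(hits):
    """Equivalent of 'remove duplicates' on the query column: keep the first row seen."""
    first = {}
    for hit in hits:
        first.setdefault(hit["qseqid"], hit)
    return first

def best_bitscore_per_query(hits):
    """Keep the row with the highest bitscore per query (first one wins on ties)."""
    best = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        if current is None or float(hit["bitscore"]) > float(current["bitscore"]):
            best[hit["qseqid"]] = hit
    return best

# Hypothetical example rows, not real BLAST results.
SAMPLE = (
    "geneA\tchr1\t99.5\t400\t2\t0\t1\t400\t1000\t1399\t0.0\t730\n"
    "geneA\tchr3\t91.0\t380\t30\t4\t10\t389\t5000\t5375\t1e-140\t500\n"
    "geneB\tchr2\t98.0\t300\t6\t0\t1\t300\t900\t601\t1e-150\t520\n"
)

hits = list(read_hits(io.StringIO(SAMPLE)))
first = first_hit_per_query(hits)
best = best_bitscore_per_query(hits)
for query in first:
    print(query, first[query]["sseqid"], "same as best bitscore:", first[query] == best[query])
```

If the two dictionaries ever disagree on your own data, the "first row is the top hit" assumption does not hold for that file and sorting explicitly is safer.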
To answer your question properly, we need to understand why you are asking it. Why do you want to select only one BLAST match for each sequence, and why does it have to be the top hit? From a statistical and algorithmic perspective, there is not much difference between the first and the second best hit in terms of being correct if they both have a very low p-value (high similarity score). In statistics, we use BLAST and other algorithms only as a statistical test whose power we can estimate; the tool usually conveys that power, at least in part, by estimating the p-value of a finding given a random object (i.e. a random sequence with the same nucleotide composition, etc.).
Let's imagine you are using BLAST to test whether a given short contig from a de novo assembly of a chloroplast genome is due to bacterial contamination. You get 3 hits from BLAST with scores that pass your thresholds. Let's say the top hit is to a bacterium and the other two hits are to chloroplasts of related subspecies of the species you study. Does that mean you are really observing a contig from a real chloroplast, or from a bacterium? If you look only at the top hit, you will say this is likely bacterial contamination. If you look at the whole picture, you might say that another type of test is needed. This is why, to give you the right answer, we need to know why you want to select only the top hit and what should then be considered the top hit.
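One way to "look at the whole picture" is to list, for each query, every hit that passes the cutoff and scores close to the best one. The sketch below is only an illustration: the function name, the E-value cutoff, the 0.9 bitscore fraction, and the sample hits are all assumptions, not recommended values.

```python
from collections import defaultdict

def competing_hits(rows, max_evalue=1e-10, min_fraction=0.9):
    """Group hits by query and keep those passing the E-value cutoff whose
    bitscore is at least min_fraction of that query's best bitscore."""
    by_query = defaultdict(list)
    for qseqid, sseqid, evalue, bitscore in rows:
        if evalue <= max_evalue:
            by_query[qseqid].append((sseqid, bitscore))
    result = {}
    for qseqid, hits in by_query.items():
        top = max(score for _, score in hits)
        result[qseqid] = [(s, b) for s, b in hits if b >= min_fraction * top]
    return result

# Hypothetical hits: (query, subject, evalue, bitscore)
rows = [
    ("contig1", "bacterium_X", 1e-80, 310.0),
    ("contig1", "chloroplast_subsp1", 1e-78, 305.0),
    ("contig1", "chloroplast_subsp2", 1e-77, 300.0),
    ("contig1", "unrelated", 1e-12, 60.0),
]

for query, hits in competing_hits(rows).items():
    if len(hits) > 1:
        print(query, "has", len(hits), "near-equivalent hits:", hits)
```

A query flagged this way is one where keeping only the first row would hide hits that are almost as good.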
I was just wondering whether we can do it like this for some special cases; I am not doing any actual analysis. Let's say we have some genes and want to find their chromosomal coordinates, and we are sure that none of those genes is present in multiple copies. If we just remove the duplicates from the tabular output, which keeps only the first result for each query, will we get the top hit, and is this approach correct in special cases like this?
If you know the exact sequence of the desired gene for the subspecies and know its genome sequence, then you may use BLAST to find the gene's location.
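As a small sketch of turning a chosen hit into a location: in nucleotide BLAST tabular output, sstart is larger than send when the match is on the minus strand of the subject, so the coordinates need to be ordered and the strand recorded. The function name and the example values are hypothetical.

```python
def hit_to_location(sseqid, sstart, send):
    """Return (subject, start, end, strand) with start <= end.
    Assumes nucleotide hits, where sstart > send indicates the minus strand."""
    sstart, send = int(sstart), int(send)
    start, end = sorted((sstart, send))
    strand = "+" if sstart <= send else "-"
    return sseqid, start, end, strand

# Hypothetical hits taken from the sstart/send columns of a tabular file.
print(hit_to_location("chr1", "1000", "1399"))  # ('chr1', 1000, 1399, '+')
print(hit_to_location("chr2", "900", "601"))    # ('chr2', 601, 900, '-')
```

This only reports where the aligned part of the query lands; if the alignment does not cover the whole gene, the true gene boundaries extend beyond these coordinates.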
If you are working with a gene sequence from one species or subspecies and a genome sequence from another species or subspecies, then I usually use contig alignment, or at least a region bigger than the gene itself, because this becomes a question about evolution: which gene was inherited, and how, together with its function and neighboring genes.
BLAST can easily be used for short contigs of, say, 10k nucleotides, while genome alignment algorithms work better for longer contigs.
I will be working on the wheat genome next month. Its genome has not yet been fully sequenced. I will keep in mind what you said while assigning chromosomal coordinates to genes. Thanks.
Are you waiting for the next IWGSC assembly to come out? Is there any reason that you cannot use TGACv1?
What would you use for genome alignment?
I will use TGACv1; it has contigs longer than 500 bp, which is a good thing. Also, you cannot wait for the new version to come; what if it comes after a year? The current version is good enough to give you a fair idea about your query sequences.
I've considered that, but there is some issue with max_target_seqs (link) and I am not very confident with it. What do you think about removing duplicate IDs?