I am performing tblastn with a set of >1000 proteins as queries against a genome.
I am trying to keep every region of my genome that matches a query protein (e-value > 1e-10), but in many cases a single genome region has many hits (several queries matching the same region). This is mostly because my proteins are all similar (same gene family).
For example:
query1 hits scaffold1 from coordinates 60 to 120 (E = 1e-5)
query2 hits scaffold1 from coordinates 70 to 110 (E = 1e-3)
To filter those results, I would like to find a way to: 1) find regions with overlapping hits from different queries, and 2) keep only the best hit in each of these regions (based on e-value).
(Here, I would keep coordinates 60 to 120 on scaffold1.)
I have tabular output from BLAST (outfmt 6), but I can't find an efficient way to apply such filters.
I would prefer something in R or bash, but I could try to understand other languages.
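To make the two steps concrete, here is a rough bash sketch of one possible approach, not a tested pipeline. It assumes the default 12-column outfmt 6 layout (subject start and end in columns 9 and 10, e-value in column 11, bit score in column 12), that hits were already filtered on e-value, and that "overlapping" means sharing at least one base on the same scaffold regardless of strand. The function name `best_hit_per_region` is made up for illustration.

```shell
# Keep the best hit (lowest e-value, ties broken by highest bit score)
# from each cluster of overlapping hits on the same subject sequence.
# Argument: a BLAST tabular file (-outfmt 6, default 12 columns assumed).
best_hit_per_region() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             # tblastn reports sstart > send for minus-strand hits, so
             # prepend subject, normalised start and end as sort keys
             s = ($9 + 0 < $10 + 0) ? $9 : $10
             e = ($9 + 0 < $10 + 0) ? $10 : $9
             print $2, s, e, $0
         }' "$1" |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n |
    awk 'BEGIN { FS = OFS = "\t" }
         function emit() { if (best != "") print best }
         {
             # rebuild the original 12-column line (fields 4 to 15)
             line = $4
             for (i = 5; i <= NF; i++) line = line OFS $i

             if ($1 != subj || $2 + 0 > cluster_end) {
                 # different scaffold or no overlap: close the cluster
                 emit()
                 subj = $1; cluster_end = $3 + 0
                 best = line; best_e = $14 + 0; best_bits = $15 + 0
             } else {
                 # overlap: extend the cluster, keep the better hit
                 if ($3 + 0 > cluster_end) cluster_end = $3 + 0
                 if ($14 + 0 < best_e || ($14 + 0 == best_e && $15 + 0 > best_bits)) {
                     best = line; best_e = $14 + 0; best_bits = $15 + 0
                 }
             }
         }
         END { emit() }'
}
```

It would be called as `best_hit_per_region hits.tsv > best_hits.tsv`, where `hits.tsv` is a placeholder file name. Note that chained overlaps (A overlaps B, B overlaps C) are merged into a single region in this sketch, so only one hit is kept for the whole chain; that may or may not be the desired behaviour for a large gene family.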
Thanks for your help,