We are trying to identify novel viruses from sequencing data using BLAST. The idea is to sequence siRNAs from plants and use BLAST to find virus-associated sequences. We run blastx against the viral RefSeq database. The problem is that even with a stringent E-value cutoff (10E-20), we get a lot of false positives.
By false positive, we mean that the BLAST result shows a virus hit, but when we BLAST that contig again, it matches a plant genome. How can we filter our results so that the hits reported as viral are actually viruses?
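
To make the manual re-BLAST check concrete, here is a rough sketch of how it might be automated. It assumes both searches (one against viral RefSeq, one against a plant or host database) were run with the same BLAST program and saved in tabular format (`-outfmt 6`, default 12 columns, bit score in the last column). The function names, the `min_margin` parameter, and the contig and subject IDs are made up for illustration. A contig is kept only if its best viral bit score beats its best host bit score.

```python
import csv
import io

def best_bitscores(handle):
    """Return {query_id: best bit score} from BLAST tabular output (-outfmt 6)."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines (e.g. from -outfmt 7)
        query, bitscore = row[0], float(row[11])  # column 12 is the bit score
        if bitscore > best.get(query, 0.0):
            best[query] = bitscore
    return best

def likely_viral(viral_handle, host_handle, min_margin=0.0):
    """Keep queries whose best viral hit outscores their best host hit.

    min_margin is a hypothetical tuning knob: the viral bit score must exceed
    the host bit score by more than this amount.
    """
    viral = best_bitscores(viral_handle)
    host = best_bitscores(host_handle)
    return sorted(
        q for q, score in viral.items() if score > host.get(q, 0.0) + min_margin
    )

# Made-up example rows: qseqid, sseqid, pident, length, mismatch, gapopen,
# qstart, qend, sstart, send, evalue, bitscore
viral_hits = io.StringIO(
    "contig1\tvirus_protA\t90.0\t50\t5\t0\t1\t150\t1\t50\t1e-25\t110\n"
    "contig2\tvirus_protB\t85.0\t48\t7\t0\t1\t144\t1\t48\t1e-21\t95\n"
    "contig3\tvirus_protC\t80.0\t45\t9\t0\t1\t135\t1\t45\t1e-20\t80\n"
)
host_hits = io.StringIO(
    "contig1\tplant_protX\t60.0\t30\t12\t0\t1\t90\t1\t30\t1e-5\t40\n"
    "contig2\tplant_protY\t99.0\t60\t1\t0\t1\t180\t1\t60\t1e-40\t180\n"
)

kept = likely_viral(viral_hits, host_hits)
print(kept)
```

In this toy data, contig2 would be dropped because its plant hit scores higher than its viral hit, which is the false-positive pattern described above. Whether a simple bit-score comparison is strict enough for real data is an open question.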