We are trying to identify novel viruses from sequencing data using BLAST. The idea is to sequence siRNAs from plants and use BLAST to find virus-associated sequences. We run blastx against the viral RefSeq database. The problem is that even with low E-value cutoffs (10E-20), we get a lot of false positives.

By false positive, we mean that the BLAST result shows a virus hit, but when we BLAST that contig again, it matches a plant genome. How can we filter our results so that the hits reported as viral are actually viral?



biobio40

When you blast it again, you blast it against the same database of viruses, correct? How can you get plant results then?

RamRS

No, sorry. We take the interesting results and BLAST them against NR using the web interface.

biobio40

That's why you're seeing plant results: when the search is not filtered by organism, plant sequences are always a better fit than viral sequences.

RamRS

But if the sequences are actually from viruses, shouldn't viruses be the best hit?

biobio40

They're not sequences from viruses; they're small plant (host) molecules that target complementary nucleotide sequences. What they complement could be either host or foreign (viral).

When you get hits against plants for a given siRNA, I can think of two reasons:

    1. You're just finding that siRNA in the plant's genome

  - or -

    2. You're finding that siRNA's target in the plant's genome

pld

Ah okay, that makes sense. So when doing blast against the viral database, is it possible to remove the plant hits without doing a blast against NR?

biobio40

Well, for one, I'm not sure why you're using BLASTX. siRNAs are sequence-specific and target mRNA molecules, not proteins, so BLASTX, which compares a translated query against protein sequences, doesn't make sense here.

This is the tricky part. On one hand, you should still search against the host. On the other hand, even if an siRNA matches a host gene, it may still have antiviral activity in vivo (either by silencing a gene the virus needs or by silencing the virus directly).
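If you do search against the host as well, one way to use both result sets is to keep only the contigs whose best viral hit scores higher than their best host hit. Below is a minimal Python sketch of that comparison, not a tested pipeline. It assumes both searches were run with the same BLAST program and saved in the standard 12-column tabular format (`-outfmt 6`, bit score in the last column). The function names `best_bitscores` and `virus_candidates`, the `min_margin` parameter, and the sample rows are all made up for illustration. It compares bit scores rather than E-values because E-values depend on database size.

```python
import csv
import io


def best_bitscores(handle):
    """Return {query_id: highest bit score} from BLAST tabular (-outfmt 6) output."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, bitscore = row[0], float(row[11])  # column 12 is the bit score
        if bitscore > best.get(query, 0.0):
            best[query] = bitscore
    return best


def virus_candidates(viral_best, host_best, min_margin=0.0):
    """Keep queries whose best viral hit beats their best host hit by more than min_margin."""
    keep = []
    for query, viral_score in viral_best.items():
        host_score = host_best.get(query, 0.0)  # no host hit counts as a score of 0
        if viral_score - host_score > min_margin:
            keep.append(query)
    return sorted(keep)


# Invented example rows (hypothetical contig and subject names).
viral_tsv = (
    "contig1\tvirA\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-90\t350\n"
    "contig2\tvirB\t85.0\t150\t22\t0\t1\t150\t1\t150\t1e-30\t120\n"
)
host_tsv = "contig2\tchr3\t99.0\t150\t1\t0\t1\t150\t500\t649\t1e-70\t270\n"

viral_best = best_bitscores(io.StringIO(viral_tsv))
host_best = best_bitscores(io.StringIO(host_tsv))
print(virus_candidates(viral_best, host_best))  # ['contig1']
```

Here contig2 is dropped because its host hit (270) outscores its viral hit (120). Given the caveat above, contigs that fail this filter are not necessarily uninteresting, so it may be worth writing them to a separate list rather than discarding them.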

I would also narrow the search to plant viruses; there's no sense in searching animal viruses.

pld
