I would download the reference sequence collection for your virus of interest, or for a set of target virus species, then build your own BLAST database from it and align your reads against that database. Limiting the search space to the reference sequences of the species you care about speeds the process up and spares you any tedious parsing and filtering of BLAST output files.
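As a rough sketch of the build-and-align step, assuming the BLAST+ command-line tools (`makeblastdb`, `blastn`) are installed and your reads are already in FASTA format: the file names, database name and E-value cutoff below are placeholders I made up for illustration, not recommendations.

```python
import subprocess


def makeblastdb_cmd(reference_fasta, db_name):
    """Command line that builds a nucleotide BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", reference_fasta, "-dbtype", "nucl", "-out", db_name]


def blastn_cmd(reads_fasta, db_name, out_tsv, evalue="1e-5"):
    """Command line that aligns reads against the database, tabular output (format 6)."""
    return [
        "blastn",
        "-query", reads_fasta,
        "-db", db_name,
        "-evalue", evalue,   # placeholder cutoff, tune for your data
        "-outfmt", "6",
        "-out", out_tsv,
    ]


def run_pipeline(reference_fasta, reads_fasta, db_name="viral_refs", out_tsv="hits.tsv"):
    """Build the database, then align the reads. Requires BLAST+ on the PATH."""
    subprocess.run(makeblastdb_cmd(reference_fasta, db_name), check=True)
    subprocess.run(blastn_cmd(reads_fasta, db_name, out_tsv), check=True)
```

Calling `run_pipeline("target_refs.fasta", "reads.fasta")` would write the tabular hits to `hits.tsv`. Because the database only contains your target references, that table should already be small enough to read without much post-filtering.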
Sticking to reference sequences also cuts out the published sequences of individual genes, clinical isolates and so on. Many virus species have been sequenced time and time again, whether for the primary literature or from clinical isolates. If you have an ortholog to one such species, you're going to end up with a huge pile of messy results. For now, trim the excess out. If you're interested in finer-grained detail at the subspecies level, you can go back to these sequences once you know what species you're working with.
Another reason to select target species is to avoid confounding results caused by orthologous regions or genes shared between virus species. If your interest is in virus-driven oncogenesis in humans, plant and fish viruses are probably of little interest. Other classes of human viral pathogen should also be easy to exclude: I doubt you would have to consider CCHFV, Machupo or Nipah as potential oncogenic viruses. If you are worried that viruses not traditionally considered oncogenic might be present, try to come up with specific candidates. I see zero reason to include every virus species for which sequence data exists.
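If you already have a bulk FASTA download and just want to keep your target species, something along these lines would do it. This is only a sketch: it assumes the species name appears in each FASTA header line, and the function name, headers and sequences in the example are made up for illustration.

```python
def filter_fasta(lines, target_species):
    """Keep only the FASTA records whose header mentions a target species.

    Matching is a case-insensitive substring test on the header line,
    which assumes the species name is written out in the header.
    """
    targets = [name.lower() for name in target_species]
    keep = False
    kept = []
    for line in lines:
        if line.startswith(">"):
            # New record: decide once whether to keep it and its sequence lines.
            header = line.lower()
            keep = any(name in header for name in targets)
        if keep:
            kept.append(line)
    return kept


# Made-up records, for illustration only.
example = [
    ">ref1 Human papillomavirus type 16, complete genome",
    "ACTGACTG",
    ">ref2 Tobacco mosaic virus, complete genome",
    "GGCCGGCC",
    ">ref3 Hepatitis B virus, complete genome",
    "TTAATTAA",
]
selected = filter_fasta(example, ["human papillomavirus", "hepatitis B virus"])
```

Substring matching on headers is crude, so check the kept headers by eye before building the database from them.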
This approach also solves the expression vector problem.
I would really avoid trying to download things with Entrez queries and Biopython; you'll end up with piles of stuff you don't care about. The link below provides a means of filtering viruses taxonomically and by host, with the ability to download sequences in bulk.
Worst-case scenario, you can do some manual hunting. I know everyone here wants to write a script (me included), but sometimes it really is faster just to hunt down some sequences by hand.
4.9 years ago by
pld ♦ 4.7k