Question: Identifying Viral Sequences In Next Generation Sequencing Data
6.6 years ago by
United States
Wayne990 wrote:

I am attempting to find viral sequences in RNA-seq data from sequenced tumors. So far I have converted the BAM files to FASTA files, downloaded the RefSeq viral dataset from NCBI, formatted it as a BLAST database, installed BLAST locally, and used the FASTA files as the query against the viral BLAST database. This is painfully slow, and most of the results I'm getting seem to be expression vectors or other garbage. Any ideas on a better way to do this, or perhaps a better way to filter the BLAST results?


rna-seq blast sequencing • 8.0k views
ADD COMMENTlink modified 5.1 years ago by pld4.8k • written 6.6 years ago by Wayne990

What's the best way to download NCBI's viral sequences?

Doing a quick search, it is possible to find 1526387 results.[Organism%3Aexp]

But what's the best way to download it? I'm trying to use Biopython + Esearch, but it seems that many sequences are missing.

ADD REPLYlink written 5.9 years ago by Leandro Lima920

Please ask this as a new question. Thanks.

ADD REPLYlink written 5.9 years ago by Pierre Lindenbaum118k

Thank you for the tip, Pierre.

Actually I found related posts about it.

NCBI refseq viral genomes

How to download gene sequences from NCBI gene

problem when downloading large number of sequences from Genbank

I'll try a little bit more before creating a new question.

ADD REPLYlink modified 5.9 years ago • written 5.9 years ago by Leandro Lima920
6.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

You could get some inspiration from

"Rapid identification of non-human sequences in high-throughput sequencing datasets"

Bioinformatics (2012) 28 (8): 1174-1175. doi: 10.1093/bioinformatics/bts100

Rapid identification of non-human sequences (RINS) is an intersection-based pathogen detection workflow that utilizes a user-provided custom reference genome set for identification of non-human sequences in deep sequencing datasets. In <2 h, RINS correctly identified the known virus in the dataset SRR73726 and is compatible with any computer capable of running the prerequisite alignment and assembly programs. RINS accurately identifies sequencing reads from intact or mutated non-human genomes in a dataset and robustly generates contigs with these non-human sequences

ADD COMMENTlink written 6.6 years ago by Pierre Lindenbaum118k

Hi Pierre,

I am doing viral detection on RNA-seq data and found this post. Have you used RINS for this purpose? Any thoughts on this software? I just wanted to know whether it has given results that have been validated or confirmed by other methods.

ADD REPLYlink written 2.1 years ago by Ron910
6.6 years ago by
JC7.6k wrote:

As mentioned before, you can extract the unmapped reads from the BAM file and map them to an indexed collection of viral sequences with Bowtie/BWA. This will be much faster than simply using BLAST.
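
A minimal sketch of the "extract the unmapped reads" step, assuming the BAM has first been dumped to SAM text (for example with `samtools view -h`). The function name and the example records are made up for illustration. It keeps reads whose SAM flag has the 0x4 (unmapped) bit set and writes them as FASTA, ready to be handed to a mapper.

```python
UNMAPPED_FLAG = 0x4  # SAM flag bit meaning "this read is unmapped"


def unmapped_reads_to_fasta(sam_lines):
    """Yield one FASTA record per unmapped read found in SAM text lines."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        # Keep the read only if the unmapped bit is set and a sequence is stored.
        if flag & UNMAPPED_FLAG and seq != "*":
            yield f">{name}\n{seq}\n"


# Hypothetical SAM records: r1 is mapped, r2 and r3 are unmapped.
example_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII",
    "r3\t77\t*\t0\t0\t*\t*\t0\t0\tTTAA\tIIII",
]

for record in unmapped_reads_to_fasta(example_sam):
    print(record, end="")
```

In practice `samtools view -b -f 4` does the same selection directly on the BAM; the Python version is only meant to show what is being selected.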

ADD COMMENTlink written 6.6 years ago by JC7.6k
6.6 years ago by
Sukhdeep Singh9.6k wrote:

In the case of ChIP-Seq, I would map the reads to the viral genome using any mapper (BWA or Bowtie). You could do the same with TopHat (which uses Bowtie2), just with a different genome, and count how many reads mapped.

ADD COMMENTlink modified 6.6 years ago • written 6.6 years ago by Sukhdeep Singh9.6k

I might map reads to the host genome first, take the unmapped reads, and then map those (which I assume are more likely to be of viral origin) with Bowtie/BWA against the viruses as a separate indexed database.
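
A rough sketch of that two-stage idea, written as a small Python helper that only builds the command lines (it does not run anything). It assumes Bowtie2 is the mapper and that both indexes already exist; the helper name, file names, and index names are all placeholders. `--un` writes the reads that failed to align to the host, and `--no-unal` keeps unaligned reads out of the viral SAM output.

```python
import shlex


def two_stage_commands(reads, host_index, viral_index, threads=4):
    """Build Bowtie2 commands: host alignment first, then viral alignment."""
    non_host = "non_host.fastq"  # placeholder name for the host-unaligned reads
    host_cmd = [
        "bowtie2", "-p", str(threads),
        "-x", host_index,   # prebuilt host index (assumed to exist)
        "-U", reads,        # unpaired input reads
        "--un", non_host,   # write reads that did not align to the host
        "-S", "host.sam",
    ]
    viral_cmd = [
        "bowtie2", "-p", str(threads),
        "--no-unal",        # omit unaligned reads from the SAM output
        "-x", viral_index,  # prebuilt viral index (assumed to exist)
        "-U", non_host,     # only the reads left over from the host step
        "-S", "viral.sam",
    ]
    return [host_cmd, viral_cmd]


for cmd in two_stage_commands("tumor_reads.fastq", "host_index", "viral_index"):
    print(shlex.join(cmd))
```

The commands are returned as lists so they could be passed to `subprocess.run` as they are. For RNA-seq, a spliced aligner may be a better fit for the host step.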

ADD REPLYlink written 6.6 years ago by Daniel Swan13k

Yeah, considering the low homology between host and viruses, that absolutely makes sense. Alternatively, one can take the virus-mapped set of reads and map it to the host to see how much is lost. But the benefit of doing it the other way round, as you are saying, is that you can take this unmapped set, map it to a number of different genomes, and count the reads for each.
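
For the counting part, here is a small sketch that tallies mapped reads per reference sequence from SAM text. The function name and the reference names in the example are invented. Secondary and supplementary alignments are skipped so that each read is counted at most once.

```python
from collections import Counter

UNMAPPED = 0x4
SECONDARY_OR_SUPPLEMENTARY = 0x100 | 0x800


def count_mapped_per_reference(sam_lines):
    """Count primary mapped alignments per reference name (SAM column 3)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & UNMAPPED or flag & SECONDARY_OR_SUPPLEMENTARY:
            continue
        counts[fields[2]] += 1
    return counts


# Hypothetical alignments against two made-up viral references.
example_sam = [
    "@SQ\tSN:virus_A\tLN:8000",
    "r1\t0\tvirus_A\t10\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tvirus_A\t50\t42\t4M\t*\t0\t0\tGGCC\tIIII",
    "r2\t256\tvirus_B\t7\t0\t4M\t*\t0\t0\tGGCC\tIIII",
    "r3\t0\tvirus_B\t99\t42\t4M\t*\t0\t0\tTTAA\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tCCCC\tIIII",
]

for reference, n in sorted(count_mapped_per_reference(example_sam).items()):
    print(reference, n)
```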

ADD REPLYlink modified 6.6 years ago • written 6.6 years ago by Sukhdeep Singh9.6k
6.6 years ago by
United States
seidel6.8k wrote:

You might check out Joe DeRisi's work (UCSF). He's been doing this kind of thing for a long time, and discusses a lot of the caveats in various publications.

Expression vectors should be easy to filter out. Are you also filtering out human sequences prior to BLAST? Have you considered incorporating BLAT, or creating some kind of alignment index from the viral data set so that you can easily identify matching viral reads? Is your painfully slow part running BLAST? Or filtering the BLAST results?

ADD COMMENTlink written 6.6 years ago by seidel6.8k

I have already filtered out the human transcriptome. BLASTing the leftover reads against the viral BLAST database is definitely slow, but the slowest part is working out how to filter all the garbage that comes out.
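
One way to cut down the garbage, sketched under the assumption that BLAST was run with tabular output (`-outfmt 6`, default 12 columns): keep only hits above some identity, length, and e-value cutoffs, drop subjects on an exclusion list (expression vectors, for instance), and keep the best-scoring hit per read. The cutoff values, the function name, and the sequence IDs below are illustrative, not recommendations.

```python
def filter_blast_hits(lines, min_identity=90.0, min_length=50,
                      max_evalue=1e-10, exclude=()):
    """Return {query_id: best_subject_id} for tabular BLAST hits passing the cutoffs."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.rstrip("\n").split("\t")
        # outfmt 6 columns: qseqid sseqid pident length mismatch gapopen
        #                   qstart qend sstart send evalue bitscore
        qseqid, sseqid = f[0], f[1]
        pident, length = float(f[2]), int(f[3])
        evalue, bitscore = float(f[10]), float(f[11])
        if pident < min_identity or length < min_length or evalue > max_evalue:
            continue
        if sseqid in exclude:
            continue
        # Keep only the highest-bitscore surviving hit for each query read.
        if qseqid not in best or bitscore > best[qseqid][1]:
            best[qseqid] = (sseqid, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


# Hypothetical hits: read1 matches a virus and a vector, read2 is a weak hit.
example_hits = [
    "read1\tvirus_A\t98.0\t75\t1\t0\t1\t75\t100\t174\t1e-30\t130",
    "read1\tvector_X\t99.0\t75\t0\t0\t1\t75\t5\t79\t1e-32\t139",
    "read2\tvirus_B\t80.0\t40\t8\t0\t1\t40\t10\t49\t1e-3\t40",
]

print(filter_blast_hits(example_hits, exclude={"vector_X"}))
```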

ADD REPLYlink written 6.6 years ago by Wayne990
5.1 years ago by
United States
pld4.8k wrote:

I would download the reference sequence collections for your virus of interest or for some set of target virus species. From there, compile your own BLAST database and align your reads against it. By limiting the search space to only the reference sequences of the species of interest, you speed the process up and avoid having to do any tedious BLAST output file parsing and filtering.

This will cut down on published sequences of specific genes, clinical isolates and so on. Many species of virus have been sequenced time and time again, either from primary literature or clinical isolates. If you have an ortholog to one such species, you're going to have a huge pile of messy results. For now, trim the excess out. If you're interested in detecting more fine-grained detail at the subspecies level, you can go back to these once you know what species you're working with.

Another reason to select target species is to avoid confounding results due to orthologous regions or genes between species of virus. If your interest is in viral-driven oncogenesis in humans, plant and fish viruses are probably of little interest. Additionally, other classes of human viral pathogen should be easy to exclude. I doubt you would have to consider CCHFV, Machupo or Nipah as potential oncogenic viruses. If you are worried about viruses not traditionally considered oncogenic being present, try to come up with candidates. I see zero reason to include every species of virus for which there is sequence data present.

This approach also solves the expression vector problem.
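
As a rough illustration of restricting the database to target species, the sketch below filters a multi-FASTA file by header text: it keeps records whose header mentions one of the target names and drops anything that looks like a vector or synthetic construct. The function names, the keyword lists, and the example records are assumptions. Header text is not a reliable taxonomy, so treat this as a first pass only.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_target_records(lines, targets,
                          blocked=("vector", "synthetic construct")):
    """Keep records matching a target name and not matching a blocked keyword."""
    targets = [t.lower() for t in targets]
    for header, seq in parse_fasta(lines):
        lowered = header.lower()
        if any(word in lowered for word in blocked):
            continue  # drop vectors and synthetic constructs
        if any(name in lowered for name in targets):
            yield header, seq


# Hypothetical records with made-up IDs and sequences.
example_fasta = [
    ">seq1 Human papillomavirus type 16, complete genome",
    "ACGT",
    "ACGT",
    ">seq2 Cloning vector pExample",
    "GGGG",
    ">seq3 Tobacco mosaic virus, complete genome",
    "TTTT",
]

wanted = ["human papillomavirus", "hepatitis b virus"]
for header, seq in select_target_records(example_fasta, wanted):
    print(f">{header}\n{seq}")
```

The filtered FASTA can then be turned into a BLAST database with `makeblastdb -in <file> -dbtype nucl`.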

I would really avoid trying to download things with Entrez queries and Biopython; you'll end up with piles of stuff you don't care about. The link below provides a means of filtering viruses taxonomically and by host, with the ability to download sequences in bulk.

Worst case scenario you can do some manual hunting. I know everyone here wants to write a script (me included), but sometimes it really is faster just to manually hunt down some sequences.

ADD COMMENTlink written 5.1 years ago by pld4.8k