Question

Identifying Viral Sequences In Next Generation Sequencing Data

5

11.7 years ago

Wayne ★ 1.0k

I am attempting to find viral sequences in RNA-seq data from sequenced tumors. So far I have converted the BAM files to FASTA files, downloaded the RefSeq viral dataset from NCBI, formatted it as a BLAST database, installed BLAST locally, and used the FASTA files as queries against the viral BLAST database. This is painfully slow, and most of the results I'm getting seem to be expression vectors or other garbage. Any ideas on a better way to do this, or perhaps a better way to filter the BLAST results?

Thanks!!

rna-seq sequencing blast • 10k views

ADD COMMENT • link updated 10.2 years ago by pld 5.1k • written 11.7 years ago by Wayne ★ 1.0k

0

What's the best way to download NCBI's viral sequences?

A quick search returns 1526387 results: http://www.ncbi.nlm.nih.gov/nuccore/?term=txid10239[Organism%3Aexp]

But what's the best way to download it? I'm trying to use Biopython + Esearch, but it seems that many sequences are missing.

ADD REPLY • link 11.0 years ago by Leandro Lima ▴ 970

0

Please ask this as a new question. Thanks.

ADD REPLY • link 11.0 years ago by Pierre Lindenbaum 161k

0

Thank you for the tip, Pierre.

Actually I found related posts about it.

NCBI refseq viral genomes

How to download gene sequences from NCBI gene

problem when downloading large number of sequences from Genbank

I'll try a little bit more before creating a new question.

ADD REPLY • link 11.0 years ago by Leandro Lima ▴ 970

score 5 · Answer 1 · 2012-08-22

You could get some inspiration from http://bioinformatics.oxfordjournals.org/content/28/8/1174

"Rapid identification of non-human sequences in high-throughput sequencing datasets"

Bioinformatics (2012) 28 (8): 1174-1175. doi: 10.1093/bioinformatics/bts100

Rapid identification of non-human sequences (RINS) is an intersection-based pathogen detection workflow that utilizes a user-provided custom reference genome set for identification of non-human sequences in deep sequencing datasets. In <2 h, RINS correctly identified the known virus in the dataset SRR73726 and is compatible with any computer capable of running the prerequisite alignment and assembly programs. RINS accurately identifies sequencing reads from intact or mutated non-human genomes in a dataset and robustly generates contigs with these non-human sequences.

score 3 · Answer 2 · 2012-08-22

11.7 years ago

JC 13k

As mentioned before, you can extract the unmapped reads from the BAM file and map them to an indexed collection of viral sequences with Bowtie/BWA. This will be much faster than simply using BLAST.
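To illustrate the extraction step, here is a minimal Python sketch that pulls unmapped reads out of SAM-formatted text and emits them as FASTA. It assumes the BAM file has already been converted to SAM text and relies only on the SAM flag bit 0x4 (segment unmapped). The function name and the example records are made up for illustration, and in practice a dedicated tool such as samtools would normally do this step.

```python
from typing import Iterable, Iterator

UNMAPPED_FLAG = 0x4  # SAM flag bit: segment unmapped


def unmapped_as_fasta(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA records for reads whose SAM flag marks them as unmapped."""
    for line in sam_lines:
        if line.startswith("@"):  # header lines start with '@'
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # SAM has 11 mandatory columns
            continue
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        # Keep only unmapped reads that actually carry a sequence.
        if flag & UNMAPPED_FLAG and seq != "*":
            yield f">{name}\n{seq}"


# Made-up example records: read1 is mapped, read2 is unmapped.
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
]
records = list(unmapped_as_fasta(example))
print("\n".join(records))
```

For paired-end data both mates share a read name, so a real pipeline would need to keep the mates distinguishable before mapping them to the viral index.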

ADD COMMENT • link 11.7 years ago by JC 13k

score 2 · Answer 3 · 2012-08-22

11.7 years ago

Sukhi Singh 11k

For ChIP-Seq, I would map the reads to the viral genome using any mapper (BWA or Bowtie). You could do the same with TopHat (which uses Bowtie2), just with a different genome, and count how many reads map.
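The counting step could look roughly like the sketch below, which tallies primary mapped reads per reference sequence from SAM text. It assumes the mapper wrote standard SAM output. The reference names `virusA` and `virusB` and the function name are hypothetical.

```python
from collections import Counter
from typing import Iterable

UNMAPPED_FLAG = 0x4         # segment unmapped
SECONDARY_FLAG = 0x100      # secondary alignment
SUPPLEMENTARY_FLAG = 0x800  # supplementary alignment
SKIP_FLAGS = UNMAPPED_FLAG | SECONDARY_FLAG | SUPPLEMENTARY_FLAG


def count_reads_per_reference(sam_lines: Iterable[str]) -> Counter:
    """Count primary mapped alignments for each reference name (RNAME)."""
    counts: Counter = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue
        flag, reference = int(fields[1]), fields[2]
        if flag & SKIP_FLAGS:  # ignore unmapped and non-primary records
            continue
        counts[reference] += 1
    return counts


# Made-up alignments against two hypothetical viral references.
example = [
    "r1\t0\tvirusA\t5\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tvirusA\t9\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tvirusB\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t256\tvirusB\t3\t0\t4M\t*\t0\t0\t*\t*",
]
counts = count_reads_per_reference(example)
for reference, n in counts.most_common():
    print(reference, n)
```

Skipping secondary and supplementary records keeps each read from being counted more than once, which matters when viral references share similar regions.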

ADD COMMENT • link 11.7 years ago by Sukhi Singh 11k

0

I might map reads to the host genome first, take the unmapped reads (which I assume are more likely to be of viral origin), and then map those with Bowtie/BWA to the viruses as a separate indexed database.

ADD REPLY • link 11.7 years ago by User 59 13k

0

Yeah, considering the low homology between host and viruses, that absolutely makes sense. Alternatively, one can take the virus-mapped set of reads and map it to the host to see how much is lost. But the benefit of doing it the other way round, as you suggest, is that you can take the unmapped set, map it to a number of different genomes, and count the hits for each.

ADD REPLY • link 11.7 years ago by Sukhi Singh 11k

score 1 · Answer 4 · 2012-08-22

11.7 years ago

seidel 11k

You might check out Joe DeRisi's work (UCSF). He's been doing this kind of thing for a long time, and discusses a lot of the caveats in various publications.

Expression vectors should be easy to filter out. Are you also filtering out human sequences prior to BLAST? Have you considered incorporating BLAT, or creating some kind of alignment index from the viral data set so that you can easily identify matching viral reads? Is your painfully slow part running BLAST? Or filtering the BLAST results?
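As one possible way to filter the results, the sketch below parses BLAST tabular output (the default 12 columns of `-outfmt 6`) and keeps only hits that pass identity, alignment length and e-value thresholds, while dropping subjects on an exclusion list such as known vectors. The thresholds, subject IDs and function name are illustrative assumptions, not recommended values.

```python
from typing import Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pident: float
    length: int
    evalue: float
    bitscore: float


def filter_hits(
    lines: Iterable[str],
    min_pident: float = 90.0,      # assumed threshold
    min_length: int = 50,          # assumed threshold
    max_evalue: float = 1e-10,     # assumed threshold
    excluded_subjects: frozenset = frozenset(),
) -> Iterator[Hit]:
    """Yield hits from default 12-column BLAST tabular output that pass all filters."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        hit = Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]), float(f[11]))
        if hit.subject in excluded_subjects:
            continue
        if (hit.pident >= min_pident and hit.length >= min_length
                and hit.evalue <= max_evalue):
            yield hit


# Made-up rows; the subject IDs are placeholders, not real accessions.
rows = [
    "read1\tvirus_ref_1\t98.0\t75\t1\t0\t1\t75\t100\t174\t1e-30\t130",
    "read2\tvirus_ref_2\t82.0\t30\t5\t0\t1\t30\t10\t39\t0.01\t40",
    "read3\tvector_1\t100.0\t75\t0\t0\t1\t75\t1\t75\t1e-35\t139",
]
kept = list(filter_hits(rows, excluded_subjects=frozenset({"vector_1"})))
for hit in kept:
    print(hit.query, hit.subject, hit.pident, hit.evalue)
```

An exclusion list only helps once the vector entries are known. Removing them from the database before searching avoids the problem at the source.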

ADD COMMENT • link 11.7 years ago by seidel 11k

0

I have already filtered out the human transcriptome. BLASTing the leftover reads against the viral BLAST db is definitely slow, but the slowest part is working out how to filter all the garbage that comes out.

ADD REPLY • link 11.7 years ago by Wayne ★ 1.0k

score 1 · Answer 5 · 2014-02-06

I would download the reference sequence collections for your virus of interest or some set of target virus species. From there compile your own blast database and align your reads against this database. By limiting the search space to only the reference sequences of species of interest you speed the process up and avoid having to do any tedious BLAST output file parsing and filtering.

This will cut down on published sequences of specific genes, clinical isolates and so on. Many species of virus have been sequenced time and time again, either for the primary literature or from clinical isolates. If you have an ortholog to one such species, you're going to have a huge pile of messy results. For now, trim the excess out. If you're interested in finer-grained detail at the subspecies level, you can go back to these once you know what species you're working with.

Another reason to select target species is to avoid confounding results due to orthologous regions or genes between species of virus. If your interest is in virus-driven oncogenesis in humans, plant and fish viruses are probably of little interest. Additionally, other classes of human viral pathogen should be easy to exclude. I doubt you would have to consider CCHFV, Machupo or Nipah as potential oncogenic viruses. If you are worried about non-traditional oncogenic viruses being present, try to come up with candidates. I see zero reason to include every species of virus for which sequence data is available.

This approach also solves the expression vector problem.
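A rough sketch of restricting a reference FASTA to target species before building the database is shown below. It assumes species names appear in the FASTA header lines. The headers, sequences and target names are invented for illustration, and matching on header text is only a crude stand-in for proper taxonomic filtering.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_targets(
    records: Iterable[Tuple[str, str]], target_names: List[str]
) -> Iterator[Tuple[str, str]]:
    """Keep records whose header mentions any target name (case-insensitive)."""
    wanted = [name.lower() for name in target_names]
    for header, seq in records:
        lowered = header.lower()
        if any(name in lowered for name in wanted):
            yield header, seq


# Invented headers and sequences; ref3 stands in for an expression vector.
fasta_text = """>ref1 Human papillomavirus type 16, complete genome
ACGTACGT
ACGT
>ref2 Tobacco mosaic virus, complete genome
TTGACCA
>ref3 Cloning vector pExample
GGGCCC
"""
targets = ["human papillomavirus", "hepatitis b virus"]  # illustrative candidates
selected = list(select_targets(read_fasta(fasta_text.splitlines()), targets))
for header, seq in selected:
    print(f">{header}\n{seq}")
```

Because only the chosen species are written out, vector and unrelated viral entries never reach the database, so they cannot show up as hits.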

I would really avoid trying to download things with Entrez queries and Biopython; you'll end up with piles of stuff you don't care about. The link below lets you filter viruses taxonomically and by host, and download sequences in bulk: https://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi

Worst case scenario you can do some manual hunting. I know everyone here wants to write a script (me included), but sometimes it really is faster just to manually hunt down some sequences.