Question: Significance of reads mapping to Viruses - FusionCatcher
3.9 years ago, Joel TM50 wrote:

Good day,  

I am currently analyzing RNA-seq data from 16 lung cancer tumor samples. While testing for fusions with different algorithms, I added the parameter that makes FusionCatcher look for reads mapping to viral genomes, just because I could. It uses the following DB: , and in almost all of my patients I got reads (ranging from 50 to 700) mapping to the same 3-4 viral genomes. I am looking for advice and insight into what is available to me if I want to make sense of all this and confirm or refute that information. If anybody has gone through the process, I would greatly appreciate your input.

Have a great day,


virus rna-seq fusion reads • 1.9k views
ADD COMMENTlink written 3.9 years ago by Joel TM50

What kinds of viruses?

Some viruses are known to cause cancers; others might be normal flora, could have been introduced during collection, or could simply have been in the air when the patient was breathing.

Lastly, is fusionCatcher the right tool for this?

If you're interested in exploring what viruses might be there, you might want to see how these reads line up. If all of the hits are on a single viral gene, I would say you're seeing a false positive.
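One rough way to check how the reads line up is to compute what fraction of the viral genome they cover. Below is a minimal Python sketch, assuming the alignments have already been reduced to (start, aligned length) pairs with 0-based starts; the function name and the example numbers are illustrative and not taken from the thread.

```python
def covered_fraction(alignments, genome_length):
    """Return the fraction of genome positions covered by at least one read.

    alignments: iterable of (start, length) pairs, 0-based start (assumed format).
    """
    covered = [False] * genome_length
    for start, length in alignments:
        # Clip each alignment to the genome boundaries.
        for pos in range(max(start, 0), min(start + length, genome_length)):
            covered[pos] = True
    return sum(covered) / genome_length


# Hypothetical example: an 8,000 bp genome with all reads piled on one 300 bp window.
piled_up = [(1000, 100), (1050, 100), (1200, 100)]
print(covered_fraction(piled_up, 8000))
```

A value close to zero with many reads means the hits are stacked on one small region, which fits the false-positive scenario; reads spread along the whole genome would give a much larger fraction.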

If I had to wager, I'd say these were false positives. Depending on how the library was prepared, you should be able to pick up replicating viruses via their viral mRNAs. If you're talking about RNA viruses, you would have genomes in there as well.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by pld4.8k

Thank you for your insight. I too took it with a grain of salt. Here is an example of a recurring "virus" :


As this is not my area of expertise, I welcome any comments from the community. I tested a few reads using UCSC, and some map to 4-5 places in the mouse genome (on different chromosomes) over 60 bp, whereas they map over only 22 bp on the human genome.
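If the same reads were realigned to both genomes with an aligner that writes SAM, the aligned span could be read off the CIGAR string instead of checked by eye. A small Python sketch under that assumption; the CIGAR strings below are made up to mirror the 60 bp versus 22 bp observation and assume 100 bp reads.

```python
import re

# One CIGAR operation: a count followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_bases(cigar):
    """Count read bases aligned to the reference (M, = and X operations)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=X")


# Hypothetical hits for one read: mostly aligned on mouse, mostly clipped on human.
print(aligned_bases("60M40S"))
print(aligned_bases("22M78S"))
```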

I guess the ultimate question is: should I put weight on these outputs, and if not, what would be the best way to find out about viruses present in the samples? Plain read alignment against viral genomes? I could also just open a new thread.

ADD REPLYlink written 3.9 years ago by Joel TM50

I would suggest mapping all of your reads (not just the ones caught by FusionCatcher) to that genome and the others with BWA. I'm not sure how FusionCatcher calls things; it might have missed some data.

As far as validating it, I'd start out with PCR. If that turns up positive, I'd then suggest checking again with Westerns; if both turn out positive, it is probably time for EM (assuming you have tissue). You'd also want matched healthy tissue and tissue from cancer-free individuals.

This only shows that the virus is present; showing that it has anything to do with cancer, or is even associated with cancer, is a much bigger undertaking. However, the genome above was reported to have been found in a retrovirus in a mouse myeloma cell line, so I think it is at least interesting enough to give (viral genome) PCR a shot.

Was this library random hexamer or polyA primed? That could have greatly impacted the amount of viral RNA that ended up getting sequenced.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by pld4.8k

Thank you, this has confirmed a few things. We do have the tissue from each patient, and PCR is no problem. The person who prepared the libraries told me they were random hexamer primed. Does that change anything?

ADD REPLYlink written 3.9 years ago by Joel TM50

Mostly curious. If I recall correctly, random hexamer priming can be better for getting virus + host in one shot. For some of our in vitro stuff, though, we've done two libraries to get host and virus; I think it was random hexamer to get the viral reads, but I'll have to check.

However, I think type C (gamma) RVs have ssRNA genomes that are reverse-transcribed before integrating. So if the virus were present you'd expect sufficient reads, although I'm used to higher titers than what might be going on in your case.


ADD REPLYlink written 3.9 years ago by pld4.8k

I would suggest NOT using BWA when dealing with RNA-seq data! BWA is not splice-aware and does not use known splice-site information, so it will always perform worse on RNA-seq reads than aligners that make use of known splice sites!

ADD REPLYlink written 3.9 years ago by enxxx23230

I think OP is attempting to figure out if the issue is worth exploring further. OP isn't attempting to measure expression of viral mRNAs, but wants to tell if there's actual live virus in these samples, or at least see if there are enough reads available to warrant further pursuit of that issue.

Given that it sounds like OP is after retroviruses, BWA is more than sufficient for aligning RNA-Seq reads to viruses with RNA genomes.

In general, seeing viral genomic RNA is a much stronger sign than seeing reads line up against viral mRNAs, especially any non-coding regions. I'd wager this is especially true for retroviruses, where replication cannot occur without generation of the RNA genome.


ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by pld4.8k

BWA is still the wrong tool here, because one needs to create a synthetic genome composed of the human genome plus the viral genomes. This is needed to make sure that the aligner finds the best alignment for each read. More specifically, for each read the aligner needs to decide where it maps better: on the virus or on the human genome?

It is therefore essential to use an aligner that is aware of exon/intron information, and BWA was simply never built for this. OP's samples contain RNA from human (rat/mouse?) and some viruses!
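The per-read decision described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular aligner does it internally; it assumes each read already has one alignment score per reference (for example the SAM AS tag), and the labels and scores are hypothetical.

```python
def best_reference(hits):
    """Pick the reference with the highest alignment score for one read.

    hits: dict mapping a reference label (e.g. "human", "virus") to an
    alignment score such as the SAM AS tag (assumed input format).
    Returns None when there are no hits or the top score is tied,
    i.e. when the read cannot be assigned unambiguously.
    """
    if not hits:
        return None
    ranked = sorted(hits.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Hypothetical scores for a single read.
print(best_reference({"human": 22, "virus": 60}))
```

Aligning against one combined reference lets the aligner make this choice itself; the sketch only shows what "maps better" means once both scores are available.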


ADD REPLYlink written 3.9 years ago by enxxx23230

When I have samples with both viral and host RNA, I just filter out viral reads by what BWA did or didn't map to the viral genome. I never have issues with that.
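That kind of filter can be sketched in Python as follows, assuming the reads were aligned to the viral genome only and the output is plain-text SAM. The function name is made up for illustration; the only SAM detail relied on is FLAG bit 0x4, which marks an unmapped record.

```python
def split_by_mapping(sam_lines):
    """Split read names into (mapped, unmapped) sets using SAM FLAG bit 0x4."""
    mapped, unmapped = set(), set()
    for line in sam_lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & 0x4:
            unmapped.add(name)
        else:
            mapped.add(name)
    # A read name with at least one mapped record counts as mapped.
    return mapped, unmapped - mapped


# Hypothetical two-read SAM snippet.
sam = [
    "@SQ\tSN:virusA\tLN:8000",
    "read1\t0\tvirusA\t101\t60\t50M\t*\t0\t0\t*\t*",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
viral, other = split_by_mapping(sam)
print(sorted(viral), sorted(other))
```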

ADD REPLYlink written 3.9 years ago by pld4.8k

The best way to deal with this is to create a synthetic genome which contains the human genome (if your sample is from human) and all the viruses/phages/bacteria genomes from . This "new" synthetic genome would be around 6 GB in size. If your samples come from an RNA experiment, it is also recommended to use an aligner that has been developed for RNA alignment (for example STAR, TopHat, HISAT, etc.). BWA has been developed mainly for DNA and does not take advantage of known splice sites.

Did you look at VCaP cells? VCaP cells are known to contain murine viruses. See:
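Building such a combined reference amounts to concatenating FASTA files. A minimal Python sketch, assuming both genomes are available as local FASTA files; the file names, labels, and tiny sequences are placeholders, and prefixing the sequence names is just one way to keep host and viral hits easy to tell apart.

```python
import os
import tempfile


def combine_fasta(sources, out_path):
    """Concatenate FASTA files into one, prefixing sequence names with a label.

    sources: list of (label, path) pairs; the labels are arbitrary tags.
    """
    with open(out_path, "w") as out:
        for label, path in sources:
            last = "\n"
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        # Tag the header so the source of each hit stays visible.
                        line = ">" + label + "_" + line[1:]
                    out.write(line)
                    last = line
            # Make sure the next file starts on a fresh line.
            if not last.endswith("\n"):
                out.write("\n")


# Tiny made-up sequences, only to show the call.
with tempfile.TemporaryDirectory() as tmp:
    host = os.path.join(tmp, "host.fa")
    virus = os.path.join(tmp, "virus.fa")
    combined = os.path.join(tmp, "combined.fa")
    with open(host, "w") as fh:
        fh.write(">chr1\nACGT")  # no trailing newline on purpose
    with open(virus, "w") as fh:
        fh.write(">virusA\nGGCC\n")
    combine_fasta([("human", host), ("virus", virus)], combined)
    with open(combined) as fh:
        combined_text = fh.read()

print(combined_text)
```

The combined file would then be indexed with whichever aligner is chosen.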

ADD REPLYlink written 3.9 years ago by enxxx23230

These were human lung cancer tumor samples, so primary tissues (FFPE?). One thing about the NCBI virus genome list is that it is massively incomplete or only contains one/a few of the known strains/species of a virus/genera of viruses.

It would be interesting to see if any of the reads map to HEVs.

ADD REPLYlink written 3.9 years ago by pld4.8k

Actually, the NCBI virus/bacteria/phage genomes, from here: , include the most important viruses and their strains. For example, they include the most important hepatitis viruses, Epstein-Barr viruses, HIV, etc.

ADD REPLYlink written 3.9 years ago by enxxx23230

I think calling the data on that page the "most important" is seriously wrong. That page contains what you could call canonical isolates/species, but it does not contain the most important ones.

Even then, what is meant by "most important": the isolate that killed the most people, the one that causes the most disease, the first one that was discovered, the one that causes the greatest healthcare burden, the one that researchers stuck with early on, or something else I missed?

Not only is intraspecific variation very important for understanding the nature of viruses and virus-host interactions, but the list is incomplete.

ADD REPLYlink written 3.9 years ago by pld4.8k

I think it is completely wrong to call the NCBI virus genome list MASSIVELY incomplete (as you wrote)! It has a large selection of viruses, and there will always be a virus that one can claim is missing.



ADD REPLYlink written 3.9 years ago by enxxx23230

Interesting! I am looking forward to hearing what others have to say.

ADD REPLYlink written 3.9 years ago by Chirag Nepal2.2k

Joel, should you not map the sequences directly to the viral genomes? If I understand correctly, you mapped (with TopHat?) to the human genome, and then FusionCatcher found that the reads mapped to viral genomes? Can you please elaborate?

ADD REPLYlink written 3.9 years ago by Chirag Nepal2.2k

FusionCatcher uses a bunch of algorithms. Indeed, it uses GMAP, Bowtie, BLAT, STAR, etc. to map the reads to the human genome. If you add "-V" to your command line, it maps reads to known viral genomes (from what I understood).
Here is what the manual says: "If it is set then the SAM alignments files of reads mapping on viruses genomes are saved in the output directory for later inspection by the user". I am trying to make sense of them and understand how to interpret them.
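A first pass over SAM files like these could simply tally reads per viral genome. A minimal Python sketch, assuming plain-text SAM input; it does not depend on how FusionCatcher names or lays out its output, and the function name is made up for illustration.

```python
from collections import Counter


def reads_per_reference(sam_lines):
    """Count primary mapped alignment records per reference name (SAM RNAME)."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
        if flag & 0x904:
            continue
        counts[rname] += 1
    return counts
```

Usage would be along the lines of `reads_per_reference(open(path))` for each saved SAM file, where `path` is whatever file the run produced. With paired-end data each mate is counted separately, so the totals are alignment records rather than fragments.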

ADD REPLYlink written 3.9 years ago by Joel TM50

It is OK to map the reads directly to viral genomes! FusionCatcher is not using TopHat, as far as I know!

ADD REPLYlink written 3.9 years ago by enxxx23230