While trying to figure out why an unusually low percentage of my right pair (R2) RNAseq reads were aligning with tophat (~45%), I noticed something strange in my fastqc1 output2: an extremely high over-representation of homopolymers (any of A,C,G, or T) at the 3' end.
These reads had already been processed to remove illumina adaptor sequences and low-quality bases from the ends; as such these bases are high-quality calls. To check if these sequences were affecting tophat, I lopped off 20 bases from the 3' end, and read mappability jumped to 85% as per usual, confirming that these are untemplated bases that bowtie1, an end-to-end mapper, cannot handle (I am using bowtie1 instead of bowtie2 as this is a gene fusion pipeline).
This is a batch-specific effect: I have observed the same issue in all samples from a specific sequencing run (i.e. they were all sequenced on the same flow cell). In addition, the base quality at the 3' end was especially poor in these samples, necessitating a first round of quality trimming, prior to which the tophat mapping rate was even lower (<10%). The issue is also specific to the right pair reads -- left pairs are fine. I'm just wondering if anyone has observed these homopolymeric runs before and has been able to track down their source.
If this is a batch specific effect perhaps you should ask this question to the person who prepared those libraries (hopefully he has full protocols for both cases to compare).
Library prep and sequencing was done by the same facility for all batches, just spread out over a few months. I would guess that the same protocol was used in each case, but just that something strange might have happened in this particular case to cause this effect. I will certainly bring this to their attention, but I posted here thinking this might be common enough that someone would have already experienced the issue and figured out the cause.
Dear Mikhail,
I'm currently working with Oncofuse to help characterize my fusion calls made by Tophat-Fusion. I was reading the paper and noticed that the first author's name sounded familiar and reminded me of this post from awhile back, which is fortuitous because I am having a problem with Oncofuse that I hope I can ask you about (apologies for the inappropriate use of these forums but I thought it was worth a shot). The problem is that for a given fusion event, Oncofuse seems to consider both the fusion and its reciprocal. However in RNAseq, usually there is only evidence for one of these fusions actually being expressed (i.e. spanning and encompassing reads only map across one or the other possible fusions). This leads to what appears to be mistakes by Oncofuse where it claims a very high "driver probability" but for the wrong fusion, i.e. the reciprocal fusion for which expression evidence doesn't exist. Is this intended behaviour, a bug, or am I misinterpreting the output? Again, I apologize for using this forum to contact you directly. I would be very grateful for any advice.
Sincerely,
Ethan
Hello Ethan!
Sure, no problem to ask here, just keep in mind that there is a dedicated thread for Oncofuse, Oncofuse: Prediction Of Driver Gene Fusions From Ngs Data. You are correct that Oncofuse does not take into account the sample expression data, yet it is able to distinguish a fusion and its reciprocal. For example the BCR-ABL1 gets P-value of 1e-8, while ABL1-BCR gets P=0.88, according to a simple check with this data (coord format):
I cannot rule out false-positives of course, so you might want to pre-filter according to fusion expression level, e.g. number of encompassing/spanning reads. If you still feel that this is an erroneous behavior, you can mail me a sample of your gene list so I can check it.
Hey, thanks for your response. I have a follow up but I will post it on the dedicated thread. Cheers!