Suppose I have already run tophat on an Illumina data set consisting of paired (2x100) reads as follows:
tophat -p 4 --transcriptome-index /transcripts/all_sequences --mate-inner-dist 150 --mate-std-dev 50 -o /tmp/fusion_test/ /ref_genome/all_sequences.fa read1.fastq read2.fastq
Would it be inadvisable to filter this BAM, remove all reads that are obviously unlikely to provide fusion evidence (e.g. reads that map unambiguously and with high quality to the reference genome), and feed the resulting, much smaller BAM into a new tophat --fusion-search command? That way tophat does not have to go through the process of identifying unmapped reads again.
Based on the tophat-fusion documentation ('getting started' and 'manual'), there would be a few differences between the way my initial alignments were generated and a tophat-fusion run started from scratch with the entire set of reads:
- The BAM I already have was generated using Bowtie2 instead of Bowtie1. The docs suggest using Bowtie1 because it is faster, but if I have already committed the computational resources to this alignment, are there any other issues with respect to Bowtie2 and fusion detection?
- The tophat-fusion docs seem to indicate that a larger --mate-std-dev is desirable. The example in the docs is 80 compared to our 50.
Just to be clear, I am not proposing to use these alignments directly for fusion detection, but rather to create fastq files that contain only the unmapped (or poorly mapped) reads that might be useful in fusion detection. If tophat-fusion expects chimeric reads to be drawn from a population of normal transcriptome reads and performs some stats, considers read depth near the putative fusion breakpoints, etc., then this strategy could be problematic.
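To make the filtering idea concrete, here is a minimal sketch of how I might select the read pairs to carry forward, working on SAM-formatted text (for example the output of `samtools view -h`). Everything specific in it is my own assumption rather than anything tophat-fusion prescribes: the function names, the MAPQ cutoff of 30, and the rule that a pair is dropped only when both mates are mapped, flagged as properly paired, and at or above the cutoff. The cutoff in particular would need checking against the MAPQ values the aligner actually emits.

```python
# Standard SAM FLAG bits (from the SAM specification).
PROPER_PAIR = 0x2
UNMAPPED = 0x4
FIRST_IN_PAIR = 0x40
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def candidate_read_names(sam_lines, min_mapq=30):
    """Return the names of read pairs worth re-aligning for fusion search.

    A pair is dropped only if BOTH mates have a primary alignment that is
    mapped, properly paired, and has MAPQ >= min_mapq (an assumed cutoff).
    Every other pair is kept, so mates always stay together.
    """
    seen = set()
    confident = {}  # read name -> set of mates (1 or 2) that aligned confidently
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        name, flag, mapq = fields[0], int(fields[1]), int(fields[4])
        seen.add(name)
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # judge each mate by its primary alignment only
        if flag & UNMAPPED or not flag & PROPER_PAIR or mapq < min_mapq:
            continue  # this mate is not confidently placed
        mate = 1 if flag & FIRST_IN_PAIR else 2
        confident.setdefault(name, set()).add(mate)
    return {name for name in seen if len(confident.get(name, ())) < 2}
```

The returned names could then be used to pull both mates out of the original read1.fastq and read2.fastq, so the new fastq files stay properly paired.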
It seems from the manuscript that this may indeed be a problem: TopHat-Fusion: an algorithm for discovery of novel fusion transcripts
"TopHat-Fusion prefers reads that uniformly cover a 600-bp window centered in any fusion point" (manuscript Figure 5).
Many of these reads will have been filtered out in the process I describe above.
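As a rough illustration of that concern, the sketch below counts how many 100-bp bins of a 600-bp window around a breakpoint are touched by at least one read, before and after filtering. This is not TopHat-Fusion's actual scoring scheme; the function name, the binning, and the single linear coordinate system along the fused transcript are all assumptions I made up purely to show the effect.

```python
def covered_bins(read_starts, breakpoint, read_len=100, window=600, bin_size=100):
    """Count bins in a window centered on `breakpoint` hit by at least one read.

    Reads are treated as half-open intervals [start, start + read_len) on a
    single hypothetical coordinate axis running along the fused transcript.
    """
    window_start = breakpoint - window // 2
    covered = 0
    for i in range(window // bin_size):
        bin_lo = window_start + i * bin_size
        bin_hi = bin_lo + bin_size
        # A read overlaps the bin if the two half-open intervals intersect.
        if any(s < bin_hi and s + read_len > bin_lo for s in read_starts):
            covered += 1
    return covered


# Hypothetical read start positions around a breakpoint at position 1000.
all_reads = [700, 800, 900, 950, 1000, 1100, 1200]
fusion_spanning_only = [950]  # what might survive aggressive filtering

print(covered_bins(all_reads, 1000))             # 6 of 6 bins covered
print(covered_bins(fusion_spanning_only, 1000))  # 2 of 6 bins covered
```

If the real scoring rewards even coverage across the window, a filtered input like the second one would presumably look much weaker than the same fusion in the full data set.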
Similarly, the scoring/ranking of fusions would likely be affected:
"A scoring scheme of how well distributed reads are around a fusion point; these result scores are used to sort the list of candidate fusions." (manuscript Figure 6)
One final lingering question I have relates to the option '--keep-fasta-order'.
"In order to sort alignments in the same order in the genome fasta file, the option can be used. But this option will make the output SAM/BAM file incompatible with those from the previous versions of TopHat (1.4.1 or lower)" (TopHat documentation).
Why is this option recommended for tophat-fusion?