This is a very long question. But hopefully some of you will find it interesting.
I was given several hundred BAM files representing samples from cancer patients. Judging by the read names and quality scores, these were HiSeq PE-50 reads. The BAMs were created using MapSplice and RSEM.
I was asked to take these BAM files and realign the reads with parameters such that there would be no discordant alignment and no mismatches. The investigator also wanted to omit any reads which mapped to more than one location under these criteria. The read names on the original BAMs all ended with one or more of
/2, indicating which one of the pair the read belonged to. All of the BAM to FASTQ conversion scripts I tried didn't like this convention, so I wrote a quick script to trim the
/2 off of each read name. I sorted the BAMs by read name and was then able to create FASTQ.
After some trial and error I settled on the following parameters for TopHat2 2.0.9:
--no-mixed --max-intron-length 800000 --no-discordant --output-dir [output] --max-multihits 2 -p 8 --b2-very-sensitive --library-type fr-unstranded [genome] [fastq1] [fastq2]
--max-multihits 2, my intention was to assess the proportion of reads that mapped to more than one location, examine them in IGV and perhaps calculate specific metrics based on what I found, and then filter them out in accordance with initial requirements afterwards.
While the original TCGA BAMs had some messy alignments (inconsistent read-pair orientation, lots of singleton and discordant alignments, highly-inconsistent insert size), the alignments I got with the Tophat2 parameters above resulted in some nice-looking alignments with an alignment rate of 94%.
The investigator and his team were happy with what they saw, until one person noticed one particular splice junction that wasn't being replicated in the TopHat alignments. It was located at chr8:128,427,270 - 128,427,687. Here's a screenshot from IGV which shows what I'm talking about. The top half shows my alignments. The bottom half shows the original alignments.
I examined the CIGAR strings for the reads and saw that they were mapping perfectly. Each had a CIGAR string akin to
X perfectly matched bases, an insert of
Y bases, and
Z perfectly matched bases. The original program didn't create an alignment score, but the mapping quality scores were typically 66.
I created a text file with the read names, and used this to search the TopHat2
unmapped.bam files. To my surprise, all of the reads were turning up in
Given that the unmapped.bam file was very small in relation to the accepted_hits.bam file, (150 MB vs 8 GB) I decided to take a brute-force approach with these mysterious reads. I converted the unmapped.bam read pairs into two FASTQ files. I ran these FASTQ files through TopHat with increasing levels of leniency. I also tried aligning the reads as single reads instead of paired-end. Some of the settings I used include:
Being more lenient by allowing discordant & mixed alignments and using
--b2-very-fast. I also tried leaving off the
--max-intron-length 800000 --output-dir [output] --max-multihits 2 -p 8 --b2-very-fast --library-type fr-unstranded [genome] [fastq1] [fastq2]
This resulted in at 38% overall read alignment rate
Mapping the FASTQ files separately as single-read (I also used --b2-very-fast):
--max-intron-length 800000 --output-dir [output] --max-multihits 2 -p 8 --b2-very-sensitive [genome] [fastq1]
When I do this with lenient settings, 61.2% of the forward reads and 39.2% of the reverse reads from the unmapped reads align.
Supplying a GTF file:
--no-mixed --max-intron-length 800000 --no-discordant --output-dir [output] --max-multihits 2 -p 8 --b2-very-sensitive --library-type fr-unstranded -G [UCSC genes.gtf] [genome] [fastq1] [fastq2]
This resulted in a 39.6% overall read alignment rate.
I looked at each of the accepted_hits files in IGV, but in every case the read which originally mapped across the splice junction in the original BAM would not replicate that splice junction. For the single-read files, I saw that the read which was the mate of the read aligning across the splice junction of interest mapped to the genome, usually in the same location as the original alignments. Rarely, a read which originally mapped to the splice junction of interest would map elsewhere in the genome, but typically the read would end up in
I'm totally baffled. I verified that each base of these reads which TopHat throws out is a perfect match to the reference. While I have found a few other splice junctions in the original TCGA BAM files which aren't replicated in the TopHat alignments, the vast majority of splice junctions are, and these matches have a wide range of size.
Does anyone have any idea what could be going on here?
UPDATE 11 October
I excised this 2000-base region of interest containing the mystery splice junction from the reference genome and aligned the original set of reads (not just the unmapped) against this tiny genome snippet in single-read mode using the following command:
--segment-length 15 --output-dir [output] --b2-very-sensitive ./chr8roi/chr8roi [fastq1/2]
I found that, when I combined the results of both alignments, most of the reads with the original splice junction of interest mapped to this ROI and included the splice! But these are the same reads that get thrown away by Tophat when I align against the full reference! I even double-checked by running the same command against the full reference:
-p 8 --segment-length 15 --output-dir [output] --b2-very-sensitive [genome] [fastq1/2]
Still confused. Surely multi-threading isn't the problem? I'm going to do a single-threaded run over the weekend and see if that makes any difference.