Tophat vs Bowtie alignment rates
6
2
2.3 years ago
am3 ▴ 70

Hello! I hope somebody can provide some insight into TopHat and Bowtie2 behavior. I have a data set that I aligned to a reference using Bowtie2 and Tophat, both running default settings. When I used Bowtie, the alignment rates were >90% for all samples. With Tophat, however, the rates were all around 45-55%. Since Tophat uses Bowtie, I'm not sure why the resulting alignment rates are so much lower. My best guess is that when Tophat calls Bowtie, it uses different settings than the default settings for a user using Bowtie directly, but I haven't been able to figure this out from the respective manuals. Could anyone explain why this might be? I'm curious because there's apparently something about my reads that is very sensitive to differences in aligners used, and I want to know what it is.

I'm not sure what additional information is needed to answer this question but I'm ready to provide it. Thank you!

bowtie2 tophat alignment • 1.9k views
2

Hello am3930,

If you want to compare two genome aligners you can take a look at BWA and Bowtie2

2

I don't want to over-emphasize my point, but I strongly disagree with the conclusion that you shouldn't use TopHat (and that tweet is actually about TopHat1, not TopHat2).

I have some points about that saved here:

(also, for what it is worth, Lior didn't add a comment to that post, but he did tweet about it, considerably increasing viewership)

1

I will be brief, as I do not want to argue all day either.

As you pointed out in your blog, in some rare scenarios TopHat2 can still be useful.

My point is that most posts on Biostars are about global RNAseq analysis using TopHat, which is not the best tool in terms of memory usage, running time, and accuracy for general RNAseq experiments:

https://www.nature.com/articles/nmeth.2722

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5792058/

Before the edit of the thread, OP didn't mention whether they were using TopHat2, and they still have not mentioned whether they have DNA or RNA reads, or whether they want to discover novel splicing events, do a differential expression analysis, etc.

Lior Pachter's tweet is about TopHat version 1 and in the same vein created TopHat2, HISAT and HISAT2

You can also read on the TopHat website:

Please note that TopHat has entered a low maintenance, low support stage as it is now largely superseded by HISAT2 which provides the same core functionality (i.e. spliced alignment of RNA-Seq reads), in a more accurate and much more efficient way.

I'm more inclined to believe someone who wrote the tool and knows exactly how it works than someone who uses TopHat and gets good results in specific conditions (no offense at all).

1

With genuine thanks to everyone here for taking the time to answer, I'd like to gently suggest that we move away from discussion of the "best" tools in some global sense. I'm not looking for advice on how to do my final analysis. As I explained, I'm motivated by curiosity. My goal in THIS instance is to understand why two tools based on the same underlying alignment method produce such wildly different alignment rates. Thank you!

1

I think it is important to respect am3's wishes, so I will try to be brief and not have another comment about "best" aligners on this thread:

1) You are right that this is posted on the TopHat2 website. I think it is important that the code for TopHat2 still be available (and you can find useful applications that were not explicitly designed by the developers; plus, I even had a period of time when I recommended people not use COHCAP, so I have disagreed with myself as a developer). However, I should be more careful not to imply that the developers agree with my opinion.

2) I think the scenarios where TopHat2 is not useful are actually limited. For example, I have used TopHat2 for all the labs mentioned in this acknowledgement (which I guess could arguably be like a placeholder until there can be some sort of more formal paper; however, I apologize that you need to scroll down to see the long acknowledgement). If something did seem off about the TopHat2 alignment for an important gene, I always tested a STAR alignment (but I never found that changed the trend of expression for any of the genes that I have checked, at least so far). So, if you have 50 bp SE reads, I think use of TopHat2 is usually quite reasonable.

3) Lior is not an author on TopHat2 or HISAT papers (only on TopHat1, and he isn't even acknowledged in the HISAT paper). However, again, I should be focusing more on my first-hand experience and not implying that other people necessarily agree with me (so, I apologize about that).

4) I'll try to take some time to look into the Engström et al. paper more closely, but I already have a response about the Baruzzo et al. paper: the simulated data T2 and T3 categories are less typical of what you would actually observe in an experiment. In terms of showing the alignment rate is lower in more divergent sequences, it actually complements my argument that TopHat2 can be useful as a more conservative alignment (if you want to avoid alignments from unintended sequence / contamination).

5) (update) I agree that there can be differences where TopHat2 may not be as good of an option with paired-end data. I can also believe that there are examples where TopHat2 is not the best option for splicing analysis. --> However, I most frequently encounter ~50 bp SE reads, so perhaps that explains some differences between my observations and the Engström et al. paper (although I actually think Figure 6 in that paper looks pretty good for Tophat2).

(update) That said, thank you very much for sharing those links to reviews. For example, it may not be as relevant for gene expression analysis (which is what I was emphasizing), but perhaps Figure 2b may be worth considering (although maybe GATK functions like SplitNCigarReads and GATK parameters like -dontUseSoftClippedBases can make TopHat2 and STAR results relatively more similar? Plus, to be fair, I would admittedly probably use a STAR alignment for an "initial" analysis of mutation calling for RNA-Seq data, but with alignment post-processing that I wouldn't use for gene expression analysis)

I also apologize that this wasn't so brief (which I realized after posting).

Likewise, I was also trying to give some indication of when I updated this comment later in the day (to avoid having a whole different comment that may not be as relevant to the question), so I apologize for the messiness.

0

Thank you, I'm aware of the improved methods that exist, but I'm asking the question to satisfy my curiosity now, to better understand both my data and the methods.

According to the manual, Tophat indeed uses bowtie2: https://ccb.jhu.edu/software/tophat/manual.shtml, so I don't think that's it.

I know the general principle of how Tophat works (using bowtie2 to align first, then finding splice sites in unaligned reads). What I don't understand is how the final rate of aligned reads can be so much LOWER using Tophat, which I naively understand to work like "run bowtie2 and then do some more work on the stuff that doesn't align with bowtie2."

0

Tophat uses bowtie. Tophat2 uses Bowtie2.

The biggest difference between what Tophat2 is doing and what Bowtie2 is doing is that Bowtie2 can do local or "end-to-end"/global alignment.

By default Bowtie2 uses local alignment, which means only part of the read needs to map. However, when TopHat2 uses Bowtie2, it instructs it to run in end-to-end mode. This means that the whole read needs to map from start to finish.

0

Thank you! I do indeed mean Tophat2. The manual just calls it Tophat (whereas the manual for bowtie2 calls it bowtie2 every time), which is why I was imprecise.

I thought the issue might be local vs. end-to-end, however the bowtie2 manual says it does end-to-end by default: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#end-to-end-alignment-versus-local-alignment

5
2.3 years ago
am3 ▴ 70

Hi everyone, I have an answer to this after consulting with one of the developers. Posting in case anyone here is interested or comes across this when googling.

The difference between how Bowtie2 runs by default and how Tophat2 calls Bowtie2 is in the minimum alignment score required to consider an alignment valid. In Bowtie2, this minimum defaults to -0.6 - 0.6 * read length. Tophat2 runs Bowtie2 with the alignment score minimum set to a constant of -14. So Tophat2 has a much more stringent criterion for which alignments it considers valid. (In Hisat2, the default minimum threshold is 0 - 0.2 * read length. With my samples, this resulted in alignment rates around 75% - right in between the Bowtie2 and Tophat2 results.)
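To make the three thresholds concrete, here is a minimal Python sketch that evaluates a `--score-min` style function string for a given read length. It assumes the function notation from the Bowtie2 manual (`C` constant, `L` linear, `S` square root, `G` natural log, each of the form a + b * g(read length)) and the settings quoted above. The helper names `min_score` and `max_mismatches` are made up for illustration, and the mismatch estimate assumes a penalty of 6 per high-quality mismatch (the Bowtie2 default maximum in `--mp`), ignoring gaps and Ns.

```python
import math

# Thresholds as quoted in this answer, written in --score-min notation.
SCORE_MIN = {
    "Bowtie2 default (end-to-end)": "L,-0.6,-0.6",
    "TopHat2 calling Bowtie2": "C,-14,0",
    "HISAT2 default": "L,0,-0.2",
}


def min_score(spec: str, read_length: int) -> float:
    """Evaluate a function string such as 'L,-0.6,-0.6' for one read length."""
    parts = spec.split(",")
    kind = parts[0].strip().upper()
    a = float(parts[1])
    b = float(parts[2]) if len(parts) > 2 else 0.0
    if kind == "C":  # constant: f(x) = a
        return a
    if kind == "L":  # linear: f(x) = a + b * x
        return a + b * read_length
    if kind == "S":  # square root: f(x) = a + b * sqrt(x)
        return a + b * math.sqrt(read_length)
    if kind == "G":  # natural log: f(x) = a + b * ln(x)
        return a + b * math.log(read_length)
    raise ValueError(f"unknown function type: {kind!r}")


def max_mismatches(threshold: float, penalty: float = 6.0) -> int:
    """Rough count of mismatches tolerated before the score drops below threshold.

    Assumes every mismatch costs `penalty` and that a perfect end-to-end
    alignment scores 0; real scoring also depends on base qualities and gaps.
    """
    return max(0, int(-threshold // penalty))


if __name__ == "__main__":
    read_length = 100  # illustrative value, not taken from the thread
    for name, spec in SCORE_MIN.items():
        threshold = min_score(spec, read_length)
        print(f"{name}: min score {threshold:.1f}, "
              f"~{max_mismatches(threshold)} mismatches tolerated")
```

Under these assumptions, a 100 bp read may lose about 60 points under the Bowtie2 default but only 14 under the TopHat2 setting, i.e. roughly 10 versus 2 high-quality mismatches, which would fit reads from populations that have diverged from the reference.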

The developer I communicated with was surprised that this parameter made such a drastic difference in alignment results, so it's worth paying attention to. This is likely because I'm dealing with diverse wild populations that might have significantly diverged from the reference genome.
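One way to check this explanation on your own data would be to rerun Bowtie2 directly with the TopHat2 threshold and see whether the alignment rate drops toward the TopHat2 figure. The sketch below only builds the command line; the index and file names are placeholders, and `bowtie2_command` is a hypothetical helper. TopHat2 also passes other options to Bowtie2, so this isolates the score threshold only and is not an exact reproduction of a TopHat2 run.

```python
import shlex


def bowtie2_command(index, reads_1, reads_2, out_sam, score_min=None):
    """Build a paired-end bowtie2 command, optionally overriding --score-min."""
    cmd = ["bowtie2", "--end-to-end"]
    if score_min is not None:
        cmd += ["--score-min", score_min]
    cmd += ["-x", index, "-1", reads_1, "-2", reads_2, "-S", out_sam]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; substitute your own index prefix and FASTQ files.
    default_run = bowtie2_command("ref_index", "R1.fastq", "R2.fastq", "default.sam")
    strict_run = bowtie2_command("ref_index", "R1.fastq", "R2.fastq", "strict.sam",
                                 score_min="C,-14,0")
    print(shlex.join(default_run))
    print(shlex.join(strict_run))
    # To actually run one of them (requires bowtie2 on PATH):
    #   import subprocess; subprocess.run(strict_run, check=True)
```

Comparing the "overall alignment rate" reported by the two runs would show how much of the gap the threshold alone accounts for.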

1
2.3 years ago
JC 12k

Tophat maps reads with Bowtie. Depending on the configuration, it first aligns the full reads to the genome/transcriptome; unmapped reads are then split into smaller segments, which it tries to map again in a spaced alignment (across introns). So these are not the same method, and the results can be different. Also, as Bastien mentioned, use something more recent and improved.
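As a toy illustration of the "split into smaller reads" step, the sketch below cuts a read into fixed-length segments. The segment length of 25 is an assumption based on the documented default of TopHat2's `--segment-length` option; the function name is hypothetical, and the way TopHat actually handles the leftover bases at the end of a read may differ from this simple chunking.

```python
def split_into_segments(read: str, segment_length: int = 25) -> list:
    """Cut a read into consecutive segments; the last one may be shorter."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [read[i:i + segment_length]
            for i in range(0, len(read), segment_length)]


if __name__ == "__main__":
    read = "ACGT" * 15  # made-up 60-base read
    for number, segment in enumerate(split_into_segments(read), start=1):
        print(number, len(segment), segment)
```

Each segment would then be aligned independently, and segments from one read that land far apart on the genome hint at an intron between them.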

1
2.3 years ago

As has already been pointed out, don't use Tophat.

But the probable reason for your difference is that Tophat uses Bowtie, and you are comparing it to Bowtie2.

Bowtie and Bowtie2 are not the same thing, they use different algorithms (Bowtie is a global aligner. By default Bowtie2 is a local aligner) and produce quite different results on default settings.

0

Maybe you already appreciate the embedded Tweet from Lior is about TopHat1, but I think the original user actually meant TopHat2.

In terms of TopHat2, I would strongly disagree with the claim that it shouldn't be used (which I mention above, but I also have a blog post with some collected points).

Nevertheless, thank you again for your contribution!

1
2.3 years ago

Are you working with single-end or paired-end reads? While it is not really a direct answer, maybe some of the points about the TopHat2 alignment may be more important for paired-end reads.

For example, I usually see alignment rates of ~90% with TopHat2 (with ~50 bp single-end reads). So, what you are describing is very different.

Is it possible that you have some non-genomic sequence in your reads? For example, what if you run FastQC on the reads and/or trim adapters (with a program like cutadapt, Trimmomatic, etc.)?

This may not actually solve your problem, but it is the best idea that I could think of at the moment.

0

Good questions - the reads are paired-end and have been trimmed and QC'ed. The reports say that of the reads that do map, most map concordantly, so I don't think there's anything weird going on with how it's treating the pairs, but I'm not sure what else I should check.

0

Ok - QC can sometimes be hard to absolutely define, but it sounds like having the paired-end reads is not the issue (given the reads have already been trimmed).

There could be things like a high polyA/polyT read count (depending upon your enrichment protocol) that would be different than trimming adapters, but I am guessing that you already checked for that.

You could try just an R1 alignment (and/or test FastQ Screen as another troubleshooting option). However, maybe it is better to see if anybody else can give feedback (or wait until I have another idea).

0

I would think that things like polyAs or other generally problematic sequences would be equally problematic in Bowtie and Tophat, no? Unless when Tophat calls Bowtie it calls it using some setting that drastically changes how such reads are treated (which is what I haven't been able to figure out).

0

I am not sure - I want to say I've seen a dataset where the lower TopHat2 alignment rate was due to having extra polyA reads, but I don't remember that for certain. I also specifically remember needing to use --local -X 2000 to increase the Bowtie2 alignment rate for ATAC-Seq data, which makes me agree that I would expect Bowtie2 to have a similar problem (instead of a much better alignment).

However, I also don't know what is causing the issue you are encountering, and that was the best thing that I could suggest.

1
2.3 years ago

You may want to consider using IGV to visualize the Bowtie2 alignment for some genes (like GAPDH).

For example, this might help answer questions like: Are the splice junctions well represented (in the Bowtie2 alignment)? Is there any evidence of substantial DNA contamination?

1
2.3 years ago

As far as I know, Tophat2 uses Bowtie2 in several stages. The initial stages (mapping to transcriptome and genome) are performed to get all the unmapped reads. The unmapped reads are then split and mapped separately to check whether they span an intron.

How bowtie2 is parameterized in Tophat2 is likely to be very specific to Tophat2 usage. One thing that comes to mind is how soft/hard clipping is handled. In your Bowtie2 .bam file, do you have a lot of soft-clipped reads? Maybe those are discarded in the Tophat2 .bam files because the clipping penalty is set higher?
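A quick way to answer the soft-clipping question is to scan the CIGAR strings of the mapped reads. The sketch below is a minimal pure-Python version that works on SAM text lines (for example, the output of `samtools view` on the .bam file); `soft_clip_stats` is a hypothetical helper and the example records are made up.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_stats(sam_lines):
    """Return (mapped_records, soft_clipped_records) for an iterable of SAM lines."""
    mapped = 0
    clipped = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        flag, cigar = int(fields[1]), fields[5]
        if flag & 0x4 or cigar == "*":  # unmapped record
            continue
        mapped += 1
        if any(op == "S" for _, op in CIGAR_OP.findall(cigar)):
            clipped += 1
    return mapped, clipped


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.6",
        "r1\t0\tchr1\t100\t42\t50M\t*\t0\t0\t*\t*",
        "r2\t0\tchr1\t200\t42\t5S45M\t*\t0\t0\t*\t*",
        "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    ]
    mapped, clipped = soft_clip_stats(example)
    print(f"{clipped} of {mapped} mapped records are soft-clipped")
```

Note that Bowtie2 only soft-clips in `--local` mode; in end-to-end mode every read base takes part in the alignment, so a Bowtie2 run with default settings would be expected to show no soft-clipped records.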