Diverging read alignment and counts in TopHat vs. HISAT2
1
1
Entering edit mode
4.6 years ago
Thomas ▴ 30

Hi Everyone,

We have commissioned RNA-seq and analysis by a company, which provided us with raw fastq files, BAM files, and a count matrix. They used hard clipping and Tophat for the alignment to GRCm38/Mm10. I have attempted to recreate their analysis with HISAT2 (same reference genome), using simply the default parameters and no separate trimming/clipping. I have used samtools to convert the SAM files to BAM files and compared the results from the company's analysis ("Tophat") with my own ("HISAT2") using IGV. The results are very confusing to me. The majority of genes I have (randomly) inspected look highly similar between both sets of BAM files. See this example gene (Tophat in blue, HISAT2 in red):

Tnf

So far, so good. However, there are also multiple instance where one analysis picked up good reads, while the other did not. This is true in both directions. See these two example genes:

Gapdh

Il12b

And, finally, there are some genes in which one alignment just looks weirdly skewed. For instance:

Ubc

Does anyone know what might account for these differences? Or which alignment I should use for downstream analysis? I'd be grateful for any feedback!

Thomas

RNA-Seq genome alignment • 1.7k views
ADD COMMENT
0
Entering edit mode

Please consider this.

You should know that the old 'Tuxedo' pipeline of Tophat(2) and Cufflinks is no longer the "advisable" tool for RNA-seq analysis. The software is deprecated/ in low maintenance and should be replaced by HISAT2, StringTie and ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. There are also other alternatives, including alignment with STAR and bbmap, or pseudo-alignment using salmon.

ADD REPLY
0
Entering edit mode

But even with that in mind, it's still odd that something like Gapdh (non-repetitive sequence that is well annotated) would be so different.

ADD REPLY
0
Entering edit mode

Have a look at GAPDH in the UCSC genome browser, don't forget to activate the track with segmental duplications ("Segmental Dups").

ADD REPLY
0
Entering edit mode

It's clean for mm10 (I am assuming this is mouse based on the genes in the screenshot).

ADD REPLY
0
Entering edit mode

Yes, it's GRCm38/Mm10. I should have mentioned that (just edited the original entry to include the information). So you think this is an acceptable level of quality for Mm10 alignments?

ADD REPLY
0
Entering edit mode

There is not enough information in the original post to draw a solid conclusion. Yes there appear to be differences but we have no idea if one or both analyses have some problems associated with them.

ADD REPLY
0
Entering edit mode

Right, exactly my problem! I can't tell if either of these is better than the other, if one is seriously flawed or if both of them are acceptable. What additional information would be helpful in determining this?

ADD REPLY
2
Entering edit mode
4.6 years ago
Thomas ▴ 30

Apparently, this problem is inherent to IGV. When we ran the same files that showed no reads on a different operating system, everything looked just fine. There are, of course, still differences between the Tophat and the HISAT2 alignments, but nothing is missing altogether. I will try to switch to the UCSC Genome Browser, as suggested by WouterDeCoster. Thank you all for your input!

ADD COMMENT
0
Entering edit mode

IGV has certain defaults about what data it shows and how much of the data it shows. You could look into those under preferences. There are many things you can change and perhaps there are differences in sam tags that are causing this odd display behavior. Something to consider.

ADD REPLY

Login before adding your answer.

Traffic: 1659 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6