Tophat problem - R1 and R2 reads mapped to different strandeds of viral genome
2
0
Entering edit mode
3.6 years ago
WZOUXM ▴ 10

We recently analyze a set of RNA-seq data with Tophat and met some questions about the alignment.

We infected cells with virus and extracted total RNA samples for RNA-seq at different time points. Libraries were prepared with a TruSeq stranded total RNA sample with a Ribo‐Zero prep kit, and the RNA sequencing read type was 2 × 50 bp pair-end sequencing. There is no problem when we analyze the host transcriptome. However, when we analyze the viral transcriptome and mapping the reads to the virus genome using Tophat Version 2.1.1 under default parameters, we found the viral reads mapped to both viral positive and negative strands at the same region. A little bit background about the virus we used: the virus has double stranded circular DNA genome and both strands transcript viral genes but at different region. And we tried default parameters with all three different library types, the results are the same. So it seems like the R1 and R2 reads are simply mapped to the different strand of the virus genome without converting R1 strand information. According to this, I tried to only analyze the R1 reads or R2 reads, it gives me more reasonable results as expected.

Has somebody met this problem or could help me out this problem?

alignment software error RNA-Seq • 1.5k views
1
Entering edit mode

Hello WZOUXM ,

You should know that the old 'Tuxedo' pipeline of Tophat(2) and Cufflinks is no longer the "advisable" tool for RNA-seq analysis. The software is deprecated/ in low maintenance and should be replaced by HISAT2, StringTie and ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. There are also other alternatives, including alignment with STAR and bbmap, or pseudo-alignment using salmon.

1
Entering edit mode
3.6 years ago

Don't use tophat2 any more.

R1 and R2 will always map to opposing strands given how Illumina sequencing works. Note however that orientation of read 2 dictates the strand from which the RNA that was sequenced arose. So if you have a bunch of R2 reads on the - strand and R1 on the + strand then gene that produced them is on the - strand. If you load things in IGV, you can color mates by strand, which should make things clearer.

0
Entering edit mode

I totally agree! Do not use TopHat(2)!

1
Entering edit mode
3.6 years ago

I almost didn't check this because I didn't think I could provide a response for your specific project. However, since I see many of the other responses relate to not using TopHat, I think it may be important to note that I strongly disagree with the conclusion that you shouldn't use TopHat anymore:

1) I know of at least one study (currently unpublished, as far as I know), where I needed to use TopHat to recover a known splicing event

2) I've had situations where the more conservative alignment and/or unaligned reads from TopHat were useful

3) For gene expression, if I see something potentially strange with a TopHat alignment, I have yet to find an example where the trend was noticeably different (there probably are some examples for some genes, but I would guess the gene expression quantification with htseq-count are usually similar for TopHat and STAR single-end alignments for most genes)

1
Entering edit mode

Also, I'm not sure if I completely understand your issue, but I would expect the R1 and R2 to be on different strands, and the trend for R1 versus R2 would vary depending upon your protocol (if it is unstranded, you would see a mix of R1 and R2 for both strands; if it is stranded, the strand should show a separate trend for R1 versus R2, unless there is approximately equal antisense transcription).

This should also be true for your host genome, so I'm not sure if I understand what is the unique problem for your viral reference. However, you could use a joint viral+host genome reference, if you encounter some issues with trying to align to only a small genome. This isn't the same issue that I encountered (since you at least have an alignment), but maybe knowing whether the reads uniquely align to the virus genome could be relevant?

0
Entering edit mode

Hi Charles, Thank you for your reply. The libraries were prepared with a TruSeq stranded total RNA sample with a Ribo‐Zero prep kit. I understand the R1 and R2 should show a separate trend.

But for my project, both strands of the double stranded virus genome transcript viral genes. So there are mRNA from both strands nearly at the same region of the viral genome. And the R1 and R2 reads mapped to the opposite strand of the genome. So, when I analyze the RNA-seq data and want to figure out which reads is for mRNA from positive strand and which reads is for mRNA from negative strand, I found out the reads coverage for positive and negative strand is the same (because if there is a R1 for positive strand, then there is a R2 for negative strand). And this is my problem.

I suppose when do analysis, the software (Tophat or HISAT2) should convert either R1 or R2 orientation because the mate pair reads (R1 and R2) are for a single mRNA (no matter this mRNA is from positive genome or negative genome))

I hope I make myself clear and really need help!!!!!

0
Entering edit mode

Actually, if I only analyze single-end reads (either R1 or R2), then I did not see the problem. And I get reads for both positive and negative strands.

1
Entering edit mode

If you can use R1 separately and get reads on both strands, that sounds like a stranded library. For stranded RNA-Seq libraries, the read is usually on the reverse strand of the original transcript (and the conversion may be 98%, with some opposite strands). For TopHat and HISAT, you are supposed to know the strandedness of your samples prior to alignment.

While I unfortunately can't tell if this solves your problem, RSeQC has a way to calculate the strandedness of your protocol using infer_experiment.py (I would usually recommend this for housekeeping genes, but you could compare the results for some housekeeping host genes versus the viral genes): http://rseqc.sourceforge.net/

1
Entering edit mode

Hi, thank you all for response. Actually I agree with you. I tried Hisat using the following command to analyze my data and give me the same results as Tophat. And I still can not find the reason.

hisat2 -p 7 -S 1.sam -x ref.fa -1 R1.fq.gz -2 R2.fq.gz

0
Entering edit mode

To be honest, I still don't think I understand the problem, but I'll still see if general troubleshooting advice is helpful.

While I don't think it is precise to say "you should never use TopHat," there are situations where I would test other aligners (if nothing else, that gives you more confidence in your earlier result, or at least indicates that issue is not unique for the aligner).

If you have already tried using a joint virus+host genome, perhaps you could try indexing your reference and testing a STAR alignment? If you verify that you have the same issue with TopHat2/HISAT2 and STAR, perhaps posting in IGV screenshot can help explain the original problem?