tophat aligns to wrong strand
2
0
Entering edit mode
7.9 years ago
jlmakofske • 0

I aligned strand-specific RNA-seq samples using tophat2 with default settings and then viewed the resulting "accepted_hits.bam" file in IGV. All the reads aligned antisense to genes. For example, at GAPDH (+ strand gene) all the reads align on the - strand (arrow on left of the read). For ACTB (- strand gene) all the reads align to the + strand (arrow on right of the read).

I figured this is because I used a dUTP first strand library prep protocol and that I would need to specify the prep protocol when aligning. So I ran tophat2 again, this time including the "--library-type fr-firststrand" parameter but the result comes out the same. When viewing the bam files in IGV, the reads are still antisense to genes.

Does anybody have advice on how to fix this?

This also doesn't seem to be an issue with the sequence somehow getting flipped mistakenly during a processing step because if I search for several stretches of GAPDH sequence in the raw fastq file, I won't find any reads having that sequence, but I will find many reads containing the reverse complement of that sequence.

So basically, the sequenced read represents the reverse complement of the actual RNA molecule. Is there a way to specify this during alignment so that the resulting bam shows the correct orientation (ie. reverse complement of the sequenced read)?

RNA-Seq tophat strand alignment dUTP • 3.5k views
ADD COMMENT
0
Entering edit mode

Nothing is wrong in your analysis so you also don't have to fix things. The 'reverse' orientation of the reads is just intrinsic to the library prep method you used and you shouldn't start to artificially tamper with that.

ADD REPLY
1
Entering edit mode
7.9 years ago

IGV and the bam file are already showing the correct orientation. You could, of course, reverse-complement all of the data prior to mapping, but that could have other impacts so I don't recommend it. Basically, just accept that you are using a protocol in which reads map to the antisense strand. When you do analysis, specify the strandedness appropriately, and everything will be fine.

ADD COMMENT
0
Entering edit mode

I have the same problem. The real problem is the overlapped and opposite-oriented genes. I think the results will change but for the better.

ADD REPLY
0
Entering edit mode

However, when I did reverse-complement to the fastq file, i got almost identical results.

ADD REPLY
0
Entering edit mode
7.9 years ago
apa@stowers ▴ 610

If I recall, use of "--library-type fr-firststrand" does not actually change the SAM bit flags. Instead, it adds the XS strand tag to the SAM record -- basically for Cufflinks to use. Non-Cufflinks programs will still read the alignments backwards.

To really fix it, you'll need to edit the bit flag, or maybe split the bam file into + and - versions and analyze accordingly.

ADD COMMENT
1
Entering edit mode

You're correct in that "--library-type fr-firststrand" does not change the required SAM fields. I just affects some Tuxedo-pipeline-specific custom tags.

However, it's not correct that non-Tuxedo pipelines will read alignments backwards. The alignments are already correct and should not be changed in any way; changing them as you suggest could violate the sam specification and result in incorrect data. Correct programs will, of course, process the files correctly regardless of custom Tuxedo fields, which they'll ignore, since they're completely superfluous.

The Tuxedo pipeline will process records incorrectly unless you add custom XS tags. That does not mean that XS is required; it means it's required by the Tuxedo pipeline. Other tools (such as DESeq and edgeR, which, incidentally, are also far more accurate than comparable Tuxedo tools) can process normal sam-compliant files just fine without requiring special custom extensions.

(jlmakofske) I encourage you to use software that is compatible with existing standards, rather than locking yourself into some custom format.

ADD REPLY
0
Entering edit mode

However, it's not correct that non-Tuxedo pipelines will read alignments backwards.

What I meant was that it reads the correct strand of the read backwards, sorry for my semantic glitch. The true strand of the RNA fragment is being reported wrong, because the library was made from rev-comped sequences. In the spirit of the information you are doing the protocol to retrieve, the alignment may as well be backwards.

changing them as you suggest could violate the sam specification and result in incorrect data

Kindly be specific on how it will break the format?

Although I don't so it often, I have edited BAMs in the past and experienced no downstream problems.

Worst comes to worst, just spin of a copy of the BAM to edit for visualization purposes.

Correct programs will, of course, process the files correctly regardless of custom Tuxedo fields, which they'll ignore, since they're completely superfluous

Processing the file "correctly" will still give biologically false results. Compensating for that is the basis for this thread.

ADD REPLY
0
Entering edit mode

What I meant was that it reads the correct strand of the read backwards, sorry for my semantic glitch. The true strand of the RNA fragment is being reported wrong, because the library was made from rev-comped sequences.

The true strand of the RNA fragment is not being reported. Yes, it's true that if you want your reads to look like the original RNA, and they are in fact the reverse-complement of it, then you can reverse-complement your reads to compensate for that, but it's not necessary; tools are designed to deal with this, and changing sam flag bits can confuse them. Bear in mind that paired reads map to opposite strands. Does that mean that paired reads are wrong? No, it just means you have to understand their meaning. I think flipping the strand bit on one read in each pair would probably not do good things for analysis pipelines, because they are already designed to compensate for this.

Kindly be specific on how it will break the format?

Although I don't so it often, I have edited BAMs in the past and experienced no downstream problems.

Let's say you have a read that maps somewhere to the minus strand, and somewhere else to the plus strand. So, for whatever reason, you decide to flip the bit on the primary alignment indicating that it mapped to the other strand; so now both of them map to the plus strand. Unfortunately, this means that the secondary alignment becomes completely incorrect - the bases are reverse-complemented and no longer match the cigar string, or really, anything other than the coordinates. How would, say, IGV display that? I have no idea. Maybe as 100% mismatches. What would a variant-caller do with it? Etc.

Processing the file "correctly" will still give biologically false results.

Can you explain that? The file is already correct. Obviously, processing a correct file correctly would give correct results. If incorrect results are generated, the file is being processed incorrectly. So if you are using a tool that gives biologically incorrect results for RNA-seq data that is reverse-complementary to the original RNA, then either the tool is faulty, or is being used incorrectly.

ADD REPLY
0
Entering edit mode

When switching strands in the bam file, you would need to switch all strands, globally -- otherwise you are right, mate pairs would suddenly appear as discordant. IGV, in particular will color discordant pairs differently, but that is all. Other software may or may not pay attention to the concordancy. So I am referring to global strand switching, which would eliminate the issue.

This is very easily tested, I have already done so. Try aligning a single read or read pair. Then revcomp the fastq, and align that with fr-firststrand. Compare BAM files -- aside from the Tophat XS tag, how do they differ? Solely in the SAM flags. All other fields are identical; read sequence and quality-string orientation are already canonicalized. By changing the bit flags, you convert one BAM into the other -- MD5SUMs identical. So no, the format is not broken; you simply undo the effect of having revcomped fastqs. I have tested this with both single and paired reads.

If you can find me an example where this does not hold or breaks some program, that would be very surprising given my tests, I would very much like to know about that!!!

By "biologically false" I mean that the read counts / browser visualization indicates that all the transcription for gene X is antisense, when in fact it is sense. Computational correctness of the output is only that; it does not reflect on the biology per se.

In any case, the original post is trying to resolve the reversed orientation in the genome browser. Since the BAM is being viewed directly and orientation comes from the flags, I think there are very few options to change this. IGV has a million settings, maybe there is something in there. However, I know my solution works. If you have a better solution, please post that.

ADD REPLY
0
Entering edit mode

It's easy to find one or more cases where a corrupted file does not impact downstream analysis. It's very difficult to prove that it can't. Since I've given you a specific instance in which your altered sam files can cause problems, it's really up to you to prove that it's impossible for that change to lead to incorrect results in any scenario, whether or not a program currently exists that demonstrates it.

Obviously, globally changing strands (based on what, the randomly-selected primary alignment?) won't accomplish anything, because there is no way to simultaneously ensure that all reads map to the "correct" strand, while maintaining correctness in the presence of ambiguity (such as two copies of a gene, one on the genome's plus strand, and one on the minus strand). In that scenario - which happens very often - you can flip the flag bits of all alignments, but it will accomplish nothing, because still half of the reads will be what you consider to be incorrect.

With paired reads, there is no point in swapping strands, anyway, since half of them will still be what you consider to be incorrect. It sounds like you are using software that requires read 1 to map to the plus strand for a given gene, which is very poor programming practice since it does not reflect reality. If so, I suggest you use something else.

I'm done with this thread and will not post further.

ADD REPLY

Login before adding your answer.

Traffic: 2118 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6