Antisense transcription - how to detect it?
0
1
Entering edit mode
6.6 years ago
John 13k

Hello all!

I am trying to see if there is an increased number of anti-sense transcripts around my regions of interest.
Unfortunately, determining what reads are 'anti-sense' is proving a bit difficult. I started by splitting my fragments (using the XS:A:+/- tag, not the read flags) to either the forward or reverse strand with the following code:

import os
import sys
import pysam
inFile = sys.argv[-1]
outFile = os.path.basename(inFile)[:-4]
inputData = pysam.AlignmentFile(inFile, "rb")
forwardReads = pysam.AlignmentFile(outFile + '.forward.bam', 'wb', template=inputData)
reverseReads = pysam.AlignmentFile(outFile + '.reverse.bam', 'wb', template=inputData)
for line in inputData:
for tag in line.tags:
if tag[0] == 'XS':
else:             reverseReads.write(line)

This gave me two bam files: one for the forward-strand reads and one for the reverse-strand reads. I then converted these bams to bedgraph with:

genomeCoverageBed -ibam ./C1.reverse.bam -bg > ./C1.reverse.bdg

and then viewed them in IGV and saw what i believed to be a clear anti-sense peak (I have three samples so there are 6 tracks):

So to call anti-sense peaks like this, I took a genome genes.bed file (GRCm38) and separated it into two files - genes on the forward strand and genes on the reverse strand, with the + and - strand column respectively. The next step was to simply intersect forward-strand genes with reverse-strand reads, and visa versa, and then merge these two intersects together:

bedtools intersect -wa -a C1.forward.bdg -b genesReverse.bed > C1.forward.antisense.bdg
bedtools intersect -wa -a C1.reverse.bdg -b genesForward.bed > C1.reverse.antisense.bdg
cat C1.forward.antisense.bdg C1.reverse.antisense.bdg > C1.antisense.bdg
sort -k1,1 -k2,2n C1.antisense.bdg -o C1.antisense.bdg

But doing that gave me the unexpected result of:

This is because, and if you had a keen eye before you might have noticed it, the gene bed file has the same region being transcribed in both directions. So even though my forward-strand peak is still there, the vastly more reads on the reverse strand have overwhelmed it.

So sure, i could try a different genes.bed file which doesnt have this problem for my region - but i feel like this is not a solution, just a work around for this specific instance. I think it would be better to call antisense reads some other way entirely - perhaps by looking at the ratio of forward/reverse reads? Looking for tell-tale peak shapes? Etc.

If anyone has any suggestions i'd be very grateful! :)

RNA-Seq antisense • 4.2k views
1
Entering edit mode

I didn't look in detail at your procedure, but the region you are showing shows annotated overlapping transcripts on opposite strands. You can't say much about which peaks are coming from the source or an antisense transcript then, can you?

0
Entering edit mode

That is, unfortunately, exactly the problem im experiencing Michael :(
I would say that the gene thats active and true is that on the reverse strand, and the gene on the sense strand is not real (or not active) because there aren't enough reads over it in the sense direction. But I can tell this from looking at it - the question is how to do it programatically.

...hm, perhaps I should count reads in genes (the output of Kallisto or Salmon or something), take my "active" genes, and then use that as the bed file for finding antisense reads?

1
Entering edit mode

Difficult to see which transcript is 'real'; if the forward transcript is annotated, it is at least predicted by Ensembl. But mouse is well annotated, so chances that there is experimental evidence and that it is expressed in different tissues. Is there for example a refseq transcript for both? The peak you are seeing could also be an artifact, also there seems to be a peak at the same position and size in the reverse coverage. Not saying it is an artifact, but it might be hard to prove it's not.

1
Entering edit mode

Ah, I haven't even thought about mapping artefacts and how that could mess things up.. yes that could be a peak in the forward strand too, from some repeat region or palindromic repeat. I suppose I must also lay over a mappability track to catch that sort of thing.
Still, I think that if the forward-reading transcripts here (that are causing everything to be read as antisense) were actually being transcribed, we would see reads in the forward-direction all over their body and not just that peak.
This is where I think your point about which is the "real" transcript is very important, and can only be answered by taking all these annotations, the raw read data, and a strand-aware RNA-Seq feature count tool like Salmon/etc, and with the output of those programs telling us what features are likely to be actively transcribed build another gene annotation .bed file for the above anti-sense pipeline. That should get rid of the forward-transcribed features above because there aren't enough reads to support their transcriptional existence...

Why is nothing ever easy eh...

0
Entering edit mode

I'd always use strand aware read counting for strand specific protocols. Further, I would only allow reads that do not multi-match for this purpose. I think HTseq and easyRNAseq in R have strand aware counting. In the aligner I would try to switch of multi-mapping for this purpose, or filter the BAMs that you have.

1
Entering edit mode

I could whip up something using libGTF that should be able to process stuff like this easily tomorrow or later this week. What you probably want to do is:

1. See if a pair intersect with the sense strand of the GTF file.
2. If not, see if it intersect with the antisense strand.
3. Write to a file accordingly.
4. Call peaks or whatever
5. ?
6. profit/publish

If that's useful then just let me know and I can throw something together for you.

0
Entering edit mode

> implying I will ever publish
Heheh - libGTF looks really polished Devon - nice work man! :D
I will check it out and try to solve the above, but I again think it's going to require a GTF or bed file of annotations filtered on 'real' annotations as per the output of a Salmon/Kallisto like tool, otherwise, much like the ENCODE project, I will end up reporting that 99.9% of transcription is actually antisense, heheh.

(just kidding ENCODE people, I know you never actually said anything like that)

0
Entering edit mode

my guess is if you don't have strand specific RNA-seq, you can't find antisense transcripts

0
Entering edit mode

Oh, i do have strand-specific RNA-Seq - it's just that many many many genes are annotated as being transcribed in both directions, presumably as an artifact from non-stranded RNA-Seq being used to create the reference transcriptome.

Coming back to this post after more than a year, I don't think i explained my problem very well. In simple terms, how do you detect antisense transcription if your reference transcriptome says many genes can be transcribed in both directions? Looking at the data with human eyes, it's clear when you get a peak of anti-sense transcription, but how to do this programatically if a reference transcriptome cannot be reliably used to tell you the direction of sense transcription in your specific cell type?

0
Entering edit mode

What kind of gene prediction did you run? I would guess that it is rare that transcripts have an equally good ORF in both directions, unless there are two exactly overlapping genes on opposite strands. In your example it is mouse, it would surprise me if this genome would contain so many undirected genes/transcripts. I would propose to fix the annotation right away, before going for sense-antisense transcripts.