I am trying to see if there is an increased number of anti-sense transcripts around my regions of interest. Unfortunately, determining what reads are 'anti-sense' is proving a bit difficult. I started by splitting my fragments (using the XS:A:+/- tag, not the read flags) to either the forward or reverse strand with the following code:
import os import sys import pysam inFile = sys.argv[-1] outFile = os.path.basename(inFile)[:-4] inputData = pysam.AlignmentFile(inFile, "rb") forwardReads = pysam.AlignmentFile(outFile + '.forward.bam', 'wb', template=inputData) reverseReads = pysam.AlignmentFile(outFile + '.reverse.bam', 'wb', template=inputData) for line in inputData: for tag in line.tags: if tag == 'XS': if tag == '+': forwardReads.write(line) else: reverseReads.write(line)
This gave me two bam files: one for the forward-strand reads and one for the reverse-strand reads. I then converted these bams to bedgraph with:
genomeCoverageBed -ibam ./C1.reverse.bam -bg > ./C1.reverse.bdg
and then viewed them in IGV and saw what I believed to be a clear anti-sense peak (I have three samples so there are 6 tracks):
So to call anti-sense peaks like this, I took a genome genes.bed file (GRCm38) and separated it into two files - genes on the forward strand and genes on the reverse strand, with the + and - strand column respectively. The next step was to simply intersect forward-strand genes with reverse-strand reads, and visa versa, and then merge these two intersects together:
bedtools intersect -wa -a C1.forward.bdg -b genesReverse.bed > C1.forward.antisense.bdg bedtools intersect -wa -a C1.reverse.bdg -b genesForward.bed > C1.reverse.antisense.bdg cat C1.forward.antisense.bdg C1.reverse.antisense.bdg > C1.antisense.bdg sort -k1,1 -k2,2n C1.antisense.bdg -o C1.antisense.bdg
But doing that gave me the unexpected result of:
This is because, and if you had a keen eye before you might have noticed it, the gene bed file has the same region being transcribed in both directions. So even though my forward-strand peak is still there, the vastly more reads on the reverse strand have overwhelmed it.
So sure, I could try a different
genes.bed file which doesn't have this problem for my region - but I feel like this is not a solution, just a work around for this specific instance. I think it would be better to call antisense reads some other way entirely - perhaps by looking at the ratio of forward/reverse reads? Looking for tell-tale peak shapes? Etc.
If anyone has any suggestions I'd be very grateful! :)