I'm trying to get no. of reads form (TSS ~ +500) and (-500 to pA site) of each gene.
What I want to get is the no of reads of TSS proximal and pA proximal region(window = 500)
If I have a bed file that contains two genes, one plus strand and the other minus strand,
I was wondering if following script is what I want.
WT_bam=pysam.AlignmentFile("./WT.bam", "rb")
mut_bam=pysam.AlignmentFile("./mut.bam", "rb")
def getReads (chromosome, start, end, strand, window=500):
if (strand=="+"):
WT_5end_reads=len(list(WT_bam.fetch(chromosome, start, start+window)))
WT_3end_reads=len(list(WT_bam.fetch(chromosome, end-window, end)))
mut_5end_reads=len(list(mut_bam.fetch(chromosome, start, start+window)))
mut_3end_reads=len(list(mut_bam.fetch(chromosome, end-window, end)))
elif (strand=="-"):
WT_3end_reads=len(list(WT_bam.fetch(chromosome, start, start+window)))
WT_5end_reads=len(list(WT_bam.fetch(chromosome, end-window, end)))
mut_3end_reads=len(list(mut_bam.fetch(chromosome, start, start+window)))
mut_5end_reads=len(list(mut_bam.fetch(chromosome, end-window, end)))
.
. .
ex. following is the bed file. (first column is "chromosome", 2nd col is "start", 3rd col is "end", and the last column is "strand")
chrI 500 2000 . . +
chrII 300 4000 . . -
To make sure, I am showing a part of full code and the full code is working.
But I'm not sure that the code to get reads from minus strand(-) is what I want.
Why don't you use a tool such as htseq-count or featureCounts to get the number of reads? You just need to construct an appropriate bed file.
In the likely event that you'd like to additionally plot the distribution of the reads in these regions, use computeMatrix and then plotHeatmap form deepTools.