Getting No. of reads from BAM by pysam (ChIP-seq)
1
0
Entering edit mode
7.0 years ago
BioDH ▴ 10

I'm trying to get no. of reads form (TSS ~ +500) and (-500 to pA site) of each gene.

What I want to get is the no of reads of TSS proximal and pA proximal region(window = 500)

If I have a bed file that contains two genes, one plus strand and the other minus strand,

I was wondering if following script is what I want.

WT_bam=pysam.AlignmentFile("./WT.bam", "rb")
mut_bam=pysam.AlignmentFile("./mut.bam", "rb")


def getReads (chromosome, start, end, strand, window=500):
if (strand=="+"):
    WT_5end_reads=len(list(WT_bam.fetch(chromosome, start, start+window)))
    WT_3end_reads=len(list(WT_bam.fetch(chromosome, end-window, end)))
    mut_5end_reads=len(list(mut_bam.fetch(chromosome, start, start+window)))
    mut_3end_reads=len(list(mut_bam.fetch(chromosome, end-window, end)))
elif (strand=="-"):
    WT_3end_reads=len(list(WT_bam.fetch(chromosome, start, start+window)))
    WT_5end_reads=len(list(WT_bam.fetch(chromosome, end-window, end)))
    mut_3end_reads=len(list(mut_bam.fetch(chromosome, start, start+window)))
    mut_5end_reads=len(list(mut_bam.fetch(chromosome, end-window, end)))
.

. .

ex. following is the bed file. (first column is "chromosome", 2nd col is "start", 3rd col is "end", and the last column is "strand")

chrI    500    2000    .    .    +
chrII    300    4000    .    .    -

To make sure, I am showing a part of full code and the full code is working.

But I'm not sure that the code to get reads from minus strand(-) is what I want.

ChIP-Seq • 1.6k views
ADD COMMENT
3
Entering edit mode

Why don't you use a tool such as htseq-count or featureCounts to get the number of reads? You just need to construct an appropriate bed file.

ADD REPLY
1
Entering edit mode

In the likely event that you'd like to additionally plot the distribution of the reads in these regions, use computeMatrix and then plotHeatmap form deepTools.

ADD REPLY
2
Entering edit mode
7.0 years ago

FYI, you want something like WT_5ends_reads = WT_bam.count(chromosome, start, start + window). That will be vastly more efficient than using fetch().

Anyway, I agree with WouterDeCoster that it's much easier to use something that already exists (you can use bedtools to modify the BED file to contain only the regions you care about).

The code for the - strand appears correct (see my note above about count()). Give that a try and then double check a region or two manually. Then you'll be certain whether it's performing correctly.

ADD COMMENT

Login before adding your answer.

Traffic: 2219 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6