I have written a function that builds a 3D matrix from a BAM file by piling up reads around BED-file coordinates, using each region's midpoint as the centre: X is the genomic position (offset from the midpoint) and Y is the size of the sequenced template.
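(To make the data layout concrete: the "matrix" is effectively a count per (position-offset, template-size) cell. A stdlib-only toy example with made-up values:)

```python
from collections import Counter

# Made-up (offset-from-midpoint, template-size) pairs, as the function
# would append them; the final matrix is just a count per (x, y) cell.
globalX = [-2, -1, 0, 0, 1]
globalY = [167, 167, 167, 310, 310]

matrix = Counter(zip(globalX, globalY))
# e.g. matrix[(0, 167)] is how many bp-observations landed in that cell
```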
However, it feels heavy and I can't work out a way to trim it down. At the moment it takes each region from the BED file, fetches the overlapping BAM reads, and builds the matrix up region by region. I've considered the inverse: reading the BAM just once and looking up the relevant BED region for each read, but that seems like even more processing...?
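The single-pass alternative I'm considering would look roughly like this (a toy sketch with made-up midpoints, not my actual code; `flank` stands in for `args.e`, and the region lookup uses a binary search over pre-sorted midpoints per chromosome):

```python
from bisect import bisect_left, bisect_right

# Hypothetical pre-built index: sorted region midpoints per chromosome.
regions = {"chr1": sorted([1000, 5000, 9000])}
flank = 500  # stand-in for args.e

def regions_hit(chrom, pos):
    """Return all midpoints whose [mid - flank, mid + flank] window
    contains pos, found by binary search instead of samfile.fetch()."""
    mids = regions.get(chrom, [])
    lo = bisect_left(mids, pos - flank)
    hi = bisect_right(mids, pos + flank)
    return mids[lo:hi]
```

The idea would be to call `regions_hit(read.reference_name, read.reference_start)` once per read while streaming through the BAM a single time.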
Is there any way to improve this, or is it as good as it's going to get because the comparison is inherently heavy?
```python
import pysam

def multiRegionSimpleExtract(globalX, globalY, samfile):
    # Open bedfile (text mode, since we split on str)
    with open(args.b, 'r') as f:
        for line in f:
            # Lines look like "chrom:start-end"
            chrom, coords = line.strip().split(':')
            b2 = coords.split('-')
            midpoint = (int(b2[0]) + int(b2[1])) // 2
            # args.e is a range integer, here saying go 500bp up and downstream
            regionRange = [midpoint - args.e, midpoint + args.e]
            for read in samfile.fetch(chrom, regionRange[0], regionRange[1]):
                pos = int(read.reference_start)
                size = int(abs(read.template_length))
                # For each bp from start to end, add one to the matrix
                while pos <= read.reference_end:
                    globalX.append(pos - midpoint)
                    globalY.append(size)
                    pos += 1
```
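For clarity, here is a standalone illustration of what the inner loop records for a single read (made-up coordinates, no pysam needed):

```python
# One read spanning 6 bp, in a region whose midpoint is 10957:
ref_start, ref_end = 10950, 10955   # read alignment span
midpoint = 10957                    # bed-region midpoint
size = 167                          # abs(template_length)

globalX, globalY = [], []
pos = ref_start
while pos <= ref_end:
    globalX.append(pos - midpoint)  # offset from region midpoint
    globalY.append(size)            # template size, repeated once per bp
    pos += 1
# globalX is now [-7, -6, -5, -4, -3, -2] and globalY is [167] * 6
```

So each read contributes one (offset, size) pair per covered base, which is where most of the Python-level work happens.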
At the moment I can pile up ~280 BED regions against two ~100-million-read (indexed) BAM files with:
But if I use the ~1900 regions that I actually want, it takes around 2 minutes and starts becoming unwieldy.
Ideally, I'd want to be able to feed it even more, though. Any suggestions appreciated.