Memory leak while reading a bigwig in python
4
0
Entering edit mode
7.2 years ago

Hi there!

I'm trying to code a function that reads a bigwig file and calculate the mean score of a given genomic interval. This is my current function:

from ngslib import BigWigFile

def bigwig_mean(bigwig, chr, start, end):

bw=BigWigFile(bigwig)

score_sum = 0
mean_score = 0

    for i in bw.fetch(chrom=chr,start=start,stop=end):

score_sum += i.score

if (end-start) != 0:

mean_score = score_sum/(end-start)

else:
mean_score = 0

return mean_score

bw.close()

This function works, but each time that it reads the bigwig file with:

for i in bw.fetch(chrom=chr,start=start,stop=end)

Memory Ram is permanently used, and not released after "bw.close()". As I need to use this function for a large set of genomic intervals, I get out of memory before finishing the task. There is a memory leak in this function and would be great if any of you can tell me what's wrong or give me an alternative idea of how to code this function.

Cheers,

===UPDATE===

Here I found the original documentation of ngslib:

https://ngslib.appspot.com/BigWigFile

bigwig python • 3.0k views
0
Entering edit mode

I am not sure how the fetch method is implemented, but if it is a generator then you could yield from it. If it is not, then you can modify it to make it so using yield. Look here for some info http://stackoverflow.com/questions/9708902/in-practice-what-are-the-main-uses-for-the-new-yield-from-syntax-in-python-3#9709131

3
Entering edit mode
7.2 years ago
tszn1984 ▴ 100

Thank you for using NGSlib. This package is developed in my spare time so I am sorry that I don't have enough time to maintain it and update regularly.

The previous answers were correct about when to close the BigWigFile. But this is not the reason for your problem. The file handle doesn't take a lot of memory, and the memory will be released when the bw object is destroyed when it is out of its scope.  See BigWigFile.__exit__().

The answers provided by joe.cornish826 and et. at. were correct. The problem you mentioned is not a problem of the code. It's the memory usage strategy of Python. When you load a whole chrom  from a huge file into memory, it is considered as a big memory block and is reversed for Python to use it later. The memory will be released when the python script is finished. You may read related topics on the website. This is a common problem for advanced language. Most of the time, you don't need to care about the memory releasing since Python will do it for you. But you may have problem in some extreme cases.

The BigWigFile is actually a wrapper of Kent's UCSC codes. It is used for fast retrieval of items or depth in a short genomic region, not for the whole chromosome (use bash to do whole genome).  When you want to go through the whole file, there are many other strategies. The recommended strategy are:

1. If you really want to fetch the the whole genome, you may use bash.

>BigWigToWig test.bw | yourscript_to_catch_standard_output_as_input.py

or covert it to wig file and manipulate on the wig file.

2. Put the fetching into a single script. When it is finished, Python will release the memory. Do not do multiple fetching in the same script.

3. Split your chromosome into bins (10K for example), fetch a bin at each time.

4. Follow codes in external/KentLib/wWigIO/wWigIO.c, find a way to call it in C instead of Python.

By the way, if you are going to get the scores at each position instead of all the items, you may use the pileup function. It returns a numpy array with length = end-start. Remember to split the chromosome into bins, otherwise the problem will still be there.

Hope this helps.

1
Entering edit mode

If I understand correctly, the problem is because the list of intervals being allocated grows too large in memory.

As the library author, have you considered adding an iterator interface for .getIntervals function?

0
Entering edit mode

A problem needs to be noted when using chunk fetching of BigWigFile. I am updating the NGSLib for this (1.1.12). An example wig file.

0   50  3

50   100  5

100 150  4

Since the bigwig file has bin size (for example 50bp), when you fetch all the items in a chunk of 50 but started from 30: 30 to 80, it returns all the overlapped bins: 0-50 and 50-100. then when you do fetch 80~130, you get 50-100 and 100-150. the bin 50-100 is double counted. You need to either remove the ones overlapped with the boundary in the later fetch, or modify the wig start and end positions accordingly.

The current solution is : (specify the end position instead of None.)

end = bw.sizes[bw.chroms.index[chrom]] # get the chromosome length.

def getwig(bw,chrom,start,end,chunksize=1000):
for cstart in xrange(start,end,chunksize):
cend = cstart + chunksize
if cend > stop: cend = end
for wig in wWigIO.getIntervals(self.infile,chrom,cstart,cend):
if wig[0] < cstart: wig[0] = cstart
if wig[1] > cend: wig[1] = cend
yield wig

Alternative solution: Instead of doing that, you may you BigWigFile.pileup function to get the all the depth from 30-80, then 80-130. this will yield a numpy object with chunksize each time, except the last one.

for cstart in xrange(start,end,chunksize):
cend = min (start+chunksize, end)
yield bw.pileup(chrom,cstart,cend)

Hope this helps.

2
Entering edit mode
7.2 years ago

Instead of using Python, you could use BEDOPS bedmap and wig2bed with bash process substitution to get the --mean signal over some ad-hoc interval:

$bigWigToWig signal.bigWig signal.wig$ echo -e 'chr1\t1234\t3456' | bedmap --echo --mean - <(wig2bed < signal.wig) > answer.bed

In this case, answer.bed contains mean signal value over the ad-hoc interval "chr1\t1234\t3456".

If you have a sorted BED file of intervals (intervals.bed) over which you want mean signal:

$bigWigToWig signal.bigWig signal.wig$ bedmap --echo --mean intervals.bed <(wig2bed < signal.wig) > answer.bed

In this case, answer.bed contains mean signal values over each provided interval.

One more option, perhaps. Command-line options are fast, concise and not dependent on the vagaries of Python garbage collection.

0
Entering edit mode

I think this might be the best option. It doesn't sound like OP's usage case meshes all that well with what the NGS lib was designed for.

1
Entering edit mode
7.2 years ago

Well, I have never used (or heard of) ngslib, and its documentation is just amazing, but you are calling bw.close() after return, meaning it will never get executed -- maybe that's the problem.

EDIT: In theory, the correct pythonic way to write your code is:

bw = BigWigFile(bigWig)

try:
# stuff goes here
return mean_score
finally:
bw.close()

This way the bw.close() command will be executed both whenever there is an exception raised in the try block (and thus cleaning up), and after the return command is executed (which is also nice for you).

If the library developers implemented the context manager support (and I doubt it, but nevertheless including this for completeness) even better way to write this would be:

with BigWigFile(bigWig) as bw:
# do work
return mean_score
# bw.close() should be called automatically when bw is out of scope.
0
Entering edit mode

Thanks for your answere Saulius. I tried both ways, but the result is the same. I found documentation of the library and I posted in the update of my question. I hope you can take a look to it because I'm totally stacked.

0
Entering edit mode

It seems like you're using the program according to the documentation. I do not think there is anything magic you can do here to force it to use less memory.

The only thing you can try is to work in smaller chunks (beware: untested and rather ugly code). Double check you get the correct answer using the old and new versions (i.e. depending on how fetch works it may double count boundary reads). Might want to ensure chunk_size is at least equal to end-start otherwise this will not work, by the way.

from ngslib import BigWigFile

def _chunked_sum(bw_handle, chr, start, end, chunk_size=100):
n_chunks = (end-start) / chunk_size + 1 if (end-start) % chunk_size != 0 else 0

sum_ = 0
for chunk_start, chunk_end in [start + i*chunk_size, start + (i+1)*chunk_size in xrange(num_chunks)]:
sum_ += sum([i.score for i in bw.fetch(chrom=chr, start=chunk_start, stop=chunk_end])
return sum_

def bigwig_mean(bigwig, chr, start, end):

bw=BigWigFile(bigwig)
try:
score_sum = _chunked_sum(bw, chr, start, end)

if (end-start) != 0:
mean_score = score_sum/(end-start)
else:
mean_score = 0

return mean_score
finally:
bw.close()

1
Entering edit mode
7.2 years ago
pld 5.0k

Python doesn't release the memory it has allocated after whatever was located in that memory has been deleted. This isn't exactly true, but the details aren't as important here.

http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm

However, this memory is still available within Python. Basically by holding onto already allocated pieces of memory, the interpreter can just fill them back up again rather than immediately releasing them to the OS and having to allocate them. So it isn't like that memory is still consumed and can't be used again. The idea is to prevent fragmenting the memory and remove the overhead of calls to malloc and free the interpreter would have to make.

http://www.evanjones.ca/memoryallocator/

Generally when you get a segfault with Python it is because you're out of memory.

Your script shows you opening and closing the file each time your function is called, I would avoid this. It would be better to pass the opened file or use the 'with open...' syntax. You could also pass a list of intervals to your function. This doesn't solve the memory issue however.

Without any documentation for the function BigWigFile, I can't really tell what it is doing. I'm guessing that it is reading the whole file into memory. Unless you can change it to use a generator, you might be in trouble.

You could try splitting the file up and then for each part of the file you parse the genomic intervals associated with it.

1
Entering edit mode

Just to clarify this, as I think your response might be confusing.

This is true for python objects as their memory allocation/deallocation is handled by python.  In 9 cases out of 10 you do not need to worry about this, though, because, as the comment author remarked, the memory is available to python.

The object OP is working with, however, is a C object it seems. Python cannot guarantee all memory allocated for that object will be deallocated  -- the object needs to clean up after itself. It might very well happen that the developers of the library did not implement the correct behaviour to run when python destroys the object, and it might as well be that whilst the python object is destroyed, the mallocs in C cause a memory leak, because the .close() function is never called (even though it should, in a perfect world). Therefore one needs to do the .close() operation him/herself. This try.. finally.. patterns is quite common when dealing with IO resources, and it is a good practice to employ it. Even if the behaviour is implemented correctly, the .close() should be immediate memory drop, provided it does what I think it does.

By the way, there's nothing intrinsically wrong with opening/closing files multiple times. Especially from memory point of view. The problem comes from the design view, you always need to ask should your function know how to open a file, or should it merely know how to process it, once it has been opened? The latter allows you later to pass something that just looks like a file, but is not necessarily one that's physically on your disk (helpful in certain scenarios).

0
Entering edit mode

I was looking at the wrong website for the ngslib documentation. However, once a python object is created, the python interpreter will start allocating memory. It is possible that the memory allocated by C haven't been accounted for properly, but it is important to remember the disconnect between how much memory the python interpreter has allocated versus how much is being "used".

I was more or less suggesting the same approach you made in your post, either using the 'with' syntax, or doing something effectively the same if the author hadn't implemented everything needed for 'with'. Yes, I realize there are cases where you want a function to know how to open a file, but in OP's case it doesn't seem like it makes sense to open and close the file for each interval they want to collect stats on, especially if they have a large amount of intervals to process.

EDIT:

Just to clarify, it sounds like OP was running out of memory as expected.