Multiprocessing on SeqIO (Biopython)
4 months ago
mr.two • 0

Hello,

I would like to parse a wheat genome (13 Gb) quickly, in order to cut each sequence, count the lengths of the resulting fragments, and store the counts in a pandas DataFrame.

Is it advisable to use multiprocessing around the SeqIO.parse command? Does it save time?

Any experiences/recommendations of people having used it?
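For what it's worth, the per-sequence counting step I have in mind might look something like this sketch (the cut motif GGATCC is made up, purely for illustration):

```python
# Sketch of the counting step, independent of how the FASTA is read.
# The cut site (GGATCC) is a made-up example motif.
from collections import Counter

def fragment_length_counts(seq, cut_site="GGATCC"):
    fragments = seq.split(cut_site)            # cut the sequence at each site
    return Counter(len(f) for f in fragments)  # tally fragment lengths

counts = fragment_length_counts("AAAGGATCCTTTTGGATCCCC")
# fragments are "AAA", "TTTT", "CC", so lengths 3, 4, 2 each appear once.
# pd.DataFrame(sorted(counts.items()), columns=["length", "count"])
# would then turn this tally into the DataFrame described above.
```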

SeqIO Multiprocessing Python
1

It should be pretty easy to test this. How slow is it without parallelization on a few average-sized chromosomes, on the system you intend to use? Does that seem very long? From there you could roughly gauge the speed-up you'd expect, given how many cores you have available and how many chromosomes you'd be processing.
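A rough-and-ready way to do that measurement, using only the standard library (the lambda is a toy stand-in for one chromosome's worth of work):

```python
# Time one unit of work before deciding whether parallelising is worth it.
import time

def timed(fn, *args):
    # Wall-clock timing of a single call; returns (result, seconds).
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Toy stand-in for "process one chromosome":
count, elapsed = timed(lambda s: s.count("A"), "ACGT" * 1000)
print(f"{elapsed:.4f} s")
```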

There's also pyfaidx, which may be faster than Biopython. Maybe that, combined with your multiprocessing approach, would be fastest. You'd have to compare, if efficiency is really your goal here.

I also noted these two examples:

The latter looks to have some adaptable code.

2

I'd strongly second this suggestion to check and see how long things actually take for you since the bottleneck might not be where you expect.

Last year I tried adding parallel processing to my Biopython usage to speed up some handling of fastq.gz files, only to find that decompression and fastq parsing were the slowest parts, since my actual processing was pretty lightweight. You might find that just getting those 13 GB from disk into RAM is the bulk of the time, and that prepping your data frame is less of an issue, depending on how much work that latter part is.

0

I have just 4 GB of RAM, so loading the whole genome into memory is not an option.

I also have the impression that parallelising my program would not make it more efficient. I just split each sequence and then count the lengths of the resulting fragments into a pandas DataFrame.

So do you think that getting more RAM would actually speed it up? Using the Pool() operation with a shared object (the DataFrame) seems to be very difficult, and I do not know whether it is worth it.
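One common way around the shared-DataFrame difficulty is to never share it: each worker returns its own plain Counter and the parent merges them at the end. A sketch (toy sequences and a made-up cut site, for illustration only):

```python
# Sketch: avoid sharing a DataFrame across processes. Each worker
# returns a Counter of fragment lengths; the parent merges them and
# builds the table once at the end. The cut site is a made-up example.
from collections import Counter
from multiprocessing import Pool

def fragment_length_counts(seq, cut_site="GGATCC"):
    return Counter(len(f) for f in seq.split(cut_site))

if __name__ == "__main__":
    sequences = ["AAAGGATCCTT", "CCCC", "GGATCCAAAA"]  # toy data
    with Pool(processes=2) as pool:
        partials = pool.map(fragment_length_counts, sequences)
    total = sum(partials, Counter())  # merge in the parent process
    # pd.DataFrame(sorted(total.items()), columns=["length", "count"])
    # would tabulate the merged counts.
```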

0

No, I think you understand it right: you only need to load a single contig at a time, so you're not really using all 13 GB at once (assuming you load a record, parse it, drop it, and continue with the next one). What I mean is that the loading from disk itself might be the bottleneck, particularly since tabulating lengths is probably a lightweight operation.

Is your genome gzipped, by any chance? If you're loading straight from fasta.gz it could be worth looking at different methods for loading, but if it's already plain text I doubt there's much that would speed it up. You could check, say, how long it takes to load each contig, and then how much more time your processing takes. If the bulk of the time is spent in the first step, you'd have to focus on speeding that part up; but again, if it's already just text there may not be much to be done there.

0

Oh I see!

I can make it gzipped. Then I'd use gzip.open together with the SeqIO.parse() method. As far as I can tell, this seems to be the fastest solution when memory is too low. I thought multiprocessing could speed things up for large files, with each sequence evaluated by its own core.
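For reference, that pattern looks roughly like this (assuming Biopython is installed; the tiny gzipped FASTA here is made up for illustration):

```python
# Sketch: stream records from a gzipped FASTA one at a time, so only
# a single record sits in memory. The toy fasta.gz file is made up.
import gzip
import os
import tempfile

from Bio import SeqIO

fd, path = tempfile.mkstemp(suffix=".fa.gz")
os.close(fd)
with gzip.open(path, "wt") as fh:  # write the toy fasta.gz
    fh.write(">chr1\nACGTACGT\n>chr2\nGGG\n")

with gzip.open(path, "rt") as fh:  # "rt": text mode for the parser
    lengths = {rec.id: len(rec.seq) for rec in SeqIO.parse(fh, "fasta")}
```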

Thanks a lot! :-)

Would you have an idea of the fastest way to parse a FASTA file in Python?

0

I haven't tested it directly against Biopython, but pyfaidx, which I recommended above, was built in part for speed.

Oops, I see now that that last question became its own thread; see here.

