I think we all come across large files at some point, say, FASTQ files. Right now I have some really large FASTQ files, around 10 GB each, and I need to split them into smaller files first. I've written a Python script to do this, but its algorithm reads the whole file into memory (similar to readlines(), or zcat-ing the file), which leads to running out of memory (60 GB of memory on the cluster node, but still not enough).
Just wondering, is there an approach that doesn't load the whole file but reads it line by line? Would anyone share a script for splitting?
BTW, I'm not doing BWA paired-end mapping, but I'm curious: is it possible to run BWA with an input file around 10 GB in size? Thanks.
Actually I'm using Python to split. One of the key commands is:
input = commands.getoutput('zcat ' + fastqfile).splitlines(True)
It seems a bit faster than readlines(), but basically the idea is still to create a "list" (what Perl calls an "array"). Then I can manipulate a specific line of the list, say list[1000] (the 1000th line).
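For what it's worth, here's a minimal sketch of the streaming version of that idea: iterate over zcat's output line by line and write each chunk out as you go, so the full list never has to exist (the file name and chunk size below are just placeholders):

    import subprocess

    # Stream zcat's output line by line instead of splitlines() on the whole thing.
    fastqfile = 'reads.fastq.gz'              # placeholder input file
    reads_per_chunk = 1000000                 # FASTQ records per output file
    lines_per_chunk = 4 * reads_per_chunk     # 4 lines per FASTQ record

    proc = subprocess.Popen(['zcat', fastqfile], stdout=subprocess.PIPE)
    out = None
    for i, line in enumerate(proc.stdout):
        if i % lines_per_chunk == 0:          # start a new output file
            if out:
                out.close()
            out = open('chunk_%04d.fastq' % (i // lines_per_chunk), 'wb')
        out.write(line)
    if out:
        out.close()
    proc.wait()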
I've successfully run BWA with files over 20 GB (compressed) in size.
Also, what are you trying to do? 60 GB is more than enough for most alignment needs unless you have a really large genome, whereas it may fall short for assembly, and splitting your reads would not help much there.
I've successfully run BWA on compressed fastq files over 20GB in size (around 40 GB uncompressed) on a machine with 18GB of RAM.
Python has a gzip module, so there's no need for the zcat call.
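For example, something along these lines should split a gzipped FASTQ line by line without ever holding it all in memory (the file name and chunk size are just placeholders):

    import gzip

    # Read the gzipped FASTQ directly, one line at a time, no zcat needed.
    lines_per_chunk = 4 * 1000000             # 4 lines per FASTQ record

    with gzip.open('reads.fastq.gz', 'rb') as fin:
        out = None
        for i, line in enumerate(fin):
            if i % lines_per_chunk == 0:      # start a new output file
                if out:
                    out.close()
                out = open('chunk_%04d.fastq' % (i // lines_per_chunk), 'wb')
            out.write(line)
        if out:
            out.close()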