I happen to have a small problem which seemed to be trivial at first, but keeps me busy for a while now already. Maybe you can help anew, because most Biostars posts on this issue are quite outdated:
I need to split 615M paired reads currently in two FastQ files (forward and reversed reads separate) into two file pairs with 308M reads.
Solution attempt A:
I unsuccessfully tried to use line count based tools like
awk, but the
wc -l does not even count them correctly (line counts differ), although they are processed nicely by several tools. By googling, I found that a possible reason for this behavior might be that newline characters may occur in the quality scores, which thus hinder the core utilities.
Solution attempt B:
I tried to use my
BBTools, my favorite Swiss army knife of everything sequence related, but...
bbmap/reformat.sh in=... in2=... out=... out2=... reads=308000000 bbmap/reformat.sh in=... in2=... out=... out2=... skipreads=308000000
Input is being processed as paired Input: 615307122 reads 91326404105 bases Output: 615307122 reads (100.00%) 91326404105 bases (100.00%)
Input is being processed as paired Input: 615307122 reads 91326404105 bases Output: 0 reads (0.00%) 0 bases (0.00%)
for the second command, such that I got a copy of my data in the first case and an empty file in the second.
Solution attempt C:
Here on Biostars, somebody had suggested
famas for a similar problem, so I installed and tried it:
famas --in=... --in2=... --out=...XXXXXX.fq.gz --out2=...XXXXXX.fq.gz -x 308000000
However, the program flooded to output directory with thousands of subfiles files (instead of the actually needed two files each) until the file system couldn't cope with the number of open files anymore and ran out of file descriptors.
ERROR(famas.c|open_output_one:1056): Couldn't open =...compressed.065534.fq.gz ERROR(famas.c|main:1163): Couldn't open output files. Exiting...
In case somebody is wondering why I mixed the long and short notation here: the parameter --split-every=308000000 was not recognized (unknown parameter).
Since it took me a while to clean that mess up again, I am somewhat reluctant to try out more. Does sombeody have any ideas where I screwed up the commands or has suggestions which tools work better?
Thanks a lot for reading, pondering and help!