I'm using sortmerna to remove rRNA sequences from published microbiome metatranscriptomics data. The data are paired-end reads, all downloaded from ENA and associated with the same experiment. They vary considerably in size, though: the uncompressed files range from under 200 MB to over 4 GB per read file (forward or reverse), and the two files of a pair are generally similar in size, with a few exceptions. (I'm not sure why the data volume varies so much from sample to sample within the same experiment, but that's just how it is.)
For some of the pairs, sortmerna runs fine and produces output files of aligned and unaligned reads. For others, however, it fails in a peculiar way: it reads the reference and the two read files without any errors or warnings, counts the reads, and gets as far as the splitting step (the console lines that start with "[split]"). But before the actual alignment (the console lines that start with "[align]"), and before spawning the processing threads, it simply quits and returns to the command prompt--no error message or anything--without writing any output into the "out" folder.
I haven't opened the failing FASTQ files in a large-file text editor to look for visible corruption, but I don't expect any: the read counts sortmerna reports are in the millions, as expected, and match between the forward and reverse files of each pair. So it's not that the failing files lack data. And as I said, all samples are from the same experiment and are matched against the exact same reference database--the commands entered at the prompt are identical except for the read file names. In fact, the only pattern I notice is that the failing files are larger than the succeeding ones: it seems that whenever the combined size of the two read files exceeds 4 GB (i.e., when each individually exceeds roughly 2 GB), the run fails. This makes me suspect a memory issue that the program, for some reason, doesn't report but simply responds to by quitting.
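For what it's worth, here is the kind of quick cross-check I mean for the counts: a FASTQ record is four lines, so dividing the line count by four and comparing the pair catches truncation without opening the files in an editor (the file names in the usage example are placeholders):

```shell
# Count FASTQ records (4 lines per record) in a forward/reverse pair
# and flag a mismatch. Arguments: the two uncompressed FASTQ files.
count_fastq_pair() {
  local n_fwd n_rev
  n_fwd=$(( $(wc -l < "$1") / 4 ))
  n_rev=$(( $(wc -l < "$2") / 4 ))
  echo "forward: $n_fwd  reverse: $n_rev"
  if [ "$n_fwd" -eq "$n_rev" ]; then
    echo "counts match"
  else
    echo "COUNT MISMATCH"
  fi
}
```

Usage, e.g.: `count_fastq_pair sample_1.fastq sample_2.fastq` (hypothetical names). The forward/reverse counts of the failing pairs all match when checked this way.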
The software is being run on a laptop with 16 GB of RAM, so the files should fit--and even if they didn't, the user manual says reads are processed in chunks precisely so that they don't have to. Dividing the work over more threads (say, 8 instead of the default 2) does make it run substantially faster on the smaller files where it DOES actually do the alignment, but doesn't stop it quitting early on the files where it doesn't.
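For reference, the invocations look roughly like this--paths and database name are placeholders, and the flags are the SortMeRNA 4.x options as I understand them from the manual:

```shell
# Placeholder paths/names. --workdir is the scratch location, --threads
# the worker count, --fastx/--aligned/--other request FASTQ output of
# both aligned and unaligned reads, --paired_in keeps mates together,
# and --out2 writes separate forward/reverse output files.
sortmerna \
  --ref rRNA_db.fasta \
  --reads sample_1.fastq \
  --reads sample_2.fastq \
  --workdir /external/sortmerna_work \
  --threads 8 \
  --fastx --aligned --other \
  --paired_in --out2
```

Only the two `--reads` arguments change between runs; everything else is identical for the pairs that succeed and the pairs that fail.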
I was also thinking about hard drive space. The read files are on an external 2 TB drive that is less than a quarter full, so there is plenty of space there. The working directory was originally on the main hard drive, which has fewer than 30 GB free, and I thought maybe sortmerna needs temp files several times the size of the input to run; but alas, moving the working directory to the external drive didn't keep it from terminating early.
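In case anyone wants to rule this out the same way: this is roughly how I checked the free space on the filesystem holding the working directory (the `.` stands in for wherever `--workdir` points; the `df` options are GNU coreutils):

```shell
# Report available space, in GB, on the filesystem holding a directory.
# Replace "." with the sortmerna working directory.
df -BG --output=avail . | tail -1
```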
Has anyone else run into this same problem?
Please post the actual console output and your system specs (especially RAM); no one here can guess without them.