In short: I use legacy BLAST (2.2.22) via the QIIME script exclude_seqs_by_blast.py to decontaminate sequences coming from a 16S rRNA microbial survey. I once decontaminated a 1.92 GB file on a computing cluster in less than an hour; after decontamination the file shrank to 1.88 GB.
Now I'm running into problems: I get segmentation faults, memory errors, and core dumps (see the detailed section below) when I try to decontaminate new sequences. So I wrote a script that splits the FASTA file into a number of parts that I specify (using the Perl script fasta-splitter), decontaminates the resulting files (each around 15 MB), and pools the "non-matching" sequences into one file, i.e. the decontaminated input file.
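For reference, the pooling step in my script is essentially just concatenating the per-part "non-matching" FASTA outputs. A minimal sketch (the `*.non_matching` suffix and paths are illustrative; adjust to whatever exclude_seqs_by_blast.py actually writes on your install):

```python
# Minimal sketch of the pooling step: concatenate the per-part
# "non-matching" FASTA outputs into one decontaminated file.
# The ".non_matching" glob pattern is an assumption about the
# output naming; check your own output directory.
import glob

def pool_parts(parts_glob, out_path):
    """Concatenate all FASTA files matching parts_glob into out_path."""
    with open(out_path, "w") as out:
        for part in sorted(glob.glob(parts_glob)):
            with open(part) as f:
                out.write(f.read())
```

Sorting the part names keeps the pooled file's record order deterministic between runs.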
However, I noticed that a lot of sequences were "lost", i.e. matched contaminating sequences, and I felt something was not entirely right.
I went back to the original 1.92 GB file and applied the same splitting approach. I expected to get the same output as I did initially (exactly 1.88 GB), but instead I got 466.8 MB. Something is clearly not right.
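To see exactly which sequences disappear, I compare the header IDs between the single-run output and the pooled output with a quick check like this (the function and file names are just placeholders for illustration):

```python
# Quick diagnostic: which sequence IDs are kept by the single full
# run but missing from the pooled per-part runs?
# File paths passed in are placeholders for my actual outputs.

def fasta_ids(path):
    """Return the set of sequence IDs (first token after '>') in a FASTA file."""
    with open(path) as f:
        return {line[1:].split()[0] for line in f if line.startswith(">")}

def lost_ids(full_run_path, pooled_path):
    """IDs present in the full-run output but absent from the pooled output."""
    return fasta_ids(full_run_path) - fasta_ids(pooled_path)
```

If the split/pool workflow were equivalent to the single run, `lost_ids` should come back empty.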
I use legacy BLAST (2.2.22) via the QIIME script exclude_seqs_by_blast.py. I've checked the parameters of the BLAST queries; they are the same in both cases:
    'percent_aligned': 0.98
    'verbose': False
    'blastmatroot': None
    'no_clean': False
    'no_format_db': False
    'subjectdb': '/Volumes/private/file/path/contaminants005.fa'
    'working_dir': '/tmp/'
    'wordsize': 28
    'querydb': '/Volumes/private/file//fecal_filtered.part-005.fa'
    'e_value': 1e-10
    'outputdir': '/Volumes/private/file/path/decontaminated'
    'max_hits': 100
This is confusing, and I don't know the source of the problem. In the meantime I switched from QIIME 1.9.0 to 1.9.1, but the computing cluster has all versions available. I've tried going back to 1.9.0, and the effect is the same.
Is there a reason why splitting, "decontaminating" (i.e. matching sequences against PCR "blanks"), choosing the non-matching sequences, and pooling those sequences from each file part might give a different result than doing the same thing with one file?
The errors I get now (which I didn't get before some software changes occurred) are:
    Traceback (most recent call last):
      File "/usr/local/analysis/qiime/1.9.1/bin/exclude_seqs_by_blast.py", line 267, in <module>
        main()
      File "/usr/local/analysis/qiime/1.9.1/bin/exclude_seqs_by_blast.py", line 206, in main
        options.percent_aligned, DEBUG=DEBUG)
      File "/usr/local/analysis/qiime/1.9.1/lib/python2.7/dist-packages/qiime/exclude_seqs_by_blast.py", line 118, in find_homologs
        curr_blast_result = BlastResult(raw_output_data)
      File "/usr/local/lib/python2.7/dist-packages/cogent-1.5.3-py2.7-linux-x86_64.egg/cogent/parse/blast.py", line 396, in __init__
        hits.append(dict(zip(rec_data, h)))
    MemoryError
Core dumps (I don't have a log to show).
Back then I used around 25 to 30 GB of memory, but since the memory errors started occurring I've even submitted a job with 120 GB of memory, to no avail.
I know legacy BLAST is not efficient or recommended for this, as it is slow; however, 52 minutes is acceptable for me. The sequences were untouched between runs, so I am struggling with reproducibility here.