Have any of you worked with the excellent BBMap Toolbox? I have, and I love it, but I am not at all sure why this is happening. I am trying to remove human contaminants from genomic samples of a large conifer species using the method described by Brian Bushnell (author) here:
Why even do this? Well, our sequencing center is quite busy, and definitely used primarily for human data. I started to run this on the files in our project and checked in on "human.fq" for those files (the mapped reads, that are thus contaminants) and discovered that it had indeed wrote high certainty matches against the human genome. Cool, so I go to apply this to all of the files (rather than the three used as experiment), and I ran into a persistent problem.
So, I run the wrapper script that calls the mapping command on each of ~400 files, only to discover that each one fails after ~40 seconds with a characteristic, vague stack trace. So, here are the commands...
(1) Indexing the Masked Human genome (hg19, UCSC, as recommended), exactly one time, before launching the scripts:
bbmap.sh ref=/path/to/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz path=/path/to/databases/Hosa/hg19_masked/
(2) Run the script, dispatching jobs -- which creates this script for each set of paired end reads:
echo '#!/usr/bin/env bash bbmap.sh -Xms50G -Xmx54G minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 path=/path/to/databases/Hosa/hg19_masked/ in=/path/to/clean/DDP2_S2_H2VF3_L003_R#_001.fastq outu=/path/to/clean_decontam/DDP2_S2_H2VF3_L003_R#_001.fastq outm=/path/to/clean_decontam/DDP2_S2_H2VF3_L003_R#_001.human_c.fastq statsfile=/path/to/clean_decontam/DDP2_S2_H2VF3_L003_R#_001.stats_c.txt usejni=t threads=11' | sbatch --job-name=BBMap_DDP2_S2_H2VF3_L003_Runs_001.fastq --time=0 --mem=64G --partition=bigmemm --ntasks=1 --cpus-per-task=12 --mail-type=END,FAIL --mail-user=[me]
I have indeed compiled jni, and didn't receive any warnings while doing so. This is the exact formatting -- so if whitespace is an issue, then I can replace all instances of whitespace characters (except the newline between bash header and command) with spaces. This doesn't seem like it would be a problem though. Sbatch is the dispatcher for slurm, which is our manager. I have opted to use 85% of 'physical memory' when setting the java heap size, where the physical memory is what is available to the job thanks to slurm. I am using one less thread than total to leave a little room I guess, since I don't really know the worker threads dispatched or the subprocesses under their control.
Okay, so I tried to change up the parameters a little, specifying the reference fasta (using ref=) and letting it write a new one for each file, and that didn't work. I tried specifying the reference fasta (using ref=) and letting it index the reference in memory (using nodisk), and that didn't work. Both of these specifications failed in the same way, and in the same way as the original method (as in the script above):
Here is one such failure, using the ref, nodisk way
java -Djava.library.path=/install/path/bbmap/jni/ -ea -Xmx54G -cp /install/path/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 -Xms50G -Xmx54G minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 ref=/path/to/databases/Hosa/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz nodisk in=/path/to/clean/DDP2_S2_H2VC3_L007_R#_001.fastq outu=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.fastq outm=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.human_c.fastq statsfile=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.stats_c.txt usejni=t threads=11 Executing align2.BBMap [tipsearch=20, maxindel=80, minhits=2, bwr=0.18, bw=40, minratio=0.65, midpad=150, minscaf=50, quickmatch=t, rescuemismatches=15, rescuedist=800, maxsites=3, maxsites2=100, build=1, overwrite=true, fastareadlen=500, -Xms50G, -Xmx54G, minid=0.95, maxindel=3, bwr=0.16, bw=12, quickmatch, minhits=2, ref=/path/to/databases/Hosa/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz, nodisk, in=/path/to/clean/DDP2_S2_H2VC3_L007_R#_001.fastq, outu=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.fastq, outm=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.human_c.fastq, statsfile=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.stats_c.txt, usejni=t, threads=11] BBMap version 35.82 Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.650 Set threads to 11 Retaining first best site only for ambiguous mappings. Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.908 Executing dna.FastaToChromArrays2 [/path/to/databases/Hosa/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=false, minscaf=50, midpad=150, startpad=8000, stoppad=8000, nodisk=true] Set genScaffoldInfo=true Exception in thread "main" java.lang.NullPointerException at align2.AbstractMapper.checkFiles(AbstractMapper.java:709) at align2.AbstractMapper.<init>(AbstractMapper.java:54) at align2.BBMap.<init>(BBMap.java:41) at align2.BBMap.main(BBMap.java:29)
So, what's going on? What am I doing wrong here? Thanks