BBMap NullPointer during initialization
1
0
Entering edit mode
5.8 years ago
cacampbell ▴ 50

Have any of you worked with the excellent BBMap Toolbox? I have, and I love it, but I am not at all sure why this is happening. I am trying to remove human contaminants from genomic samples of a large conifer species using the method described by Brian Bushnell (author) here:

Why even do this? Well, our sequencing center is quite busy, and definitely used primarily for human data. I *started* to run this on the files in our project and checked in on "human.fq" for those files (the mapped reads, that are thus contaminants) and discovered that it had indeed wrote high certainty matches against the human genome. Cool, so I go to apply this to *all of the files* (rather than the three used as experiment), and I ran into a persistent problem.

So, I run the wrapper script that calls the mapping command on *each* of ~400 files, only to discover that each one fails after ~40 seconds with a characteristic, vague stack trace. So, here are the commands...

(1) Indexing the Masked Human genome (hg19, UCSC, as recommended), exactly one time, before launching the scripts:

    bbmap.sh ref=/path/to/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz path=/path/to/databases/Hosa/hg19_masked/

(2) Run the script, dispatching jobs -- which creates this script for each set of paired end reads:

echo '#!/usr/bin/env bash

bbmap.sh -Xms50G -Xmx54G minid=0.95 maxindel=3             bwr=0.16 bw=12 quickmatch fast minhits=2 path=/path/to/databases/Hosa/hg19_masked/ in=/path/to/clean/DDP2_S2_H2VF3_L003_R#_001.fastq             outu=/path/to/clean_decontam/DDP2_S2_H2VF3_L003_R#_001.fastq outm=/path/to/clean_decontam/DDP2_S2_H2VF3_L003_R#_001.human_c.fastq statsfile=/path/to/clean_decontam/DDP2_S2_H2VF3_L003_R#_001.stats_c.txt usejni=t             threads=11' | sbatch --job-name=BBMap_DDP2_S2_H2VF3_L003_Runs_001.fastq --time=0 --mem=64G --partition=bigmemm --ntasks=1 --cpus-per-task=12 --mail-type=END,FAIL --mail-user=[me]    

I have indeed compiled jni, and didn't receive any warnings while doing so. This is the *exact* formatting -- so if whitespace is an issue, then I can replace all instances of whitespace characters (except the newline between bash header and command) with spaces. This doesn't seem like it would be a problem though. Sbatch is the dispatcher for slurm, which is our manager. I have opted to use 85% of 'physical memory' when setting the java heap size, where the physical memory is what is available to the job thanks to slurm. I am using one less thread than total to leave a little room I guess, since I don't really know the worker threads dispatched or the subprocesses under their control.

Okay, so I tried to change up the parameters a little, specifying the reference fasta (using ref=) and letting it write a new one for each file, and that didn't work. I tried specifying the reference fasta (using ref=) and letting it index the reference in memory (using nodisk), and that didn't work. Both of these specifications failed in the same way, and in the same way as the original method (as in the script above):

Here is one such failure, using the ref, nodisk way

java -Djava.library.path=/install/path/bbmap/jni/ -ea -Xmx54G -cp /install/path/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 -Xms50G -Xmx54G minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 ref=/path/to/databases/Hosa/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz nodisk in=/path/to/clean/DDP2_S2_H2VC3_L007_R#_001.fastq outu=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.fastq outm=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.human_c.fastq statsfile=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.stats_c.txt usejni=t threads=11

Executing align2.BBMap [tipsearch=20, maxindel=80, minhits=2, bwr=0.18, bw=40, minratio=0.65, midpad=150, minscaf=50, quickmatch=t, rescuemismatches=15, rescuedist=800, maxsites=3, maxsites2=100, build=1, overwrite=true, fastareadlen=500, -Xms50G, -Xmx54G, minid=0.95, maxindel=3, bwr=0.16, bw=12, quickmatch, minhits=2, ref=/path/to/databases/Hosa/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz, nodisk, in=/path/to/clean/DDP2_S2_H2VC3_L007_R#_001.fastq, outu=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.fastq, outm=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.human_c.fastq, statsfile=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.stats_c.txt, usejni=t, threads=11]

BBMap version 35.82

Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.650

Retaining first best site only for ambiguous mappings.

Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.908

Set genScaffoldInfo=true

at align2.AbstractMapper.checkFiles(AbstractMapper.java:709)

at align2.AbstractMapper.<init>(AbstractMapper.java:54)

at align2.BBMap.<init>(BBMap.java:41)

at align2.BBMap.main(BBMap.java:29)

So, what's going on? What am I doing wrong here? Thanks

BBMap • 2.0k views
0
Entering edit mode

0
Entering edit mode

Ahh, looks like there is a bug there were it is doing the # symbol replacement. Thanks for finding that! I'll fix it in the next release.

For now, please use 1 and 2 instead of the # symbol for the outm and outu (e.g., outm1=matched1.fq out2=matched2.fq). Sorry about that.

-Brian

0
Entering edit mode

You're awesome thanks!

2
Entering edit mode
5.8 years ago
cacampbell ▴ 50

As noted by Brian in the comments, there is a bug with # substitution... but, if you are like me and can't wait for the next release and are just too stubborn to make a minor change to the command used in your script, then here is a workaround that weirdly enough worked for me.

Okay, so you are dispatching to your cluster, and you are using '#' notation, and you are using jni...

1. Replace every instance of whitespace characters with a space character (except the newline after bash header)
2. Use ref=[ref].fa.gz and nodisk options
3. Recompile jni using your version of the jdk
4. run the script, but make sure that java home is set to the jdk this time

No idea why this works, and sorry if this sounds like 'sacrifice a goat in the name of Zalgo', but there it is.

Thanks Brian for fixing in the next release

Easy solution for now:

Use 1 and 2 suffixes on in and out parameters rather than # notation