Question: error correction using BBTools
Anand Rao wrote (2.1 years ago, United States):

Based on various posts at Biostars, I gather that error correction (EC) is important for accurate de Bruijn graph generation and de novo assembly. I am trying to stick with BBTools for all my pre-processing steps.

I've completed the following steps: force-trim modulo, adapter trimming, quality trimming, phiX174 check & removal, and human contamination check & removal. For this EC step, however, I seem to run out of memory with bbnorm.sh, or even with khist.sh.

My khist.sh syntax and STDOUT are shown below.

aksrao@farm:~/FUSARIUM/solexa/FilterbyTile$ srun --partition=high --time=5:00:00 khist.sh in1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont.fq in2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont.fq  khist=khist_preEC.txt peaks=peaks_preEC.txt bits=8
srun: job 13889377 queued and waiting for resources
srun: job 13889377 has been allocated resources
java -ea -Xmx49554m -Xms49554m -cp /share/apps/bbmap-36-67/current/ jgi.KmerNormalize bits=32 ecc=f passes=1 keepall dr=f prefilter hist=stdout minprob=0 minqual=0 mindepth=0 minkmers=1 hashes=3 in1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont.fq in2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont.fq khist=khist_preEC.txt peaks=peaks_preEC.txt bits=8
Executing jgi.KmerNormalize [bits=32, ecc=f, passes=1, keepall, dr=f, prefilter, hist=stdout, minprob=0, minqual=0, mindepth=0, minkmers=1, hashes=3, in1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont.fq, in2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont.fq, khist=khist_preEC.txt, peaks=peaks_preEC.txt, bits=8]


Settings:
threads:            24
k:                  31
deterministic:      false
toss error reads:   false
passes:             1
bits per cell:      8
cells:              23.58B
hashes:             3
prefilter bits:     2
prefilter cells:    50.79B
prefilter hashes:   2
base min quality:   0
kmer min prob:      0.0

target depth:       100
min depth:          0
max depth:          100
min good kmers:     1
depth percentile:   54.0
ignore dupe kmers:  true
fix spikes:         false
histogram length:   255
print zero cov:     false

slurmstepd: error: Step 13889377.0 exceeded memory limit (11998168 > 5324800), being killed
srun: Exceeded job memory limit
slurmstepd: error: *** STEP 13889377.0 ON c11-92 CANCELLED AT 2017-08-22T17:37:40 ***
srun: error: c11-92: task 0: Killed
srun: Force Terminated job step 13889377.0

My bbnorm.sh syntax and STDOUT are shown below. I thought that by using the recommended flags, explained at http://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbnorm-guide/, I would never run into memory issues. Perhaps I am misunderstanding those instructions/syntax? Since I never ran into memory issues with Jellyfish (for example), I suspect my syntax and/or my understanding of EC is incomplete or incorrect.

Any advice and suggestion most welcome. Thank you!

aksrao@farm:~/FUSARIUM/solexa/FilterbyTile$ srun --partition=high --time=5:00:00 --nodes=1 bbnorm.sh in1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont.fq in2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont.fq out1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont_EC1.fq out2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont_EC1.fq ecc=t keepall passes=1 bits=16 prefilter
srun: job 13897223 queued and waiting for resources
srun: job 13897223 has been allocated resources
java -ea -Xmx48284m -Xms48284m -cp /share/apps/bbmap-36-67/current/ jgi.KmerNormalize bits=32 in1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont.fq in2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont.fq out1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont_EC1.fq out2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont_EC1.fq ecc=t keepall passes=1 bits=16 prefilter
Executing jgi.KmerNormalize [bits=32, in1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont.fq, in2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont.fq, out1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont_EC1.fq, out2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont_EC1.fq, ecc=t, keepall, passes=1, bits=16, prefilter]


Settings:
threads:            24
k:                  31
deterministic:      true
toss error reads:   false
passes:             1
bits per cell:      16
cells:              11.49B
hashes:             3
prefilter bits:     2
prefilter cells:    49.49B
prefilter hashes:   2
base min quality:   5
kmer min prob:      0.5

target depth:       100
min depth:          5
max depth:          100
min good kmers:     15
depth percentile:   54.0
ignore dupe kmers:  true
fix spikes:         false

Enabled overlap correction (79.2% percent overlap)
slurmstepd: error: Step 13897223.0 exceeded memory limit (16771856 > 5324800), being killed
srun: Exceeded job memory limit
slurmstepd: error: *** STEP 13897223.0 ON c9-75 CANCELLED AT 2017-08-22T22:51:57 ***
srun: error: c9-75: task 0: Killed
srun: Force Terminated job step 13897223.0
modified 2.1 years ago by Brian Bushnell • written 2.1 years ago by Anand Rao
Brian Bushnell wrote (2.1 years ago, Walnut Creek, USA):

Hi Anand,

I suggest you consider this post. That's my general suggestion for preprocessing prior to assembly. I no longer recommend using BBNorm for error-correction, as Tadpole does a much better job.

As for memory kills: that's different from BBNorm running out of memory. BBNorm does not run out of memory, because the amount of memory it uses is defined up front and is unrelated to the dataset (as long as you are using Illumina data; theoretically, if you had a bunch of 100 Mbp reads, you could make it run out of memory, but I only recommend it for short reads). Memory kills, however, are determined by the scheduler, and sometimes schedulers will kill jobs that are doing nothing wrong simply because the scheduler was misconfigured by the administrators.

In this case, you need to find out what the virtual memory limit is for the job and set the -Xmx flag lower than that limit. The exact value depends, but generally 1.5 GB lower, minus another 5%, should be sufficient: if the virtual memory limit is 100 GB, try adding the flag "-Xmx93G". Obviously, in this case you need an even lower value, since you were already getting killed at 48 GB; so find out from the admins why the job is getting killed and what the limits are.
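For illustration (this snippet is not from the original thread), the arithmetic in that rule of thumb can be sketched in shell; the 100 GB limit is a placeholder for whatever your scheduler actually enforces:

```shell
# Sketch of the rule of thumb above: take the job's virtual memory limit,
# subtract 1.5 GB, then take 95% of the remainder, and pass that as -Xmx.
limit_gb=100    # placeholder: substitute your scheduler's per-job limit
xmx_gb=$(awk -v lim="$limit_gb" 'BEGIN { printf "%d", (lim - 1.5) * 0.95 }')
echo "suggested flag: -Xmx${xmx_gb}g"    # for a 100 GB limit, prints -Xmx93g
```

The BBTools wrapper scripts (khist.sh, bbnorm.sh, tadpole.sh) pass -Xmx through to the underlying java invocation, so the flag can be appended to any of the commands shown in this thread.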

written 2.1 years ago by Brian Bushnell

Per your advice, I used tadpole.sh and bumped up the RAM requested incrementally; it took > 50 GB for the command to execute. If you could please look at the info I've copied/pasted and linked out to, and comment on anything out of the ordinary or abnormal, that would be very helpful indeed. Thanks, Brian!

Syntax:

    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=60000
    tadpole.sh t=30 in1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont.fq in2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont.fq out1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont_EC2.fq out2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont_EC2.fq mode=correct ecc=t rollback=t pincer=t tail=t prefilter=t prealloc=t

Details in STDERR file = http://textuploader.com/d6j5i

And the STDOUT file = http://textuploader.com/d6j5b

INPUT FILE LISTINGS

-rw-rw-r-- 1 aksrao aksrao 4.6G Aug 22 00:58 EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont.fq
-rw-rw-r-- 1 aksrao aksrao 4.5G Aug 22 00:58 EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont.fq

OUTPUT FILE LISTINGS

-rw-rw-r-- 1 aksrao aksrao 4.6G Aug 23 15:15 EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont_EC2.fq
-rw-rw-r-- 1 aksrao aksrao 4.5G Aug 23 15:15 EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont_EC2.fq
written 2.1 years ago by Anand Rao

Hi Anand,

Looks fine to me!

written 2.1 years ago by Brian Bushnell

Excellent! Thanks a LOT for all your help, Brian, and also to genomax. Cheers!

written 2.1 years ago by Anand Rao

Brian: For tadpole.sh, the command-line help says:

> Prefiltering parameters:
> prefilter=0         If set to a positive integer, use a countmin sketch
>                     to ignore kmers with depth of that value or lower.

Further down, under

> Error-correction parameters:

I don't see a flag named "countmin". Could you please clarify? Thanks!

In my syntax in the previous reply, I used

prefilter=t

Is that even correct? It seems like it should take an integer value.

Based on your post at SeqAnswers (http://seqanswers.com/forums/showthread.php?t=61445), it seems I could use "minprob=0.8". Is that what you refer to as a "countmin sketch"? I am a little confused here.

Also, I wonder whether it is simpler to just not use these two flags, prefilter and minprob, and let the software do its thing. Your thoughts/advice?

modified 2.0 years ago • written 2.0 years ago by Anand Rao

"prefilter=t" works, and becomes "prefilter=2". You don't need to use prefilter unless you run out of memory (basically, for large genomes and large datasets). A Count-min sketch is a type of Bloom filter. It's unrelated to the "mincount" or "minprob" flag. "minprob" ignores individual kmers, as encountered in a read, with a probability of correctness below a certain value. "prefilter" first counts everything, then ignores kmers with a probable count below a certain value. "mincount" is different; it's mainly for assembly (don't assemble kmers with count less than X).

Just ignore the advanced flags for now; the defaults are fine, and you only need to adjust K as needed based on read length (typically 1/3rd of read length is good for error correction). You can adjust minprob and prefilter if you run out of memory.
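The "roughly one third of read length" rule of thumb from the reply above can be sketched as a one-liner (the read length here is a placeholder, not a value from the thread):

```shell
# Sketch of the "k ~ 1/3 of read length" rule of thumb for error correction.
read_len=150              # placeholder: your (untrimmed) read length
k=$(( read_len / 3 ))     # 150 / 3 = 50
echo "suggested kmer length for ${read_len} bp reads: k=${k}"
```

For 150 bp Illumina reads this gives k=50, which matches the recommendation later in this thread.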

modified 2.0 years ago • written 2.0 years ago by Brian Bushnell

If, after quality trimming, read length is not constant but a distribution of lengths, let's say 72-150 bp, would it be k < 72/3 or k < 150/3?

I am thinking k < min/3, yes? So in my example here, I could go for k=23, right?

Or would it be k < max/3, in this example k = 50? I think not, but I want to be sure... Thanks!

Interestingly, my runs completed without my having specified a k value... strange, right?

modified 2.0 years ago • written 2.0 years ago by Anand Rao

No, 23 is too short, and most of your reads should still be 150 bp long. Longer kmers are generally better as long as you have sufficient depth. I'd recommend k=50.
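As a sketch of how that recommendation might be applied, here is the tadpole.sh command from earlier in the thread with k set explicitly (file names are copied from above; this variant is illustrative and was not run in the thread):

```shell
# Hypothetical variant of the earlier tadpole.sh command with k=50 made explicit.
tadpole.sh t=30 k=50 \
    in1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont.fq \
    in2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont.fq \
    out1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont_EC2.fq \
    out2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont_EC2.fq \
    mode=correct ecc=t rollback=t pincer=t tail=t prefilter=t prealloc=t
```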

written 2.0 years ago by Brian Bushnell