BBmap pipeline output varies across replicated runs
4.8 years ago
Anand Rao ▴ 640


I am using several of the steps found under bbmap-38-60/pipelines/ to pre-process my SE Illumina 100 nt raw reads before mapping them to a reference genome.

To ensure my pre-processing steps are reproducible, I ran one library four times through the exact same pre-processing steps. Starting with Step 2, however, the line counts of the outputs differ slightly across these replicated runs, as shown below.

So my question to forum members is whether you've seen this behavior and, if yes, whether it's normal. Is there a reason why this is happening, for example random seeding, or something else?

If not, do you have ideas about why this might be happening on my SLURM-based HPC cluster, and how to prevent this minor variability? Results vary between compute nodes, and between back-to-back runs on the same compute node as well!


Step 1. Rename SRR data files for BBmap compatibility

BBMap_38.61/bbmap/ in=TEST.fastq out=TEST_rename.fastq fixsra=t -Xmx20g

71369268 TEST/TEST_rename.fastq
71369268 TEST_repeat/TEST_rename.fastq
71369268 TEST_repeat2/TEST_rename.fastq
71369268 TEST_repeat3/TEST_rename.fastq


Step 2. Remove optical duplicates

clumpify.sh -Xms20g in=TEST_rename.fastq out=TEST_rename_clumped.fq.gz dedupe optical

4640107 TEST/TEST_rename_clumped.fq.gz
4640730 TEST_repeat/TEST_rename_clumped.fq.gz
4639478 TEST_repeat2/TEST_rename_clumped.fq.gz
4638977 TEST_repeat3/TEST_rename_clumped.fq.gz


Step 3. Remove poor tiles in Illumina reads

filterbytile.sh -Xms20g in=TEST_rename_clumped.fq.gz out=TEST_rename_clumped_FbT.fq.gz

4319815 TEST/TEST_rename_clumped_FbT.fq.gz
4321528 TEST_repeat/TEST_rename_clumped_FbT.fq.gz
4317394 TEST_repeat2/TEST_rename_clumped_FbT.fq.gz
4320158 TEST_repeat3/TEST_rename_clumped_FbT.fq.gz


Step 4. Adapter and quality trimming

bbduk.sh in=TEST_rename_clumped_FbT.fq.gz out=TEST_rename_clumped_FbT_bbdukTrim.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=70 ref=adapters ftm=5 ordered

4318431 TEST/TEST_rename_clumped_FbT_bbdukTrim.fq.gz
4322073 TEST_repeat/TEST_rename_clumped_FbT_bbdukTrim.fq.gz
4320215 TEST_repeat2/TEST_rename_clumped_FbT_bbdukTrim.fq.gz
4316559 TEST_repeat3/TEST_rename_clumped_FbT_bbdukTrim.fq.gz


Step 5. Filter out artifacts and PhiX spike-ins

bbduk.sh in=TEST_rename_clumped_FbT.fq.gz out=TEST_rename_clumped_FbT_bbdukTrim_Fltrd.fq.gz k=31 ref=artifacts,phix ordered cardinality

4319563 TEST/TEST_rename_clumped_FbT_bbdukTrim_Fltrd.fq.gz
4317725 TEST_repeat/TEST_rename_clumped_FbT_bbdukTrim_Fltrd.fq.gz
4325693 TEST_repeat2/TEST_rename_clumped_FbT_bbdukTrim_Fltrd.fq.gz
4316804 TEST_repeat3/TEST_rename_clumped_FbT_bbdukTrim_Fltrd.fq.gz


Step 6. Filter out rRNA

bbmap.sh ref=MtrunA17r5.0-ANR-EGN-r1.6.rrna.fasta.shIDscleaned-up in=TEST_rename_clumped_FbT_bbdukTrim_Fltrd.fq.gz outu=TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_shIDScleaned.fq.gz outm=TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAmatched_shIDScleaned.fq.gz nodisk

4303501 TEST/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_shIDScleaned.fq.gz
4308744 TEST_repeat/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_shIDScleaned.fq.gz
4310331 TEST_repeat2/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_shIDScleaned.fq.gz
4308370 TEST_repeat3/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_shIDScleaned.fq.gz

29032 TEST/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAmatched_shIDScleaned.fq.gz
29457 TEST_repeat/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAmatched_shIDScleaned.fq.gz
29577 TEST_repeat2/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAmatched_shIDScleaned.fq.gz
29395 TEST_repeat3/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAmatched_shIDScleaned.fq.gz


Step 7. Filter out human and common bacterial contaminants

cat hg19_masked.fa.gz fusedEPmasked2.fa.gz > hg19_masked_fusedEPmasked2.fa.gz

bbmap.sh ref=hg19_masked_fusedEPmasked2.fa.gz in=TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_shIDScleaned.fq.gz outu=TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_NonHGBact.fq.gz outm=TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_HGBact.fq.gz nodisk

4313738 TEST/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_NonHGBact.fq.gz
4324692 TEST_repeat/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_NonHGBact.fq.gz
4322861 TEST_repeat2/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_NonHGBact.fq.gz
4319180 TEST_repeat3/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_NonHGBact.fq.gz

1421 TEST/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_HGBact.fq.gz
1317 TEST_repeat/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_HGBact.fq.gz
1282 TEST_repeat2/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_HGBact.fq.gz
1357 TEST_repeat3/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_HGBact.fq.gz


Step 8. Split mapping to plant (PacBio v5) versus bacterial (strain 1021) genomes

bbsplit.sh in=TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_NonHGBact.fq.gz ref=MtrunA17r5.0-20161119-ANR.fasta,Sm1021_Chr_pSymAB.fasta basename=TEST_BBsplit_%.fq outu=TEST_MtA17_Sm1021_UNmapped.fq


61328164 TEST/TEST_BBsplit_MtrunA17r5.0-20161119-ANR.fq
61332572 TEST_repeat/TEST_BBsplit_MtrunA17r5.0-20161119-ANR.fq
61328536 TEST_repeat2/TEST_BBsplit_MtrunA17r5.0-20161119-ANR.fq
61311576 TEST_repeat3/TEST_BBsplit_MtrunA17r5.0-20161119-ANR.fq

224 TEST/TEST_BBsplit_Sm1021_Chr_pSymAB.fq
216 TEST_repeat/TEST_BBsplit_Sm1021_Chr_pSymAB.fq
224 TEST_repeat2/TEST_BBsplit_Sm1021_Chr_pSymAB.fq
216 TEST_repeat3/TEST_BBsplit_Sm1021_Chr_pSymAB.fq

7563888 TEST/TEST_MtA17_Sm1021_UNmapped.fq
7563940 TEST_repeat/TEST_MtA17_Sm1021_UNmapped.fq
7563708 TEST_repeat2/TEST_MtA17_Sm1021_UNmapped.fq
7560668 TEST_repeat3/TEST_MtA17_Sm1021_UNmapped.fq
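The per-step listings above can be double-checked with plain coreutils. Below is a minimal sketch (file names are placeholders, not my actual paths) that compares two replicate outputs of one step both by read count and byte-for-byte; identical counts alone can hide content differences, so both checks are worth doing. Real outputs would need zcat first, since gzip containers can differ even for identical reads.

```shell
# Stand-in FASTQ files for two replicate runs of the same step.
# In practice these would be e.g. TEST/TEST_rename_clumped.fq.gz
# vs TEST_repeat/TEST_rename_clumped.fq.gz, decompressed with zcat.
printf '@r1\nACGTACGT\n+\nFFFFFFFF\n' > rep1.fq
printf '@r1\nACGTACGT\n+\nFFFFFFFF\n' > rep2.fq

# FASTQ read count = line count / 4
c1=$(( $(wc -l < rep1.fq) / 4 ))
c2=$(( $(wc -l < rep2.fq) / 4 ))
echo "rep1=$c1 reads  rep2=$c2 reads"

# Byte-for-byte comparison catches differences that equal counts can hide
if cmp -s rep1.fq rep2.fq; then
  echo "IDENTICAL"
else
  echo "DIFFER"
fi
```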


bbmap reproducibility clumpify
4.8 years ago
GenoMax 144k

Are you using more than one core for these jobs?

I am going to ask that you try setting a seed for clumpify.sh with:

seed=1              Random number generator seed for hashing.  
                    Set to a negative number to use a random seed.

The differences you are seeing are likely because of this. Set the seed to a constant number across the runs and let us see what happens.
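The effect of pinning a seed can be illustrated with plain GNU coreutils (this is only an analogy, not BBTools itself): shuf fed the same --random-source bytes produces the same ordering every run, the way clumpify.sh with a fixed positive seed should hash reads identically across runs.

```shell
# Analogy with GNU coreutils (not BBTools): a constant random source
# makes shuf reproducible, like a fixed clumpify.sh seed.
printf 'read1\nread2\nread3\nread4\n' > items.txt
yes 42 | head -c 4096 > seed.bytes   # constant byte stream = fixed "seed"

s1=$(shuf --random-source=seed.bytes items.txt)
s2=$(shuf --random-source=seed.bytes items.txt)

[ "$s1" = "$s2" ] && echo "same order both times"
```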

Also, for clumpify.sh you will want to turn the following option on to make the results deterministic. You will need to provide an average insert size.

deterministic=f         Run in deterministic mode.  In this case it is good
                        to set averagepairdist.  BBMap is deterministic
                        without this flag if using single-ended reads,
                        or run singlethreaded.

filterbytile.sh, however, does not have a deterministic flag, so there you may want to run things on a single core.


Hi genomax,

Thanks for your super quick post.

Yes, I am using as many threads as are available, which is usually 8 or 12 cores, each with 2 CPUs (though my terminology may be muddled, it is certainly > 1 core).

For my clumpify.sh run (Step 2), I will try seed=1, thanks for that suggestion.

For specifying the value of average insert size, I looked up how to specify it, and found this seqanswers link.

Since my reads are SE, not PE, and from RNA transcripts, I suppose that of the 3 options listed (mapping vs. overlap vs. assembly), I only have 1) Via Mapping as my viable option, correct?

This mapping strategy for estimating average insert size seems a little circular in logic: I am executing my pre-processing steps PRIOR to the mapping step, but here I would need to map before pre-processing in order to specify an average insert size for a later run in deterministic mode. Is this a problem, or is average insert size a best approximation that is good enough, so that I do not have to run these steps iteratively to converge on an exact value?

In the deterministic flag explanation, it says

"BBMap is deterministic without this flag if using single-ended reads, or run singlethreaded."

So my guess is that bbmap did run deterministically, since my data is SE, not PE reads, but I look forward to your comments. Thanks again!


You can set the seed to any positive integer value. It does not need to be 1. But it needs to be identical for all repeat runs.

Since you have single-end data there is no way to estimate/determine insert size; sorry, I missed that part. It also does not matter for mapping, since you have single-end data. So in your case alignments should have been deterministic.

Using more than one core sometimes leads to slightly different results (though the overall conclusions should not change). I will ping @Brian to see if he can add his final word.


Genomax: Here is my 3 part update

PART 1 - No problem with this step (Step 1); it gave matching results even previously.

/BBMap_38.61/bbmap/ in=SRR1726611.fastq out=SRR1726611_rename.fastq fixsra=t -Xmx20g # MATCHING RESULTS

PART 2 - Your suggestion of seed=1 allowed me to reproduce my results for clumpify.sh, using the syntax below:

clumpify.sh -Xms20g in=SRR1726611_rename.fastq out=SRR1726611_rename_clumped.fq.gz dedupe optical seed=1 threads=1 # MATCHING RESULTS

PART 3 - However, the next step, FILTER BY TILE using filterbytile.sh, gives results that differ between runs:

filterbytile.sh -Xms20g in=SRR1726611_rename_clumped.fq.gz out=SRR1726611_rename_clumped_FbT.fq.gz threads=1 # RESULTS DO NOT MATCH

Is there a way to enforce reproducibility of runs?


I have never found much use for filterbytile.sh. If you have some bad data, then an aligner is generally going to be able to take care of those reads. As to why the results differ slightly, I can't say; it must have something to do with the algorithmic implementation of that tool. You can try dropping that step out.


I agree, the aligner should be able to take care of any bad reads in bad tiles.

Pursuing the non-reproducibility of exactly matched outputs for replicated runs, specifically for filterbytile.sh, I just found that the following flag allows me to obtain reproducible results: usekmers=f

This syntax idea arose from an old seqanswers post by BBMap author Brian Bushnell at this link

Disabling kmer uniqueness to increase speed and decrease memory usage:

filterbytile.sh in=x.fq out=y.fq usekmers=f

However, in the help menu for filterbytile.sh, version 38-60, it says:

A micro-tile is discarded if any of 3 metrics indicate a problem. The metrics are kmer uniqueness (u), average quality (q), and probability of being error-free (e). Each has 3 parameters: deviations (d), fraction (f), and absolute (a). After calculating the difference (delta) between a micro-tile and average, it is discarded only if all three of these conditions are true for at least one metric (using quality as the example): 1) delta is greater than (qd) standard deviations. 2) delta is greater than average times the fraction (qf). 3) delta is greater than the absolute value (qa).
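Read literally, that means a micro-tile is discarded only when, for some metric, its delta exceeds all three thresholds at once. A toy sketch of that rule for the quality metric, using awk (the qd/qf/qa values and measurements below are invented for the example, not the tool's defaults):

```shell
# Toy illustration of the filterbytile discard rule for one metric (quality).
# delta = difference between a micro-tile and the global average;
# sd = standard deviation; qd/qf/qa are the three thresholds.
# All numbers here are made up for illustration.
verdict=$(awk -v delta=6.0 -v sd=2.0 -v avg=30.0 \
              -v qd=2.0 -v qf=0.1 -v qa=1.0 'BEGIN {
  # discard only if delta exceeds ALL three thresholds
  discard = (delta > qd * sd) && (delta > avg * qf) && (delta > qa)
  print (discard ? "discard" : "keep")
}')
echo "$verdict"
```

With these numbers, delta=6.0 clears qd*sd=4.0, avg*qf=3.0, and qa=1.0, so the micro-tile would be discarded.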

Therefore, I wonder under what circumstances using only average quality (q) and probability of being error-free (e), while excluding kmer uniqueness (u), would be OK versus not OK. Please share your thoughts / experience. Thanks!

4.8 years ago
Anand Rao ▴ 640

Here are my findings based on playing around with flags and trying to reproduce results:

1. Variability in results from filterbytile.sh may be avoided with this flag during the run: usekmers=f. However, it is not entirely clear to me what circumstances allow versus disallow the use of this flag; I hope the BBMap author will provide some insights.

2. Variability in results from bbmap.sh may be avoided by invoking the slow or vslow flags. This of course has the drawback of running slower than usual, but on the flip side has the advantage of increasing mapping sensitivity. I cannot imagine scenarios where using either of these flags would be considered invalid.


Response from BBMap software author Brian Bushnell at this SourceForge link; I quote him here:

I added a "seed" flag for the next release (30.71), which should make the program deterministic. The default seed is -1, and negative seeds will still produce nondeterministic output. Positive seeds will set the random number generator into a specific state and produce deterministic output as long as the same seed is used.

I'm pretty sure he intended to say version 38.71, and not version 30.71.
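In other words, the convention Brian describes is: keep a user-supplied non-negative seed as-is (deterministic), and replace a negative one with a freshly drawn value (nondeterministic). A sketch of that convention; the time-based fallback here is illustrative, not BBTools' actual mechanism:

```shell
# Seed convention sketch: negative -> draw a fresh seed (nondeterministic),
# non-negative -> use exactly as given (deterministic).
# The date-based fallback is a stand-in for a real RNG draw.
seed=-1
if [ "$seed" -lt 0 ]; then
  seed=$(date +%s)    # varies run to run
fi
echo "effective seed: $seed"
```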


What is the impetus behind this study? Very few people re-run analyses on their samples over and over, unless they are re-analyzing a larger pool after the addition of new samples and the analysis somehow depends on the entire data pool. For a biologist, reproduction of results would mean getting the same set of genes after a DE analysis, irrespective of differences in which reads happened to align.


I'll keep that in mind, thanks Genomax. I am just trying to make sure no one knocks me for non-reproducibility, so I am checking it at each step in the pipeline. Perhaps that is overkill, based on your experience. Thanks!

