Tutorial:Yes .. BBMap can do that! - Part III clumpify (mark (and dedupe) duplicates without alignment), mutate (create mutant genomes) and other miscellaneous tools
Entering edit mode
3.1 years ago
GenoMax 139k

NOTE: This collection was originally posted at SeqAnswers.com. Creating a copy here to preserve the information.
Part I is available here: Yes .. BBMap can do that! - Part I : bbmap (aligner), bbduk (scan/trim), repair (fix PE reads) and reformat (format conversions)
Part II is available here: Yes .. BBMap can do that! - Part II randomreads (read simulation), demuxbyname/filterbyname, bbsplit (read binning/decontamination) and pileup (coverage stats)

User guides for many BBMap suite programs are available at this link.


- Generating “fake” paired end reads from a single end read file

$ bfakereads.sh in=reads.fastq out1=r1.fastq out2=r2.fastq length=100

That will generate fake pairs from the input file, with whatever length you want (maximum of input read length). We use it in some cases for generating a fake LMP library for scaffolding from a set of contigs. Read 1 will be from the left end, and read 2 will be reverse-complemented and from the right end; both will retain the correct original qualities. And " /1" " /2" will be suffixed after the read name.

- Generate saturation curves to assess sequencing depth

$ bbcountunique.sh in=reads.fq out=histogram.txt

It works by pulling kmers from each input read, and testing whether it has been seen before, then storing it in a table.

The bottom line, "first", tracks whether the first kmer of the read has been seen before (independent of whether it is read 1 or read 2).

The top line, "pair", indicates whether a combined kmer from both read 1 and read 2 has been seen before. The other lines are generally safe to ignore but they track other things, like read1- or read2-specific data, and random kmers versus the first kmer.

It plots a point every X reads (configurable, default 25000).

In noncumulative mode (default), a point indicates "for the last X reads, this percentage had never been seen before". In this mode, once the line hits zero, sequencing more is not useful.

In cumulative mode, a point indicates "for all reads, this percentage had never been seen before", but still only one point is plotted per X reads.



In light of the quality-score issues with the NextSeq platform, and the possibility of future Illumina platforms (HiSeq 3000 and 4000) also using quantized quality scores, Brian developed it for recalibrating the scores to ensure accuracy and restore the full range of values.


BBMap is designed to find the best mapping, and heuristics will cause it to ignore mappings that are valid but substantially worse. Therefore, I made a different version of it, BBMapSkimmer, which is designed to find all of the mappings above a certain threshold. The shellscript is bbmapskimmer.sh and the usage is similar to bbmap.sh or mapPacBio.sh. For primers, which I assume will be short, you may wish to use a lower than default K of, say, 10 or 11, and add the "slow" flag.

msa.sh and curprimers.sh

Quoted from Brian's response directly.

I also wrote another pair of programs specifically for working with primer pairs, msa.sh and cutprimers.sh. msa.sh will forcibly align a primer sequence (or a set of primer sequences) against a set of reference sequences to find the single best matching location per reference sequence - in other words, if you have 3 primers and 100 ref sequences, it will output a sam file with exactly 100 alignments - one per ref sequence, using the primer sequence that matched best. Of course you can also just run it with 1 primer sequence.

So you run msa twice - once for the left primer, and once for the right primer - and generate 2 sam files. Then you feed those into cutprimers.sh, which will create a new fasta file containing the sequence between the primers, for each reference sequence. We used these programs to synthetically cut V4 out of full-length 16S sequences.

I should say, though, that the primer sites identified are based on the normal BBMap scoring, which is not necessarily the same as where the primers would bind naturally, though with highly conserved regions there should be no difference.


Identify type of Q-score encoding in sequence files

$ testformat.sh in=seq.fq.gz
sanger    fastq    gz    interleaved    150bp


Newest member of BBTools. Identify constituent k-mers. http://seqanswers.com/forums/showthread.php?t=63258


Find all k-mers for a given sequence.

$ commonkmers.sh in=reads.fq out=kmers.txt k=4 count=t display=999

Will produce output that looks like

MISEQ05:239:000000000-A74HF:1:2110:14788:23085  ATGA=8  ATGC=6  GTCA=6  AAAT=5  AAGC=5  AATG=5  AGCA=5  ATAA=5  ATTA=5  CAAA=5  CATA=5  CATC=5  CTGC=5  AACC=4  AACG=4  AAGA=4  ACAT=4  ACCA=4  AGAA=4  ATCA=4  ATGG=4  CAAG=4  CCAA=4  CCTC=4  CTCA=4  CTGA=4  CTTC=4  GAGC=4  GGTA=4  GTAA=4  GTTA=4  AAAA=3  AAAC=3  AAGT=3  ACCG=3  ACGG=3  ACTG=3  AGAT=3  AGCT=3  AGGA=3  AGTA=3  AGTC=3  CAGC=3  CATG=3  CGAG=3  CGGA=3  CGTC=3  CTAA=3  CTCC=3  CTTA=3  GAAA=3  GACA=3  GACC=3  GAGA=3  GCAA=3  GGAC=3  TCAA=3  TGCA=3  AAAG=2  AACA=2  AATA=2  AATC=2  ACAA=2  ACCC=2  ACCT=2  ACGA=2  ACGC=2  AGAC=2  AGCG=2  AGGC=2  CAAC=2  CAGG=2  CCGC=2  GCCA=2  GCTA=2  GGAA=2  GGCA=2  TAAA=2  TAGA=2  TCCA=2  TGAA=2  AAGG=1  AATT=1  ACGT=1  AGAG=1  AGCC=1  AGGG=1  ATAC=1  ATAG=1  ATTG=1  CACA=1  CACG=1  CAGA=1  CCAC=1  CCCA=1  CCGA=1  CCTA=1  CGAC=1  CGCA=1  CGCC=1  CGCG=1  CGTA=1  CTAC=1  GAAC=1  GCGA=1  GCGC=1  GTAC=1  GTGA=1  TTAA=1


Simulate multiple mutants from a known reference (e.g. E. coli ).

$ mutate.sh in=e_coli.fasta out=mutant.fasta id=99 
$ randomreads.sh ref=mutant.fasta out=reads.fq.gz reads=5m length=150 paired adderrors

That will create a mutant version of E.coli with 99% identity to the original, and then generate 5 million simulated read pairs from the new genome. You can repeat this multiple times; each mutant will be different.


One can partition a large dataset with partition.sh into smaller subsets (example below splits data into 8 chunks).

partition.sh in=r1.fq in2=r2.fq out=r1_part%.fq out2=r2_part%.fq ways=8


clumpify is described in a detailed thread here: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.

If you are concerned about file size and want the files to be as small as possible, give Clumpify a try. It can reduce filesize by around 30% losslessly by reordering the reads. I've found that this also typically accelerates subsequent analysis pipelines by a similar factor (up to 30%). Usage:

clumpify.sh in=reads.fastq.gz out=clumped.fastq.gz

clumpify.sh in1=reads_R1.fastq.gz in2=reads_R2.fastq.gz out1=clumped_R1.fastq.gz out2=clumped_R2.fastq.gz

clumpify does NOT require alignments so it should prove more useful compared to Picard MarkDuplicates. Relevant options for clumpify.sh command are listed below.

dedupe=f optical=f (default)
Nothing happens with regards to duplicates.

dedupe=t optical=f
All duplicates are detected, whether optical or not.  All copies except one are removed for each duplicate.

dedupe=f optical=t
Nothing happens.

dedupe=t optical=t

Only optical duplicates (those with an X or Y coordinate within dist) are detected.  All copies except one are removed for each duplicate.
The allduplicates flag makes all copies of duplicates removed, rather than leaving a single copy.  But like optical, it has no effect unless dedupe=t.

Note: If you set "dupedist" to anything greater than 0, "optical" gets enabled automatically.


Fuse will automatically reverse-complement read 2. Pad (N) amount can be adjusted as necessary. This will for example create a full size amplicon that can be used for alignments.

fuse.sh in1=r1.fq in2=r2.fq pad=130 out=fused.fq fusepairs


Following command will produce 10 output files with an equal number of sequences and no duplication.

partition.sh in=X.fa out=X%.fa ways=10

shuffle.sh and shuffle2.sh

shuffle2.sh can randomize fastq files of arbitrary size, including paired twin files, keeping pairs together. shuffle.sh is there too but it required the input to fit in memory; shuffle2.sh will write temp files when the data won't fit into memory. It handles multiline fasta files too.


comparesketch.sh in=a.fastq ref=b.fastq minkeycount=2

...will ignore singleton kmers caused by errors. It's also very fast, so you can do an all-to-all comparison with lots of fastqs in a reasonable amount of time:

comparesketch.sh *.fastq alltoall format=3 records=9999

The fact that it does entropy filtering and has a blacklist of taxonomically uninformative conserved kmers present in many clades also prevents spurious hits of e.g. algae to octopus that otherwise confound this kind of methodology, since you are interested in "genetic distance" rather than spurious hits of ATATATATA...ATAT.


bbmask.sh can mask STRs (with repeat length 1-15bp), and findrepeats.sh can find and report or mask longer and partial repeats of various types.

bbmap • 4.8k views
Entering edit mode


I am reviewing some old scripts in which I used clumpify.sh (BBMap 38.90) with the following parameters:

./clumpify.sh in1=$dir_clean_fq/clean_${sample}_1.fq.gz
        dedupe spany adjacent

Taking into account that I didn't specify any value in dupedist, and that dupedist is default 40, according to what you indicate here "Note: If you set "dupedist" to anything greater than 0, "optical" gets enabled automatically.", I understand that if you don't specify dupedist= 0 is equivalent to use dedupe optical?

Thank you very much!

Entering edit mode

Yes that should be true. I will let Brian Bushnell confirm.


Login before adding your answer.

Traffic: 1426 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6