Tutorial: Yes .. BBMap can do that! - Part III : clumpify (mark (and dedupe) duplicates without alignment), mutate (create mutant genomes) and other miscellaneous tools
22 months ago
GenoMax 123k

NOTE: This collection was originally posted at SeqAnswers.com. Creating a copy here to preserve the information.
Part I is available here: Yes .. BBMap can do that! - Part I : bbmap (aligner), bbduk (scan/trim), repair (fix PE reads) and reformat (format conversions)
Part II is available here: Yes .. BBMap can do that! - Part II randomreads (read simulation), demuxbyname/filterbyname, bbsplit (read binning/decontamination) and pileup (coverage stats)

User guides for many BBMap suite programs are available at this link.

# bbfakereads.sh

$ bbfakereads.sh in=reads.fastq out1=r1.fastq out2=r2.fastq length=100

That will generate fake pairs from the input file, with whatever length you want (maximum of input read length). We use it in some cases for generating a fake LMP library for scaffolding from a set of contigs. Read 1 will be from the left end, and read 2 will be reverse-complemented and from the right end; both will retain the correct original qualities. " /1" and " /2" will be suffixed after the read name.

# bbcountunique.sh

Generate saturation curves to assess sequencing depth.

$ bbcountunique.sh in=reads.fq out=histogram.txt


It works by pulling kmers from each input read, testing whether each kmer has been seen before, and then storing it in a table.

The bottom line, "first", tracks whether the first kmer of the read has been seen before (independent of whether it is read 1 or read 2).

The top line, "pair", indicates whether a combined kmer from both read 1 and read 2 has been seen before. The other lines are generally safe to ignore but they track other things, like read1- or read2-specific data, and random kmers versus the first kmer.

It plots a point every X reads (configurable, default 25000).

In noncumulative mode (default), a point indicates "for the last X reads, this percentage had never been seen before". In this mode, once the line hits zero, sequencing more is not useful.

In cumulative mode, a point indicates "for all reads, this percentage had never been seen before", but still only one point is plotted per X reads.

# calctruequality.sh

In light of the quality-score issues with the NextSeq platform, and the possibility that future Illumina platforms (HiSeq 3000 and 4000) will also use quantized quality scores, Brian developed this tool to recalibrate the scores, ensuring accuracy and restoring the full range of values.
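Recalibration is a two-pass workflow: map a sample of reads to a trusted reference, train the matrices with calctruequality.sh, then rewrite the quality scores. A sketch of that workflow is below; the exact flags (in particular bbduk.sh's "recalibrate") should be checked against your BBTools version's help text.

```shell
# 1) Map reads to a trusted reference to obtain alignments for training
bbmap.sh in=reads.fq ref=reference.fa out=mapped.sam

# 2) Build recalibration matrices from the observed error rates
calctruequality.sh in=mapped.sam

# 3) Rewrite the quality scores of the original reads using those matrices
bbduk.sh in=reads.fq out=recalibrated.fq recalibrate
```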

# bbmapskimmer.sh

BBMap is designed to find the best mapping, and heuristics will cause it to ignore mappings that are valid but substantially worse. Therefore, I made a different version of it, BBMapSkimmer, which is designed to find all of the mappings above a certain threshold. The shellscript is bbmapskimmer.sh and the usage is similar to bbmap.sh or mapPacBio.sh. For primers, which I assume will be short, you may wish to use a lower than default K of, say, 10 or 11, and add the "slow" flag.
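Following that advice, a primer search might look like the sketch below; the k and minid values are illustrative, not prescriptive.

```shell
# Small k plus the "slow" flag increases sensitivity for short queries;
# minid sets the identity threshold below which hits are discarded.
bbmapskimmer.sh in=primers.fa ref=target.fa out=hits.sam k=11 slow minid=0.7
```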

# msa.sh and cutprimers.sh

Quoted from Brian's response directly.

I also wrote another pair of programs specifically for working with primer pairs, msa.sh and cutprimers.sh. msa.sh will forcibly align a primer sequence (or a set of primer sequences) against a set of reference sequences to find the single best matching location per reference sequence - in other words, if you have 3 primers and 100 ref sequences, it will output a sam file with exactly 100 alignments - one per ref sequence, using the primer sequence that matched best. Of course you can also just run it with 1 primer sequence.

So you run msa twice - once for the left primer, and once for the right primer - and generate 2 sam files. Then you feed those into cutprimers.sh, which will create a new fasta file containing the sequence between the primers, for each reference sequence. We used these programs to synthetically cut V4 out of full-length 16S sequences.
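The two-pass workflow described above might look like the following sketch (the primer sequences shown are the standard 515F/806R V4 primers, used purely as an example; flag names such as literal, sam1 and sam2 should be confirmed against the tools' help text):

```shell
# Align each primer against every reference sequence (one best hit per ref)
msa.sh in=16S.fa out=left.sam  literal=GTGCCAGCMGCCGCGGTAA    # left (515F) primer
msa.sh in=16S.fa out=right.sam literal=GGACTACHVGGGTWTCTAAT   # right (806R) primer

# Extract the region between the two primer sites from each reference
cutprimers.sh in=16S.fa out=v4.fa sam1=left.sam sam2=right.sam include=t
```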

I should say, though, that the primer sites identified are based on the normal BBMap scoring, which is not necessarily the same as where the primers would bind naturally, though with highly conserved regions there should be no difference.

# testformat.sh

$ testformat.sh in=seq.fq.gz
sanger  fastq   gz      interleaved     150bp

# kcompress.sh

Newest member of BBTools. Identify constituent k-mers. http://seqanswers.com/forums/showthread.php?t=63258

# commonkmers.sh

Find all k-mers for a given sequence.

$ commonkmers.sh in=reads.fq out=kmers.txt k=4 count=t display=999


Will produce output that looks like

MISEQ05:239:000000000-A74HF:1:2110:14788:23085  ATGA=8  ATGC=6  GTCA=6  AAAT=5  AAGC=5  AATG=5  AGCA=5  ATAA=5  ATTA=5  CAAA=5  CATA=5  CATC=5  CTGC=5  AACC=4  AACG=4  AAGA=4  ACAT=4  ACCA=4  AGAA=4  ATCA=4  ATGG=4  CAAG=4  CCAA=4  CCTC=4  CTCA=4  CTGA=4  CTTC=4  GAGC=4  GGTA=4  GTAA=4  GTTA=4  AAAA=3  AAAC=3  AAGT=3  ACCG=3  ACGG=3  ACTG=3  AGAT=3  AGCT=3  AGGA=3  AGTA=3  AGTC=3  CAGC=3  CATG=3  CGAG=3  CGGA=3  CGTC=3  CTAA=3  CTCC=3  CTTA=3  GAAA=3  GACA=3  GACC=3  GAGA=3  GCAA=3  GGAC=3  TCAA=3  TGCA=3  AAAG=2  AACA=2  AATA=2  AATC=2  ACAA=2  ACCC=2  ACCT=2  ACGA=2  ACGC=2  AGAC=2  AGCG=2  AGGC=2  CAAC=2  CAGG=2  CCGC=2  GCCA=2  GCTA=2  GGAA=2  GGCA=2  TAAA=2  TAGA=2  TCCA=2  TGAA=2  AAGG=1  AATT=1  ACGT=1  AGAG=1  AGCC=1  AGGG=1  ATAC=1  ATAG=1  ATTG=1  CACA=1  CACG=1  CAGA=1  CCAC=1  CCCA=1  CCGA=1  CCTA=1  CGAC=1  CGCA=1  CGCC=1  CGCG=1  CGTA=1  CTAC=1  GAAC=1  GCGA=1  GCGC=1  GTAC=1  GTGA=1  TTAA=1


# mutate.sh

Simulate multiple mutants from a known reference (e.g. E. coli).

$ mutate.sh in=e_coli.fasta out=mutant.fasta id=99
$ randomreads.sh ref=mutant.fasta out=reads.fq.gz reads=5m length=150 paired adderrors


That will create a mutant version of E.coli with 99% identity to the original, and then generate 5 million simulated read pairs from the new genome. You can repeat this multiple times; each mutant will be different.

# partition.sh

One can partition a large dataset with partition.sh into smaller subsets with an equal number of sequences and no duplication (the example below splits the data into 8 chunks; the % in each output name is replaced by the chunk number).

partition.sh in=r1.fq in2=r2.fq out=r1_part%.fq out2=r2_part%.fq ways=8


# clumpify.sh

Clumpify is described in a detailed thread here: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

If you are concerned about file size and want the files to be as small as possible, give Clumpify a try. It can reduce file size by around 30% losslessly by reordering the reads. I've found that this also typically accelerates subsequent analysis pipelines by a similar factor (up to 30%). Usage:

clumpify.sh in=reads.fastq.gz out=clumped.fastq.gz



Clumpify does NOT require alignment, which makes it a convenient alternative to Picard MarkDuplicates. The relevant options for the clumpify.sh command are listed below.

dedupe=f optical=f (default)
Nothing happens with regards to duplicates.

dedupe=t optical=f
All duplicates are detected, whether optical or not.  All copies except one are removed for each duplicate.

dedupe=f optical=t
Nothing happens.

dedupe=t optical=t
Only optical duplicates (those with an X or Y coordinate within dupedist) are detected.  All copies except one are removed for each duplicate.

The allduplicates flag removes all copies of a duplicate, rather than leaving a single copy.  But like optical, it has no effect unless dedupe=t.

Note: If you set "dupedist" to anything greater than 0, "optical" gets enabled automatically.
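Put together, the two common invocations look like the sketch below (the dupedist value is illustrative; appropriate values depend on the instrument, e.g. larger distances for patterned flowcells):

```shell
# Remove all duplicates, keeping one copy of each
clumpify.sh in=reads.fq.gz out=deduped.fq.gz dedupe=t

# Remove only optical duplicates (dupedist>0 enables optical automatically)
clumpify.sh in=reads.fq.gz out=opticaldeduped.fq.gz dedupe=t dupedist=40
```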


# fuse.sh

Fuse will automatically reverse-complement read 2. The pad (N) amount can be adjusted as necessary. For example, this can create a full-size amplicon that can be used for alignments.

fuse.sh in1=r1.fq in2=r2.fq pad=130 out=fused.fq fusepairs


