Tutorial:Yes .. BBMap can do that! - Part II randomreads (read simulation), demuxbyname/filterbyname, bbsplit (read binning/decontamination) and pileup (coverage stats)

4.5 years ago

GenoMax 152k

NOTE: This collection was originally posted at SeqAnswers.com. Creating a copy here to preserve the information.
Part I is available here: Yes .. BBMap can do that! - Part I : bbmap (aligner), bbduk (scan/trim), repair (fix PE reads) and reformat (format conversions)
Part III is available here: Yes .. BBMap can do that! - Part III clumpify (mark (and dedupe) duplicates without alignment), mutate (create mutant genomes) and other miscellaneous tools

Guides for many BBTools programs are available at this link.
Program is available for download here.

randomreads.sh

- Generate random reads in various formats

$ randomreads.sh ref=genome.fasta out=reads.fq len=100 reads=10000

You can specify paired reads, an insert size distribution, read lengths (or length ranges), and so forth. But because I developed it to benchmark mapping algorithms, it is specifically designed to give excellent control over mutations. You can specify the number of SNPs, insertions, deletions, and Ns per read, either exactly or probabilistically, and the lengths of these events are individually customizable. Alternatively, quality values can be set so that errors are generated on the basis of quality. There is also a PacBio error model. All of the reads are annotated with their genomic origin, so you will know the correct answer when mapping.

Bear in mind that 50% of the reads are generated from the plus strand and 50% from the minus strand. So, for an error-free read, either the read itself or its reverse-complement will match the reference perfectly.

You can generate the same set of reads with and without SNPs by fixing the seed to a positive number, like this:

$ randomreads.sh maxsnps=0 adderrors=false out=perfect.fastq reads=1000 minlength=18 maxlength=55 seed=5
$ randomreads.sh maxsnps=2 snprate=1 adderrors=false out=2snps.fastq reads=1000 minlength=18 maxlength=55 seed=5

[As of BBMap v. 36.59] randomreads.sh has the ability to simulate metagenomes. coverage=X will automatically set "reads" to a level that gives X average coverage (decimal values are allowed). The metagenome flag assigns each scaffold a random exponential variable, which decides the probability that a read is generated from that scaffold. So, if you concatenate 20 bacterial genomes, you can run randomreads and get a metagenomic-like distribution. It could also be used for RNA-seq when using a transcriptome reference. The coverage is decided on a per-reference-sequence level, so if a bacterial assembly has more than one contig, you may want to glue them together first with fuse.sh before concatenating them with the other references.
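The link between coverage and read count is simple arithmetic: average coverage is roughly (number of reads × read length) / total reference bases. As a rough sketch of that relationship, the hypothetical helper below (not part of BBTools) estimates the read count for a target coverage, assuming fixed-length reads; how randomreads.sh rounds internally, or counts paired reads, is not stated in this post and is not modelled here.

```shell
# Hypothetical helper: estimate how many fixed-length reads are needed
# to reach a target average coverage over a reference.
# usage: reads_for_coverage <reference_bases> <read_length> <coverage>
reads_for_coverage() {
    awk -v g="$1" -v l="$2" -v c="$3" 'BEGIN {
        n = g * c / l
        r = int(n)
        if (r < n) r++        # round up to a whole read
        printf "%d\n", r
    }'
}

# 5 Mbp reference, 100 bp reads, 2.5x average coverage
reads_for_coverage 5000000 100 2.5
```

This is only useful for sanity-checking the size of the output you should expect before running a large simulation.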

- Simulate a jump library

You can simulate a 4000bp jump library from your existing data like this:

$ cat assembly1.fa assembly2.fa > combined.fa
$ bbmap.sh ref=combined.fa
$ randomreads.sh reads=1000000 length=100 paired interleaved mininsert=3500 maxinsert=4500 bell perfect=1 q=35 out=jump.fq.gz

shred.sh

$ shred.sh in=ref.fasta out=reads.fastq length=200

The difference from randomreads.sh is that RandomReads will make reads in a random order from random locations, ensuring flat coverage on average, but it won't ensure 100% coverage unless you generate many-fold depth. Shred, on the other hand, gives you exactly 1x depth and exactly 100% coverage (and is not capable of modelling errors). So, the use-cases are different.
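To put a number on "many-fold depth": under the usual idealized assumption that read start positions are uniformly random (a Poisson approximation, which is my assumption here rather than something taken from the BBTools documentation), the expected fraction of bases covered by at least one read at average depth c is about 1 - e^(-c). The hypothetical helper below just evaluates that formula.

```shell
# Hypothetical helper: expected fraction of the reference covered by at
# least one read at a given average depth, assuming uniformly random
# read placement (Poisson approximation): 1 - e^(-depth).
expected_covered() {
    awk -v c="$1" 'BEGIN { printf "%.6f\n", 1 - exp(-c) }'
}

for depth in 1 5 10; do
    printf '%sx -> %s\n' "$depth" "$(expected_covered "$depth")"
done
```

Under this model, 1x of random reads leaves roughly a third of the reference uncovered, which is why shred.sh is the better fit when exact 100% coverage matters.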

demuxbyname.sh

- Demultiplex fastq files when the tag is present in the fastq read header (Illumina)

$ demuxbyname.sh in=r#.fq out=out_%_#.fq prefixmode=f names=GGACTCCT+GCGATCTA,TAAGGCGA+TCTACTCT,... outu=filename

"Names" can also be a text file with one barcode per line (in exactly the format found in the read header). You do have to include all of the expected barcodes, though.

In the output filename, the "%" symbol gets replaced by the barcode; in both the input and output names, the "#" symbol gets replaced by 1 or 2 for read 1 or read 2. It's optional, though; you can leave it out for interleaved input/output, or specify in1=/in2=/out1=/out2= if you want custom naming.
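Since the names list has to match the header text exactly, it can help to tally the barcodes that are actually present before demultiplexing. Here is a minimal sketch using standard tools; list_barcodes is a hypothetical helper, not a BBTools program, and it assumes 4-line FASTQ records whose header ends in ":<barcode>" (as in recent Illumina headers).

```shell
# Hypothetical helper: print each barcode found in FASTQ headers (read
# from stdin) with its read count, most frequent first.
# Assumes 4-line FASTQ records and headers that end in ":<barcode>".
list_barcodes() {
    awk 'NR % 4 == 1 { n = split($0, f, ":"); count[f[n]]++ }
         END { for (b in count) print count[b] "\t" b }' |
        sort -k1,1nr -k2,2
}

# Toy input with two barcodes
list_barcodes <<'EOF'
@M01:1:FC:1:1101:100:200 1:N:0:GGACTCCT+GCGATCTA
ACGT
+
IIII
@M01:1:FC:1:1101:101:201 1:N:0:TAAGGCGA+TCTACTCT
ACGT
+
IIII
@M01:1:FC:1:1101:102:202 1:N:0:GGACTCCT+GCGATCTA
ACGT
+
IIII
EOF
```

The second column of the output, one barcode per line, is in the form the "names" text file expects.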

readlength.sh

- Plotting the length distribution of reads

$ readlength.sh in=file out=histogram.txt bin=10 max=80000

That will plot the result in bins of size 10, with everything above 80k placed in the same bin. The defaults are set for relatively short sequences so if they are many megabases long you may need to add the flag "-Xmx8g" and increase "max=" to something much higher.
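To make the effect of "bin=" and "max=" concrete, here is a rough awk equivalent of the binning described above. length_hist is a hypothetical helper; the way it labels bins (by the start of each bin, with one final bin for everything at or above the maximum) is my assumption and may not match the exact output format of readlength.sh.

```shell
# Hypothetical helper: histogram of read lengths from FASTQ on stdin.
# usage: length_hist <bin_size> <max_length> < reads.fq
# Lengths are rounded down to the start of their bin; anything at or
# above <max_length> is counted in a single final bin.
length_hist() {
    awk -v bin="$1" -v max="$2" '
        NR % 4 == 2 {
            len = length($0)
            b = (len >= max) ? max : int(len / bin) * bin
            count[b]++
        }
        END { for (b in count) print b "\t" count[b] }' |
        sort -k1,1n
}
```

It reads the whole file through a single awk pass, so it is only meant for illustrating the binning, not as a replacement for readlength.sh on large files.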

Alternatively, if these are assemblies and you're interested in continuity information (L50, N50, etc), you can run stats on each or statswrapper on all of them:

$ stats.sh in=file

or

$ statswrapper.sh in=file,file,file,file…

filterbyname.sh

By default, "filterbyname" discards reads with names in your name list, and keeps the rest. To include them and discard the others, do this:

$ filterbyname.sh in=003.fastq out=filter003.fq names=names003.txt include=t

getreads.sh

If you only know the number(s) of the fasta/fastq record(s) in a file (records start at 0), you can use the following command to extract those reads into a new file.

$ getreads.sh in=<file> id=<number,number,number...> out=<file>

The first read (or pair) has ID 0, the second read (or pair) has ID 1, etc.

Parameters:
in=<file>    Specify the input file, or stdin.
out=<file>   Specify the output file, or stdout.
id=          Comma-delimited list of numbers or ranges, in any order.
             For example: id=5,93,17-31,8,0,12-13

splitsam.sh

- Splits a sam file into forward and reverse reads

$ splitsam.sh mapped.sam plus.sam minus.sam unmapped.sam
$ reformat.sh in=plus.sam out=plus.fq
$ reformat.sh in=minus.sam out=minus.fq rcomp

bbsplit.sh

BBSplit now has the ability to output paired reads in dual files using the # symbol. For example:

$ bbsplit.sh ref=x.fa,y.fa in1=read1.fq in2=read2.fq basename=o%_#.fq

will produce ox_1.fq, ox_2.fq, oy_1.fq, and oy_2.fq

You can use the # symbol for input also, like "in=read#.fq", and it will get expanded into 1 and 2.

Added feature: You can specify a directory for the "ref=" argument. If anything in the list is a directory, BBSplit will use all fasta files in that directory. They need a fasta extension, like .fa or .fasta, but can be compressed with an additional .gz after that. This is useful because BBSplit can then split the input into one output file per reference file.

NOTE 1: By default BBSplit uses fairly strict mapping parameters; you can get the same sensitivity as BBMap by adding the flags minid=0.76 maxindel=16k minhits=1. With those parameters it is extremely sensitive.

NOTE 2: BBSplit has different ambiguity settings for dealing with reads that map to multiple genomes. In any case, if the alignment score is higher to one genome than another, the read will be associated with that genome only (this considers the combined scores of read pairs - pairs are always kept together). But when a read or pair has two identically-scoring mapping locations on different genomes, the behavior is controlled by the "ambig2" flag: "ambig2=toss" will discard the read, "all" will send it to all output files, and "split" will send it to a separate file for ambiguously-mapped reads (one per genome to which it maps).

NOTE 3: Zero-count lines are suppressed by default, but they should be printed if you include the flag "nzo=f" (nonzeroonly=false).

NOTE 4: BBSplit needs multiple reference files as input; one per organism, or one for target and another for everything else. It only outputs one file per reference file.

In contrast, seal.sh, which is similar, can use a single concatenated file, as it will (by default) output one file per reference sequence within a concatenated set of references.

pileup.sh

- To generate transcript coverage stats

$ pileup.sh in=mapped.sam normcov=normcoverage.txt normb=20 stats=stats.txt

That will generate coverage per transcript, with 20 lines per transcript, each line showing the coverage for that fraction of the transcript. "stats" will contain other information like the fraction of bases in each transcript that was covered.

- To calculate physical coverage stats (region covered by paired-end reads)

BBMap has a "physcov" flag that allows it to report physical rather than sequenced coverage. It can be used directly in BBMap, or with pileup, if you already have a sam file. For example:

$ pileup.sh in=mapped.sam covstats=coverage.txt physcov=t

- Calculating coverage of the genome

The program accepts sam or bam input, sorted or unsorted.

$ pileup.sh in=mapped.sam out=stats.txt hist=histogram.txt

stats.txt will contain the average depth and percent covered of each reference sequence; the histogram will contain the exact number of bases with each coverage level. You can also get per-base coverage or binned coverage if you want to plot the coverage. It also generates median and standard deviation, and so forth.

It's also possible to generate coverage directly from BBMap, without an intermediate sam file, like this:

$ bbmap.sh in=reads.fq ref=reference.fasta nodisk covstats=stats.txt covhist=histogram.txt

We use this a lot in situations where all you care about is coverage distributions, which is somewhat common in metagenome assemblies. It also supports most of the flags that pileup.sh supports, though the syntax is slightly different to prevent collisions. In each case you can see all the possible flags by running the shellscript with no arguments.

- To bin aligned reads

$ pileup.sh in=mapped.sam out=stats.txt bincov=coverage.txt binsize=1000

That will give coverage within each bin. For read density regardless of read length, add the "startcov=t" flag.

dedupe.sh

Dedupe ensures that there is at most one copy of any input sequence, optionally allowing containments (substrings) to be removed and a variable Hamming or edit distance to be specified. Usage:

$ dedupe.sh in=assembly1.fa,assembly2.fa out=merged.fa

That will absorb exact duplicates and containments. You can use hdist and edist flags to allow mismatches, or get a complete list of flags by running the shellscript with no arguments.

Dedupe will merge assemblies, but it will not produce consensus sequences or join overlapping reads; it only removes sequences that are fully contained within other sequences (allowing the specified number of mismatches or edits).

Dedupe can remove duplicate reads from multiple files simultaneously, if they are comma-delimited (e.g. in=file1.fastq,file2.fastq,file3.fastq). And if you set the flag uniqueonly=t then ALL copies of duplicate reads will be removed, as opposed to the default behavior of leaving one copy of duplicate reads.

However, it does not care which file a read came from; in other words, it can't remove only reads that are duplicates across multiple files but leave the ones that are duplicates within a file. That can still be accomplished, though, like this:

1) Run dedupe on each sample individually, so now there is at most one copy of a read per sample.
2) Run dedupe again on all of the samples together, with "uniqueonly=t". The only remaining duplicate reads will be the ones duplicated between samples, so that's all that will be removed.
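The logic of the two passes can be illustrated on plain lists of sequences (one per line) with standard tools. This is only a toy analogy for exact duplicates, not how Dedupe is implemented: "sort -u" stands in for the per-sample pass, and "uniq -u" stands in for the pooled pass with "uniqueonly=t". The helper name is hypothetical.

```shell
# Toy analogy of the two-pass procedure, using one sequence per line.
# Hypothetical helper: print the sequences that remain after removing
# every sequence shared between the two sample files.
# usage: drop_cross_sample_dups <sample1.txt> <sample2.txt>
drop_cross_sample_dups() {
    # Pass 1: collapse duplicates within each sample (sort -u).
    # Pass 2: pool the samples and keep only sequences seen once
    #         (uniq -u), so anything present in both samples is
    #         removed entirely.
    { sort -u "$1"; sort -u "$2"; } | sort | uniq -u
}
```

Without the first pass, a sequence duplicated only within one sample would also be dropped by "uniq -u", which is exactly why the per-sample step comes first.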

- Generate ROC curves from any aligner

-index the reference

$ bbmap.sh ref=reference.fasta

-Generate random reads

$ randomreads.sh reads=100000 length=100 out=synth.fastq maxq=35 midq=25 minq=15

-Map to produce a sam file

...substitute this command with the appropriate one from your aligner of choice

$ bbmap.sh in=synth.fastq out=mapped.sam

-Generate ROC curve

$ samtoroc.sh in=mapped.sam reads=100000

- Calculate heterozygous rate for sequence data

$ kmercountexact.sh in=reads.fq khist=histogram.txt peaks=peaks.txt

You can examine the histogram manually, or use the "peaks" file which tells you the number of unique kmers in each peak on the histogram. For a diploid, the first peak will be the het peak, the second will be the homozygous peak, and the rest will be repeat peaks. The peak caller is not perfect, though, so particularly with noisy data I would only rely on it for the first two peaks, and try to quantify the higher-order peaks manually if you need to (which you generally don't).

- Compare mapped reads between two files

This shows how many mapped reads (concordant or discordant, it doesn't matter) are shared between two alignment files and how many are unique to one file or the other.

$ reformat.sh in=file1.sam out=mapped1.sam mappedonly
$ reformat.sh in=file2.sam out=mapped2.sam mappedonly

That gets you the mapped reads only. Then:

$ filterbyname.sh in=mapped1.sam names=mapped2.sam out=shared.sam include=t

...which gets you the set intersection;

$ filterbyname.sh in=mapped1.sam names=mapped2.sam out=only1.sam include=f
$ filterbyname.sh in=mapped2.sam names=mapped1.sam out=only2.sam include=f

...which get you the set subtractions.
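If you only need the counts rather than the sam records, the same set logic can be sketched with standard tools on read names alone. The helpers below are hypothetical (not BBTools programs) and assume uncompressed, tab-delimited sam input; mates of a pair share one name, so the counts are per read name rather than per alignment record.

```shell
# Hypothetical helpers, not BBTools programs.

# Print the sorted, unique names of mapped reads in a SAM file
# (records whose FLAG does not have the 0x4 "unmapped" bit set).
mapped_names() {
    awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 0 { print $1 }' "$1" | sort -u
}

# Count names shared between two SAM files and names unique to each.
# usage: compare_mapped <file1.sam> <file2.sam>
compare_mapped() {
    tmp1=$(mktemp) && tmp2=$(mktemp) || return 1
    mapped_names "$1" > "$tmp1"
    mapped_names "$2" > "$tmp2"
    printf 'shared\t%s\n' "$(comm -12 "$tmp1" "$tmp2" | awk 'END { print NR }')"
    printf 'only1\t%s\n'  "$(comm -23 "$tmp1" "$tmp2" | awk 'END { print NR }')"
    printf 'only2\t%s\n'  "$(comm -13 "$tmp1" "$tmp2" | awk 'END { print NR }')"
    rm -f "$tmp1" "$tmp2"
}
```

Here "comm -12" gives the intersection, while "comm -23" and "comm -13" give the two set subtractions.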

bbrename.sh

$ bbrename.sh in=old.fasta out=new.fasta

That will rename the reads as 1, 2, 3, 4, ... 222.

You can also give a custom prefix if you want. The input has to be text format, not .doc.

updated 2.3 years ago by Ram 45k • written 4.5 years ago by GenoMax 152k