3.7 years ago by
Walnut Creek, USA
For fastq files, you can downsample with the BBMap package. For example, assume you have paired read files r1.fq and r2.fq, and you want 30x coverage of the human genome, which let's say is 3Gbp, so you want 90Gbp.
reformat.sh in1=r1.fq in2=r2.fq out1=sampled1.fq out2=sampled2.fq samplebasestarget=90000000000
You can add the ".gz" extension for compressed input and/or output, which I highly recommend when using such large files (e.g. "in1=r1.fq.gz" if the input file is gzipped, and "out1=sampled1.fq.gz" to produce compressed output).
For generating fake data, you can also use the BBMap package's randomreads utility, though in this case (since you plan to call variants) I would recommend Heng Li's wgsim utility, as it is designed specifically to generate data mimicking a real diploid human. Synthetic data is useful if you are bandwidth-constrained, time-constrained, are doing a benchmark that requires known answers, or can't find real data that exactly suits your needs (like a specific read length). If those constraints do not apply you should use real data.