Question: Does anyone know any specific 1000 genomes WGS data that is around 30x coverage?
ebrown1955 wrote, 3.2 years ago (United States):

I noticed that the 1000 Genomes data files don't seem to say what the coverage is for each sample, only that some are high coverage and some are low coverage. I know that high coverage can be anywhere between 30x and 50x; however, I'm looking for a sample that is closer to 30x, so I can test my processing pipeline on data close to what I will be receiving.

I have also heard about generating "dummy" data, however I'm not sure how to go about doing this.

Tags: controlsample, 1000genomes

trausch commented, 3.2 years ago:

You could also just downsample a higher-coverage data set, such as the Illumina Platinum Genomes, using Picard's DownsampleSam.
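As a sketch of that approach (the file names and probability below are hypothetical; DownsampleSam keeps each read pair with the given probability, so going from ~50x to ~30x means keeping roughly 30/50 = 0.6 of the reads):

```shell
# Hypothetical input/output names; P is the fraction of read pairs to keep.
picard DownsampleSam \
    I=platinum_50x.bam \
    O=platinum_30x.bam \
    P=0.6 \
    STRATEGY=HighAccuracy
```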
trausch wrote, 3.2 years ago:

Most of the high-coverage PCR-free samples should be close to 30x. These samples are listed in the Supplement (Section 3.3, "High-coverage whole genome PCR-free sequencing").

For HG03006, for instance, alignments to GRCh38 are here:

There is a .bas file that lists #mapped_bases and #duplicate_bases, which you can use to calculate the exact coverage.
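A small sketch of that calculation (the header names come from the answer above, but the sample numbers are fabricated for illustration; real .bas files carry more columns and one row per read group, so check the header of your file):

```shell
# Sketch: estimate coverage from a .bas file.
# Coverage ≈ (#mapped_bases - #duplicate_bases) / genome size (~3.1 Gbp).

# Fabricated stand-in for a real .bas file (tab-separated):
printf 'sample\t#mapped_bases\t#duplicate_bases\n' >  example.bas
printf 'HG03006\t96100000000\t3100000000\n'        >> example.bas

awk -F'\t' '
NR == 1 {                       # locate the columns by header name
    for (i = 1; i <= NF; i++) {
        if ($i == "#mapped_bases")    m = i
        if ($i == "#duplicate_bases") d = i
    }
    next
}
{ mapped += $m; dup += $d }     # sum over read-group rows
END { printf "%.1fx\n", (mapped - dup) / 3.1e9 }
' example.bas
```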


Vincent Laufer wrote, 3.2 years ago (United States):

In addition to what others have said, the Complete Genomics (CGI) genomes are all around 38x. At this point, I think about half of the 1000 Genomes samples have also been sequenced by CGI.

Brian Bushnell wrote, 3.2 years ago (Walnut Creek, USA):

For FASTQ files, you can downsample with the BBMap package (the reformat.sh tool). For example, assume you have paired read files r1.fq and r2.fq and you want 30x coverage of the human genome; at roughly 3 Gbp, that means you want 90 Gbp of sequence:

reformat.sh in1=r1.fq in2=r2.fq out1=sampled1.fq out2=sampled2.fq samplebasestarget=90000000000

You can add the ".gz" extension for compressed input and/or output, which I highly recommend when using such large files (e.g. "in1=r1.fq.gz" if the input file is gzipped, and "out1=sampled1.fq.gz" to produce compressed output).

For generating fake data, you can also use the BBMap package's randomreads utility, though in this case (since you plan to call variants) I would recommend Heng Li's wgsim utility, as it is designed specifically to generate data mimicking a real diploid human.  Synthetic data is useful if you are bandwidth-constrained, time-constrained, are doing a benchmark that requires known answers, or can't find real data that exactly suits your needs (like a specific read length).  If those constraints do not apply you should use real data.
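For instance, a minimal wgsim invocation might look like this (the reference path, read length, and output names are placeholders; the read-pair count follows from coverage × genome size ÷ bases per pair):

```shell
# Sketch: simulate ~30x of paired 2x150 bp reads with wgsim.
# GRCh38.fa is a placeholder path to your reference FASTA.
# 30x * 3.1 Gbp / (2 * 150 bp) = 310,000,000 read pairs.
wgsim -N 310000000 -1 150 -2 150 \
    GRCh38.fa sim_r1.fq sim_r2.fq > sim_mutations.txt
```

wgsim writes the simulated reads to the two FASTQ files and prints the list of introduced mutations to stdout, which is why it is redirected to a file here.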
