Question

Does anyone know any specific 1000 genomes WGS data that is around 30x coverage?


9.5 years ago

ebrown1955 ▴ 320

I noticed that the 1000 Genomes data files don't seem to state the coverage for each sample, only that some are high coverage and some are low coverage. I know that high coverage can be anywhere between 30x and 50x, but I'm looking for a sample that is closer to 30x so I can test my processing pipeline with data close to what I will be receiving.

I have also heard about generating "dummy" data, but I'm not sure how to go about doing this.

1000genomes controlsample • 4.7k views

updated 9.5 years ago by LauferVA 4.8k • written 9.5 years ago by ebrown1955 ▴ 320


You could also just downsample a higher-coverage data set, such as the Illumina Platinum Genomes, using Picard DownsampleSam:

http://www.illumina.com/platinumgenomes/
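As a rough illustration of the downsampling idea, here is a minimal Python sketch that works out the fraction of reads to keep. It assumes you already know the mean coverage of the input BAM; the 50x figure and the file names are hypothetical placeholders. The result is meant for DownsampleSam's PROBABILITY (P) argument.

```python
def downsample_probability(current_coverage: float, target_coverage: float) -> float:
    """Fraction of reads to keep so that mean coverage drops to the target."""
    if current_coverage <= 0 or target_coverage <= 0:
        raise ValueError("coverages must be positive")
    if target_coverage > current_coverage:
        raise ValueError("cannot downsample to a higher coverage")
    return target_coverage / current_coverage


# Hypothetical example: a 50x BAM reduced to roughly 30x.
p = downsample_probability(50.0, 30.0)

# File names are placeholders; adjust them to your own data.
print(f"java -jar picard.jar DownsampleSam I=input.bam O=downsampled_30x.bam P={p:.3f}")
```

Because reads are kept at random, the resulting coverage will only be approximately the target.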

updated 5.5 years ago by Ram 45k • written 9.5 years ago by trausch ★ 2.0k

Ram · Answer 1 · 2016-01-14

Most of the high-coverage PCR-free samples should be close to 30x. These samples are listed in the Supplement (Section: 3.3 High-coverage whole genome PCR-free sequencing):

http://www.nature.com/nature/journal/v526/n7571/extref/nature15393-s1.pdf

For HG03006, for instance, alignments to GRCh38 are here:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/BEB/HG03006/high_cov_alignment/

The directory contains a bas file that lists the #mapped_bases and #duplicate_bases, which you can use to calculate the exact coverage (mapped bases minus duplicate bases, divided by the genome size).
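The coverage calculation from a bas file could look something like the sketch below. It assumes the file is tab-delimited with one row per read group and that the relevant column headers are spelled `#_mapped_bases` and `#_duplicate_bases`; check the header of your actual file, since those names are an assumption here. The genome size of 3 Gbp is a round figure and the numbers in the example are made up.

```python
import csv
import io


def coverage_from_bas(bas_text: str, genome_size: int = 3_000_000_000) -> float:
    """Estimate mean coverage (duplicates excluded) from bas file contents."""
    reader = csv.DictReader(io.StringIO(bas_text), delimiter="\t")
    mapped = 0
    duplicate = 0
    for row in reader:
        # Column names are assumed; verify them against your bas file header.
        mapped += int(row["#_mapped_bases"])
        duplicate += int(row["#_duplicate_bases"])
    return (mapped - duplicate) / genome_size


# Made-up two-read-group example, not real 1000 Genomes values.
example_bas = (
    "readgroup\t#_mapped_bases\t#_duplicate_bases\n"
    "RG1\t50000000000\t2000000000\n"
    "RG2\t45000000000\t3000000000\n"
)
print(f"{coverage_from_bas(example_bas):.1f}x")
```

For a real file, read its contents with `open(path).read()` and pass the text to `coverage_from_bas`.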

score 0 · Answer 2 · 2016-01-15


9.5 years ago

LauferVA 4.8k

In addition to what others have said, the genomes sequenced by Complete Genomics (CGI) are all around 38x. At this point, I think about half of the 1000 Genomes samples have also been sequenced by CGI.


Ram · Answer 3 · 2016-01-15

For FASTQ files, you can downsample with the BBMap package. For example, assume you have paired read files r1.fq and r2.fq and want 30x coverage of the human genome. Taking the genome size as roughly 3 Gbp, you want 30 × 3 Gbp = 90 Gbp of sequence:

reformat.sh in1=r1.fq in2=r2.fq out1=sampled1.fq out2=sampled2.fq samplebasestarget=90000000000

You can add the ".gz" extension for compressed input and/or output, which I highly recommend when using such large files (e.g. "in1=r1.fq.gz" if the input file is gzipped, and "out1=sampled1.fq.gz" to produce compressed output).
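If you want a different coverage or genome size, the base target can be recomputed with a small helper like the one below. This is only a sketch of the arithmetic; the function name is made up, and the 3 Gbp default is the same round figure used above rather than an exact assembly length.

```python
def target_bases(coverage: float, genome_size: int = 3_000_000_000) -> int:
    """Total number of bases needed for the requested mean coverage."""
    if coverage <= 0 or genome_size <= 0:
        raise ValueError("coverage and genome_size must be positive")
    return round(coverage * genome_size)


# 30x of a ~3 Gbp genome, formatted as the reformat.sh argument.
print(f"samplebasestarget={target_bases(30)}")
```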

For generating fake data, you can also use the BBMap package's randomreads utility, though in this case (since you plan to call variants) I would recommend Heng Li's wgsim utility, as it is designed specifically to generate data mimicking a real diploid human. Synthetic data is useful if you are bandwidth- or time-constrained, are running a benchmark that requires known answers, or can't find real data that exactly suits your needs (such as a specific read length). If none of those constraints apply, you should use real data.
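When simulating reads, you need to tell the simulator how many read pairs to produce rather than a coverage. A minimal sketch of that conversion is below, assuming paired-end reads of equal length; the 150 bp read length, the 3 Gbp genome size, and the file names are illustrative assumptions. The printed command uses wgsim's -N (number of read pairs) and -1/-2 (read lengths) options.

```python
import math


def read_pairs_for_coverage(
    coverage: float, read_length: int, genome_size: int = 3_000_000_000
) -> int:
    """Number of read pairs needed for the requested mean coverage."""
    if coverage <= 0 or read_length <= 0 or genome_size <= 0:
        raise ValueError("all arguments must be positive")
    # Each pair contributes two reads of read_length bases.
    return math.ceil(coverage * genome_size / (2 * read_length))


# Hypothetical 30x run with 2 x 150 bp reads.
n_pairs = read_pairs_for_coverage(30, 150)

# ref.fa and the output names are placeholders.
print(f"wgsim -N {n_pairs} -1 150 -2 150 ref.fa sim_1.fq sim_2.fq")
```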