Question

1000 Genomes Project: Why Is Only 3% of the DNA Covered?

12.4 years ago

W Langdon ▴ 90

I have downloaded several fastq files for individuals in the 1000 Genomes Project. Typically each contains fewer than 10 million DNA sequences. Each sequence is 36 base pairs, so a file holds fewer than 360 million bases in total. Since the reads are intended to overlap so that each part of the genome is covered three times, this represents less than 120 million bases of genome. That is only about 3 to 4% of the total human genome.

E.g. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12878/sequence_read/SRR001783.filt.fastq.gz gives 9,457,109 Solexa-3623 DNA sequences of 36 bp each from the Sanger Institute.

What am I missing?

Are the DNA sequences reported by Solexa-3623 only targeting the protein coding parts of the genome?

Any comments or help would be most welcome.

Thank you

Bill

genome fastq next-gen sequencing • 3.5k views

Ram · Answer 1 · 2011-11-15

Look at the raw sequence index file: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/sequence.index

Column descriptions are here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README.sequence_data

It looks like it should be a high coverage (last column) genomic sequencing library.

The SRA run id is SRR001783: https://trace.ddbj.nig.ac.jp/DRASearch/run?acc=SRR001783

Perhaps the sequencing center ran the samples on just one lane because they have other sequencing jobs on other lanes?

To get all the data for the 6 individuals sequenced, you'll probably have to go through that sequence.index file and download files that have the sample_id you want and "high coverage" as the last column.
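One way to do that filtering is sketched below in Python. The column positions (column 10 for the sample name, column 26 for the analysis group) are inferred from the `cut -f` field list and output quoted elsewhere in this thread, so treat them as assumptions and check them against README.sequence_data. `select_runs` is a hypothetical helper name, not part of any 1000 Genomes tooling.

```python
import csv

# Assumed 0-based column positions in sequence.index; verify them
# against README.sequence_data before relying on them.
FASTQ_FILE = 0       # path of the fastq file on the FTP site
SAMPLE_NAME = 9      # e.g. NA12878
ANALYSIS_GROUP = 25  # last column, e.g. "high coverage"


def select_runs(lines, sample, group="high coverage"):
    """Return the fastq paths for one sample in one analysis group."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    paths = []
    for row in reader:
        if len(row) <= ANALYSIS_GROUP:
            continue  # skip short or malformed lines
        if row[SAMPLE_NAME] == sample and row[ANALYSIS_GROUP].strip() == group:
            paths.append(row[FASTQ_FILE])
    return paths
```

With the index downloaded locally, something like `select_runs(open("sequence.index", newline=""), "NA12878")` should list the files to fetch for that individual.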

Pierre Lindenbaum · Answer 2 · 2011-11-15

As DK pointed out, you can get more information about each run from the sequence index file:

laura@1000genomes[20110521]:grep SRR001783 /nfs/1000g-archive/vol1/ftp/sequence.index | cut -f1,3,4,6,8,10,11,14,19,24,25,26
data/NA12878/sequence_read/SRR001783.filt.fastq.gz      SRR001783       SRP000032       BI      2008-04-08 00:00:00     NA12878 CEU     Illumina Genome Analyzer II     SINGLE  9457109    340455924       high coverage

This fastq file contains 9,457,109 reads, which is actually 340,455,924 base pairs (9,457,109 × 36). For one lane of an early-2008 single-ended GAII run, that is about right.
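The arithmetic can be checked with a few lines of Python. The read count and read length come from the thread; the genome size of 3.1 billion bases is an approximate figure assumed here for illustration, and the 3x target is the one used in the question.

```python
READS = 9_457_109     # READ_COUNT for SRR001783 in sequence.index
READ_LENGTH = 36      # bp per read, as stated in the question
GENOME_SIZE = 3.1e9   # assumption: approximate human genome size in bases

total_bases = READS * READ_LENGTH     # bases in this one file
coverage = total_bases / GENOME_SIZE  # average depth from this one lane
lanes_for_3x = 3 / coverage           # lanes like this needed for 3x depth

print(total_bases)  # 340455924, matching BASE_COUNT in the index
print(f"{coverage:.2f}x coverage from this lane")
print(f"roughly {lanes_for_3x:.0f} such lanes for 3x")
```

Under these assumptions a single lane gives only about a tenth of one-fold coverage, which is why many runs have to be combined for each individual.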

Each of the high coverage individuals has about 900 individual runs associated with them, from different machines, different centers and different platforms.

Remember that the 1000 Genomes Project has now been running for more than 3 years. At the start, sequencing technology produced nowhere near the volume of data it does now.

score 1 · Answer 3 · 2011-11-15

I think that some individuals in the 1000G project first have some survey sequencing done. Later, deeper sequencing will be performed, likely with a different technology. Thus, what you have for those few individuals is the preliminary data.

Remember, not all individuals in the 1000G project will see the same depth of coverage.

score 0 · Answer 4 · 2011-11-15

Dear DK, Pierre and Larry, Thank you for your kind, helpful and prompt replies. It seems pretty clear that I have misinterpreted the 1000 Genomes Project metadata :-( I had interpreted SRR001783 as relating to the individual rather than NA12878, and so had also got the population as TSI rather than CEU. It seems person NA12878 has been well studied, with (according to today's sequence.index) a total of 2186 fastq files on the FTP site.

There are 95 files like SRR001783.filt.fastq.gz (i.e. Solexa-3623), but 8 of these have been withdrawn, leaving 87 with a total BASE_COUNT of 29923 million (33386.1 million including those withdrawn).

Bill

PS: Assuming we need three times as many bases to be sequenced as there are in the genome (to get sufficient redundancy), there are only 10 individuals (NA19707 HG01101 NA11832 NA19474 NA18867 NA11994 NA12878 NA12891 NA12892) which might have been fully sequenced giving data like SRR001783.filt.fastq.gz (i.e. BI, Illumina Genome Analyzer II, Solexa*, SINGLE ended, and a total BASE_COUNT exceeding 9858 million).
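A tally of this kind could be sketched in Python as follows. The column positions for the sample name and BASE_COUNT are inferred from the `cut -f` fields quoted in this thread. The position of the WITHDRAWN flag, and the assumption that `1` marks a withdrawn file, are not shown anywhere in the thread and should be checked against README.sequence_data. The helper names are hypothetical, and the sketch does not apply the extra filters on center, instrument, library name or layout mentioned above.

```python
from collections import defaultdict

# Assumed 0-based column positions in sequence.index.
SAMPLE_NAME = 9  # e.g. NA12878
WITHDRAWN = 20   # assumption: "1" marks a withdrawn file
BASE_COUNT = 24  # number of bases in the fastq file


def bases_per_sample(lines, include_withdrawn=False):
    """Sum BASE_COUNT per sample, skipping withdrawn files by default."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= BASE_COUNT or not fields[BASE_COUNT].isdigit():
            continue  # skip the header and malformed lines
        if fields[WITHDRAWN] == "1" and not include_withdrawn:
            continue
        totals[fields[SAMPLE_NAME]] += int(fields[BASE_COUNT])
    return dict(totals)


def samples_above(totals, threshold=9_858_000_000):
    """Return sample names whose total exceeds the threshold (9858 million by default)."""
    return sorted(name for name, bases in totals.items() if bases > threshold)
```

Calling `bases_per_sample` once with and once without `include_withdrawn=True` would give the two kinds of totals quoted for the Solexa-3623 files, provided the input lines are first restricted to that library.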