I have downloaded several FASTQ files for individuals in the 1000 Genomes Project. Typically these contain fewer than 10 million DNA sequences. Given that each sequence is 36 base pairs, and that they are intended to overlap so each part of the genome is covered three times, this represents less than 120 million bases of unique sequence (10 million × 36 bp ÷ 3). That is under 4% of the roughly 3 billion bp human genome.
E.g. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12878/sequence_read/SRR001783.filt.fastq.gz gives 9,457,109 Solexa-3623 DNA sequences of 36 bp each from the Sanger Institute.
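To make the arithmetic explicit, here is a rough sketch in Python of the estimate for this one file. The genome size of about 3.1 billion bp is my own assumption (it is not taken from the file), the threefold coverage is the figure I assumed above, and the function name is purely illustrative.

```python
# Assumed haploid human genome size in base pairs (approximate, not from the file).
GENOME_SIZE_BP = 3_100_000_000


def covered_fraction(num_reads, read_length_bp, fold_coverage,
                     genome_size_bp=GENOME_SIZE_BP):
    """Estimate the fraction of the genome covered by one file.

    Assumes the reads overlap evenly, so that dividing the total number
    of bases by the fold coverage gives the amount of unique sequence.
    """
    total_bases = num_reads * read_length_bp
    unique_bases = total_bases / fold_coverage
    return unique_bases / genome_size_bp


# Figures for SRR001783.filt.fastq.gz as quoted above.
num_reads = 9_457_109
read_length_bp = 36
fold_coverage = 3  # assumed threefold overlap

total_bases = num_reads * read_length_bp
fraction = covered_fraction(num_reads, read_length_bp, fold_coverage)

print(f"Total bases in file: {total_bases:,}")
print(f"Estimated fraction of genome covered: {fraction:.1%}")
```

Even with no overlap at all (fold_coverage = 1), the same calculation gives only about 11% of the genome, so a single file cannot account for the whole genome on its own.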
What am I missing?
Are the DNA sequences reported by Solexa-3623 only targeting the protein coding parts of the genome?
Any comments or help would be most welcome.