Question

1000 genomes phase 3 data representation

6.5 years ago

puneet.as • 0

Hello, can I get a brief overview of the data stored in the 1000 Genomes phase 3 dataset? For a single sample such as HG00113, there is a multitude of paired-end FASTQ files on their FTP server:

ERR020088_1.filt.fastq.gz 24.6 GB
ERR020088_2.filt.fastq.gz 24.7 GB
ERR229776.filt.fastq.gz 360 MB
ERR229776_1.filt.fastq.gz 9.4 GB
ERR229776_2.filt.fastq.gz 9.7 GB
SRR070517.filt.fastq.gz 7.4 MB
SRR070517_1.filt.fastq.gz 2.2 GB
SRR070517_2.filt.fastq.gz 2.3 GB
SRR070802.filt.fastq.gz 6.8 MB
SRR070802_1.filt.fastq.gz 2.2 GB
SRR070802_2.filt.fastq.gz 2.3 GB

Can someone explain how to interpret the data? Do these files all come from the same sample, or are different samples included under the same run accession?

Why do I get multiple sets of paired-end reads?

1000 genomes phase 3 sequence data • 1.5k views

score 1 · Answer 1 · 2017-11-02

Hey,

Are you sure that those files relate to HG00113? They appear to relate to HG00101: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00101/sequence_read/

Firstly, the 'filt' suffix just indicates that the 1000 Genomes consortium has done some QC filtering of the reads, which may or may not be welcome:

These are the checks the DCC makes on the archive fastq files.

    Syntax Checks:

    -Each header line begins with @
    -The third line always starts with a +
    -There are four lines in each entry (implied by the above two rules)
    -On line3, if a name follows the + sign, the name has to match the one found in line1
    -The sequence and quality lines are the same length
    -For paired end files, the _1 and _2 files have the same number of reads in them. 
    -For SOLID colourspace fastq, each read starts with a base followed by a string of numbers

    Sequence Checks:

    -Read is longer than 35bp for Solexa, 25bp for Solid, and 30 bp for 454
    -Read does not contain any N's in the first 25, 30 or 35bp
    -Quality values are all 2 or higher in the first 25bp, 30bp or 35bp
    -The reads contain more than one type of base in the first 25, 30, or 35bp
    -Read does not contain more than 50% Ns in its whole length
    -Read does not contain characters other than ATGCN (this rule does not apply to SOLID reads)

The output files get the extension .filt.fastq.gz to indicate they have been filtered.

[source: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/historical_data/former_toplevel/README.sequence_data]
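To make the quoted rules a bit more concrete, here is a rough Python sketch of how the per-read checks could look. This is not the DCC's actual code. It only handles base-space reads (no SOLiD colourspace), it assumes Phred+33 quality encoding, and it reads "longer than 35bp" as "at least 35 bp", which is my interpretation rather than something the README spells out. The function name `check_record` and the `WINDOW` table are made up for illustration.

```python
from typing import Optional

# Minimum read length and leading-window size per platform, taken from the
# README text quoted above. SOLiD is left out: this sketch is base-space only.
WINDOW = {"solexa": 35, "454": 30}


def check_record(header: str, seq: str, plus: str, qual: str,
                 platform: str = "solexa",
                 phred_offset: int = 33) -> Optional[str]:
    """Return None if the FASTQ record passes, else a short failure reason.

    Assumptions (not from the README): Phred+33 qualities, and a read of
    exactly the threshold length is accepted.
    """
    n = WINDOW[platform]

    # --- Syntax checks ---
    if not header.startswith("@"):
        return "header does not start with @"
    if not plus.startswith("+"):
        return "third line does not start with +"
    if len(plus) > 1 and plus[1:] != header[1:]:
        return "name after + does not match header"
    if len(seq) != len(qual):
        return "sequence and quality lengths differ"

    # --- Sequence checks ---
    bases = seq.upper()
    if len(bases) < n:
        return "read too short"
    head = bases[:n]
    if "N" in head:
        return "N in leading window"
    if any(ord(c) - phred_offset < 2 for c in qual[:n]):
        return "low quality in leading window"
    if len(set(head)) < 2:
        return "only one base type in leading window"
    if bases.count("N") * 2 > len(bases):
        return "more than 50% N"
    if set(bases) - set("ACGTN"):
        return "unexpected characters"
    return None
```

The paired-end rule (the _1 and _2 files must contain the same number of reads) is a per-file check, so it is not covered by this per-record function.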

In terms of the files themselves, it's DNA from the same sample that's being sequenced, but in separate runs by different centers, sequencers, and protocols (some are even exome-seq runs). That is why you see several sets of paired-end reads: one set per run accession. Information on each run can be pulled from the following 64-megabyte file: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/20130502.phase3.sequence.index

It may be useful to download this and then use grep to extract the information for your files of interest!
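If you would rather do the same thing in Python than with grep, something along these lines should work once the index has been downloaded. It is only a sketch. I am assuming the index is tab-delimited, that lines starting with `##` are comments, and that the first remaining line is a header (possibly with a leading `#`); check this against the actual file. The function names `rows_for_sample` and `group_by` are my own, and a column name such as `RUN_ID` is an assumption to verify against the header.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def rows_for_sample(lines: Iterable[str], sample: str) -> List[Dict[str, str]]:
    """Return index rows (dicts keyed by header name) that mention `sample`.

    Assumes a tab-delimited file whose first non-'##' line is the header.
    """
    header = None
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and '##' comment lines
        fields = line.split("\t")
        if header is None:
            # First remaining line is taken to be the header row.
            header = [name.lstrip("#") for name in fields]
            continue
        if sample in fields:  # exact match on any tab-delimited field
            hits.append(dict(zip(header, fields)))
    return hits


def group_by(rows: List[Dict[str, str]], column: str) -> Dict[str, List[Dict[str, str]]]:
    """Group rows by the value in `column`, e.g. a run accession column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.get(column, "")].append(row)
    return dict(groups)
```

You would call it with something like `rows_for_sample(open("20130502.phase3.sequence.index"), "HG00101")` and then `group_by(rows, "RUN_ID")` to see which files belong to which run, along with whatever center and platform columns the header provides.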

Good luck,

Kevin