The Logic Behind The Naming System In 1000Genomes Ftp Arrangements
11.6 years ago
Delinquentme ▴ 200

I've got these links:

data/NA18603/sequence_read/ERR000103.filt.fastq.gz
data/NA18603/sequence_read/ERR000103_1.filt.fastq.gz
data/NA18603/sequence_read/ERR000103_2.filt.fastq.gz
data/NA18542/sequence_read/ERR000104.filt.fastq.gz
data/NA18542/sequence_read/ERR000104_1.filt.fastq.gz
data/NA18542/sequence_read/ERR000104_2.filt.fastq.gz
data/NA18582/sequence_read/ERR000105.filt.fastq.gz
data/NA18582/sequence_read/ERR000105_1.filt.fastq.gz
data/NA18582/sequence_read/ERR000105_2.filt.fastq.gz
data/NA18592/sequence_read/ERR000106.filt.fastq.gz
data/NA18592/sequence_read/ERR000106_1.filt.fastq.gz
data/NA18592/sequence_read/ERR000106_2.filt.fastq.gz
data/NA18605/sequence_read/ERR000107.filt.fastq.gz
data/NA18605/sequence_read/ERR000107_1.filt.fastq.gz
data/NA18605/sequence_read/ERR000107_2.filt.fastq.gz
data/NA18592/sequence_read/ERR000108.filt.fastq.gz
data/NA18592/sequence_read/ERR000108_1.filt.fastq.gz
data/NA18592/sequence_read/ERR000108_2.filt.fastq.gz
data/NA12234/sequence_read/ERR000130.filt.fastq.gz
data/NA12234/sequence_read/ERR000130_1.filt.fastq.g

Now I'm trying to figure out:
1) Are they all from the same human?
2) Why does the NA18* number change?
3) Why are there 3 versions of each ERR000* file? (I thought matched files came in 2s, for paired reads.)

genome • 2.6k views
11.6 years ago
Mitch Bekritsky ★ 1.3k

They are not all from the same person. The NA18* number is the ID of the individual who was sequenced, so it changes whenever the sample changes. Each run has 3 sequence files: one for PE1 (_1), one for PE2 (_2), and a final, unnumbered one for reads whose mate didn't pass QC. The QC protocol, as well as a lot of other information on 1000 Genomes sequencing data (including most of what I've told you here), can be found in the project's documentation. In the past, when I've had questions about 1000 Genomes sequencing info, I've found their FTP site to be a great resource.
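To make the naming concrete, here is a minimal Python sketch that splits each path into sample ID, run accession, and file type. The regular expression is only inferred from the paths listed in the question, and `parse_path` and `runs_by_sample` are hypothetical names for illustration, not part of any 1000 Genomes tooling.

```python
import re
from collections import defaultdict

# Pattern inferred from the paths in the question; an assumption, not an official spec.
PATH_RE = re.compile(
    r"data/(?P<sample>[^/]+)/sequence_read/"
    r"(?P<run>[EDS]RR\d+)(?:_(?P<mate>[12]))?\.filt\.fastq\.gz$"
)

def parse_path(path):
    """Return (sample, run, kind) for a path, or None if it does not match."""
    m = PATH_RE.match(path)
    if m is None:
        return None
    # _1 and _2 are the paired ends; no suffix means reads left without a mate.
    kind = {"1": "PE1", "2": "PE2", None: "unpaired"}[m.group("mate")]
    return m.group("sample"), m.group("run"), kind

paths = [
    "data/NA18603/sequence_read/ERR000103.filt.fastq.gz",
    "data/NA18603/sequence_read/ERR000103_1.filt.fastq.gz",
    "data/NA18603/sequence_read/ERR000103_2.filt.fastq.gz",
    "data/NA18592/sequence_read/ERR000106_1.filt.fastq.gz",
    "data/NA18592/sequence_read/ERR000108_1.filt.fastq.gz",
]

# Group run accessions by individual to see which runs share a sample.
runs_by_sample = defaultdict(set)
for p in paths:
    parsed = parse_path(p)
    if parsed is not None:
        sample, run, kind = parsed
        runs_by_sample[sample].add(run)
        print(sample, run, kind)
```

Grouping this way shows that one individual (NA18592 in the list above) can have more than one run, while different NA IDs are different people.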


The sequence alignments I've run before were grouped by chromosome. So do these simply have all 23 chromosomes in each file?


Yeah, should be. I've never checked the alignment files from 1K genomes, but generally, alignment output files are for all aligned regions. For 1K genomes, that means all autosomes, sex chromosomes, mitochondrial chromosome, and non-chromosomal supercontigs. The link to the 1K genome project's description of the alignment protocol is here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/README.human_g1k_v37.fasta.txt
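As a rough illustration of those categories, the sketch below sorts reference sequence names into the four groups. It assumes the human_g1k_v37 naming style (1 to 22, X, Y, MT, and GL-prefixed supercontigs such as GL000207.1); `classify_contig` is a hypothetical helper, and names from other reference builds would fall into "other".

```python
import re

def classify_contig(name):
    """Classify a reference sequence name, assuming human_g1k_v37-style names."""
    if name.isdigit() and 1 <= int(name) <= 22:
        return "autosome"
    if name in ("X", "Y"):
        return "sex chromosome"
    if name == "MT":
        return "mitochondrial"
    # Unplaced/unlocalized supercontigs carry a GL accession with a version suffix.
    if re.fullmatch(r"GL\d+\.\d+", name):
        return "supercontig"
    return "other"

for name in ["1", "22", "X", "MT", "GL000207.1"]:
    print(name, classify_contig(name))
```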


To be clear, that's all aligned regions in the one alignment output file...

11.6 years ago
Bert Overduin ★ 3.7k

These kinds of questions are normally answered by reading the README file... :)

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README.sequence_data

Cheers, Bert
