The Logic Behind The Naming System In 1000 Genomes FTP Arrangements
11.6 years ago
Delinquentme ▴ 200

data/NA18603/sequence_read/ERR000103.filt.fastq.gz


Now I'm trying to figure out:
1) Are they all from the same human?
2) Why does the NA18* number change?
3) Why are there 3 versions of each ERR000* file? (I thought there would be 2, one for each of the paired reads.)

genome • 2.6k views
11.6 years ago
Mitch Bekritsky ★ 1.3k

They are not all from the same person. The NA18* number is the ID number for the individual being sequenced. There are 3 sequence files, one for PE1, one for PE2, and the final for reads where at least one of the paired ends didn't pass QC. The QC protocol, as well as a lot of other information on 1000 genomes sequencing data (including most of what I've told you here) can be found here. In the past, when I've had questions about 1000 genomes sequencing info, I've found their FTP site to be a great resource.
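To make the naming concrete, here is a minimal sketch that splits a path like the one in the question into its individual ID, run ID, and file role. It assumes the paired-end files carry `_1` and `_2` suffixes before `.filt.fastq.gz` and that the file without a suffix holds the reads whose mate failed QC; the function names and the `ROLES` labels are hypothetical, chosen only for illustration.

```python
import re
from collections import defaultdict

# Assumed layout: data/<individual>/sequence_read/<run>[_1|_2].filt.fastq.gz
PATTERN = re.compile(
    r"data/(?P<sample>[^/]+)/sequence_read/"
    r"(?P<run>[A-Z]RR\d+)(?:_(?P<mate>[12]))?\.filt\.fastq\.gz$"
)

# Hypothetical labels: no suffix is taken to mean the unpaired leftovers.
ROLES = {"1": "PE1", "2": "PE2", None: "unpaired"}


def classify(path):
    """Return (individual ID, run ID, role), or None if the path does not match."""
    m = PATTERN.search(path)
    if m is None:
        return None
    return m.group("sample"), m.group("run"), ROLES[m.group("mate")]


def group_by_run(paths):
    """Group file paths by (individual, run) so each run shows its up-to-3 files."""
    groups = defaultdict(dict)
    for p in paths:
        parsed = classify(p)
        if parsed is not None:
            sample, run, role = parsed
            groups[(sample, run)][role] = p
    return dict(groups)


if __name__ == "__main__":
    print(classify("data/NA18603/sequence_read/ERR000103.filt.fastq.gz"))
```

Grouping a directory listing this way shows that a different NA18* prefix means a different individual, while the three files that share one ERR ID belong to a single sequencing run.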


The sequence alignments I've run before were grouped by chromosome. So do these simply have all 23 chromosomes together in each file?


Yeah, should be. I've never checked the alignment files from 1K genomes, but generally, alignment output files are for all aligned regions. For 1K genomes, that means all autosomes, sex chromosomes, mitochondrial chromosome, and non-chromosomal supercontigs. The link to the 1K genome project's description of the alignment protocol is here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/README.human_g1k_v37.fasta.txt
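One way to check which reference sequences a given alignment file covers is to read the `@SQ` lines of its SAM header. The sketch below parses header text with plain Python; the example header is made up for illustration (a few b37-style sequence names), and `reference_names` is a hypothetical helper, not part of any 1000 Genomes tooling.

```python
def reference_names(sam_header_text):
    """Collect the SN (sequence name) value from every @SQ header line."""
    names = []
    for line in sam_header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields:
            names.append(fields["SN"])
    return names


# Illustrative header only: one autosome, one sex chromosome,
# the mitochondrial sequence, and one unplaced supercontig.
EXAMPLE_HEADER = "\n".join([
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:1\tLN:249250621",
    "@SQ\tSN:X\tLN:155270560",
    "@SQ\tSN:MT\tLN:16569",
    "@SQ\tSN:GL000207.1\tLN:4262",
])

if __name__ == "__main__":
    print(reference_names(EXAMPLE_HEADER))
```

If a file's header lists all chromosomes plus the supercontigs, the alignments are not split per chromosome.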


To be clear, that's all aligned regions in the one alignment output file...

11.6 years ago
Bert Overduin ★ 3.7k

Cheers, Bert