hg19 versus GRCh37
10.3 years ago
Can anyone explain why these two chromosome 1 files are different (the same goes for the other chromosomes as well)? I was under the impression that hg19 and GRCh37 are the same reference genome, but it looks like the hg19 version has a run of leading N placeholders, which can affect searching the two genomes by position.

ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh37.p2_chr1.fa.gz

chr1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN


>gi|224514618|ref|NT_077402.2| Homo sapiens chromosome 1 genomic contig, GRCh37.p2 reference primary assembly
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAA
CCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCT
AACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTA
ACCCTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACC
CTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTA
ACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCG
CCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCACCGAAATCTGTGCAGAGGAC
AACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTTGCAAAGG


Thank you!

It looks like my question was answered, though, once the "correct" link was posted.

10.3 years ago
You must have run "head" on the wrong file. Please try again. hs_ref_GRCh37.p2_chr1.fa has lots of "N" bases at the beginning.

EDIT: GRC distributes the reference genome in two versions: one as contigs and the other as assembled chromosomes. The latter is in the "Assembled_chromosomes" directory. I do not know who uses the contigs, but nearly everyone I know uses assembled chromosomes only.

Ahh, you're right. However, this is the file I ran "head" on: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_01/hs_ref_GRCh37.p2_chr1.fa.gz

That file is found in a different directory, same name though?

That file contains contigs, not chromosomes. Different things.
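One way to tell the two kinds of file apart without reading the sequence is to look only at the FASTA header lines. The sketch below assumes a POSIX shell with gzip, grep and head available; `fasta_headers` and `fasta_record_count` are hypothetical helper names made up for this illustration, not part of any toolkit.

```shell
# Print the first N header lines (default 5) of a FASTA file, gzipped or not.
# fasta_headers is a hypothetical helper name used only for this sketch.
fasta_headers() {
    file=$1
    n=${2:-5}
    case "$file" in
        *.gz) gzip -dc "$file" ;;
        *)    cat "$file" ;;
    esac | grep '^>' | head -n "$n"
}

# Count the sequence records (header lines) in a FASTA file, gzipped or not.
fasta_record_count() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | grep -c '^>'
}
```

Running `fasta_headers` on each download shows which one you have: a contig file lists contig records such as the NT_077402.2 header quoted in the question, while an assembled chromosome file would be expected to contain far fewer records.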

10.3 years ago
The files are not different - they contain identical sequence and both begin with a run of 'N' characters.

For future reference, here's a quick way to count the bases:

# chr1
tail -n+2 chr1.fa | awk '{ for ( i=1; i<=length; i++ ) arr[substr($0, i, 1)]++ }END{ for ( i in arr ) { print i, arr[i] } }'
A 32485284
N 23970000
C 23064132
T 32559153
G 23070958
a 33085607
c 23960280
g 23945604
t 33109603

# hs_ref_GRCh37.p2_chr1
tail -n+2 hs_ref_GRCh37.p2_chr1.fa | awk '{ for ( i=1; i<=length; i++ ) arr[substr($0, i, 1)]++ }END{ for ( i in arr ) { print i, arr[i] } }'
A 65570891
N 23970000
C 47024412
G 47016562
T 65668756


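If you sum the uppercase and lowercase counts for chr1.fa (the hg19 file marks repeats in lowercase), you'll see that both files contain exactly the same count for each base, for example A: 32,485,284 + 33,085,607 = 65,570,891, and that the total length is 249,250,621.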
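The summing can also be folded into the count itself by uppercasing each line first. This is a minimal sketch of that variation, assuming a POSIX awk and sort; `count_bases_ci` is a hypothetical helper name, and unlike the one-liners above it skips every header line rather than only the first line of the file.

```shell
# count_bases_ci FILE: per-base counts for a FASTA file, ignoring case.
# Hypothetical helper for this sketch; header lines (starting with ">") are skipped.
count_bases_ci() {
    awk '!/^>/ {
        line = toupper($0)                 # fold soft-masked bases into uppercase
        for (i = 1; i <= length(line); i++)
            count[substr(line, i, 1)]++
    }
    END { for (b in count) print b, count[b] }' "$1" | LC_ALL=C sort
}
```

Writing the output for each file to its own text file and running diff on the two should then print nothing if the base composition matches. Identical counts do not prove the sequences are identical base for base; for that, compare the uppercased sequences directly or compare checksums of them.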

Thank you. Yes, I realized I provided the wrong link. It looks like the file at ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_01/hs_ref_GRCh37.p2_chr1.fa.gz does not match the file at ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh37.p2_chr1.fa.gz

10.3 years ago
There are some differences here and there (e.g. the "chr" prefix is omitted, and the annotations seem to be 0-indexed vs. 1-indexed), but nothing as drastic as what you showed. It is still a major pain, though.

GATK has clearly taken sides by making b37 its primary reference genome, but PLINK and GWAS studies are hg19-based. So I guess we will have to deal with this issue a little while longer.
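For the naming difference alone, a small filter can rewrite the chromosome column of a tab-delimited file. This is only a sketch under stated assumptions: the chromosome name is in column 1, lines starting with "#" are headers, and `ucsc_to_b37_names` is a hypothetical helper name. It does nothing about the 0-indexed vs. 1-indexed issue and does not handle unplaced or alternate contig names.

```shell
# ucsc_to_b37_names: read tab-delimited records on stdin and rewrite column 1
# from "chr"-prefixed names to prefix-free names. Hypothetical helper for this sketch.
ucsc_to_b37_names() {
    awk 'BEGIN { FS = OFS = "\t" }
    /^#/ { print; next }                 # pass header lines through unchanged
    {
        if ($1 == "chrM") $1 = "MT"      # assumption: a name mapping only
        else sub(/^chr/, "", $1)         # chr1 -> 1, chrX -> X
        print
    }'
}
```

For example, `ucsc_to_b37_names < regions.bed > regions.b37.bed`, where regions.bed is a placeholder file name. Renaming chrM to MT changes the label only; it does not establish that the two mitochondrial sequences are the same.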

