hg19 versus GRCh37
10.3 years ago
Sam ▴ 90

Can anyone explain why these two chromosome 1 files are different (the same goes for the other chromosomes)? I'm under the impression that hg19 and GRCh37 are the same reference genome, but the hg19 version appears to have a long run of leading N placeholders, which would affect searching the two genomes by position.

ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh37.p2_chr1.fa.gz

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz

---> head chr1.fa

chr1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

-----> head hs_ref_GRCh37.p2_chr1.fa

>gi|224514618|ref|NT_077402.2| Homo sapiens chromosome 1 genomic contig, GRCh37.p2 reference primary assembly
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAA
CCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCT
AACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTA
ACCCTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACC
CTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTA
ACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCG
CCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCACCGAAATCTGTGCAGAGGAC
AACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTTGCAAAGG

Thank you!

genome hg • 30k views

I meant this file: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_01/hs_ref_GRCh37.p2_chr1.fa.gz

Looks like my question has been answered though by posting the "correct" link

10.3 years ago
lh3 32k

You must have run "head" on the wrong file. Please try again. hs_ref_GRCh37.p2_chr1.fa has lots of "N" bases at the beginning.

EDIT: GRC distributes the reference genome in two versions: one as contigs and the other as assembled chromosomes. The latter is in the "Assembled_chromosomes" directory. I do not know who uses the contigs, but nearly everyone I know uses the assembled chromosomes only.
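To check how long the leading run of "N" actually is in a given file, something like the following might help. It is only a sketch: `leading_n_count` is a name made up for this example, and the demo file holds made-up sequence rather than real hg19 or GRCh37 data.

```shell
# Print the number of N/n characters that precede the first non-N base
# in the first record of a FASTA file (hypothetical helper).
leading_n_count() {
    awk '/^>/ { if (seen++) exit; next }          # stop at a second record
         { line = toupper($0)
           if (match(line, /[^N]/)) {             # first real base on this line
               count += RSTART - 1
               print count
               found = 1
               exit
           }
           count += length(line) }                # line is all N, keep counting
         END { if (!found) print count + 0 }' "$1"
}

# Tiny stand-in file, for illustration only.
demo_fa=$(mktemp)
printf '>chr1\nNNNNN\nNNNAC\nGT\n' > "$demo_fa"
leading_n_count "$demo_fa"
rm -f "$demo_fa"
```

For a gzipped download, decompressing first (for example with `gzip -dc`) and pointing the function at the result should work the same way.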


Ahh, you're right. However, this is the file I was running the "head" command on: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_01/hs_ref_GRCh37.p2_chr1.fa.gz

So that file is in a different directory but has the same name?


Thanks btw for your help


That file contains contigs, not chromosomes. Different things.
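A quick way to tell the two kinds of file apart is to count the FASTA header lines: an assembled chromosome should be a single record, while a contig file holds many. A rough sketch, where `count_records` and the two demo files are made up for illustration:

```shell
# Count FASTA records; header lines start with ">" (hypothetical helper).
count_records() {
    grep -c '^>' "$1"
}

# Tiny stand-in files with made-up sequences, for illustration only.
demo_dir=$(mktemp -d)
printf '>chr1\nNNNNACGT\nACGTNNNN\n' > "$demo_dir/assembled.fa"
printf '>contig_A\nACGT\n>contig_B\nGGCC\n>contig_C\nTTAA\n' > "$demo_dir/contigs.fa"

count_records "$demo_dir/assembled.fa"   # single record: one assembled chromosome
count_records "$demo_dir/contigs.fa"     # several records: separate contigs
rm -rf "$demo_dir"
```

Looking at the header text itself (`grep '^>' file.fa | head`) also helps, since the contig header shown in the question says "genomic contig" outright.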

10.3 years ago
Neilfws 49k

The files are not different - they contain identical sequence and both begin with a run of 'N' characters.

For future reference, here's a quick way to count the bases:

# chr1
tail -n+2 chr1.fa | awk '{ for ( i=1; i<=length; i++ ) arr[substr($0, i, 1)]++ }END{ for ( i in arr ) { print i, arr[i] } }'
A 32485284
N 23970000
C 23064132
T 32559153
G 23070958
a 33085607
c 23960280
g 23945604
t 33109603

# hs_ref_GRCh37.p2_chr1
tail -n+2 hs_ref_GRCh37.p2_chr1.fa | awk '{ for ( i=1; i<=length; i++ ) arr[substr($0, i, 1)]++ }END{ for ( i in arr ) { print i, arr[i] } }'
A 65570891
N 23970000
C 47024412
G 47016562
T 65668756

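If you do some summing, you'll see that both files contain exactly the same count for each base once the lowercase (soft-masked) bases in chr1.fa are added to their uppercase counterparts, and the total length of each is 249,250,621.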
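The summing can be done in awk as well. This is a small sketch that reuses the chr1.fa counts printed above; `fold_case` is just a name chosen for this example.

```shell
# Per-character counts for UCSC chr1.fa, copied from the output above.
ucsc_counts='A 32485284
N 23970000
C 23064132
T 32559153
G 23070958
a 33085607
c 23960280
g 23945604
t 33109603'

# Fold lowercase counts into uppercase, then print per-base totals
# in a fixed order, followed by the overall length.
fold_case() {
    awk '{ sum[toupper($1)] += $2; total += $2 }
         END { n = split("A C G T N", order, " ")
               for (i = 1; i <= n; i++) print order[i], sum[order[i]]
               print "total", total }'
}

printf '%s\n' "$ucsc_counts" | fold_case
```

The per-base totals come out as A 65570891, C 47024412, G 47016562, T 65668756 and N 23970000, which match the hs_ref_GRCh37.p2_chr1 counts, and the last line gives the total of 249250621.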


Thank you. Yes, I realized I provided the wrong link. It looks like the file here ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_01/hs_ref_GRCh37.p2_chr1.fa.gz does not match the file here ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh37.p2_chr1.fa.gz

10.3 years ago
Allpowerde ★ 1.3k

There are some differences here and there (e.g. the "chr" prefix is omitted in one, and the annotations seem to be 0-indexed vs. 1-indexed), but nothing as drastic as you showed. It is still a major pain though.

GATK has clearly sided with b37 as its primary reference genome, but PLINK and GWAS studies are hg19-based. So I guess we will have to deal with this issue a little bit longer.
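For the naming difference, a rename of the first column is often enough to move tab-separated interval files from "chr1"-style names to "1"-style names. A minimal sketch, assuming the chromosome name is in column 1; `ucsc_to_b37_names` is a made-up helper, and mapping "chrM" to "MT" is an assumption about how the two sets name the mitochondrial sequence.

```shell
# Rewrite UCSC-style names ("chr1") in column 1 of tab-separated input
# to the prefix-free style ("1"). Hypothetical helper.
# Assumption: the mitochondrial sequence is "chrM" on one side and "MT" on the other.
ucsc_to_b37_names() {
    awk 'BEGIN { FS = OFS = "\t" }
         $1 == "chrM" { $1 = "MT"; print; next }
         { sub(/^chr/, "", $1); print }'
}

# Made-up intervals, for illustration only.
printf 'chr1\t10000\t10100\nchrX\t500\t600\nchrM\t1\t50\n' | ucsc_to_b37_names
```

This only renames; it does not touch coordinates. As far as I know the mitochondrial sequences in the two sets are not the same, so intervals on that sequence should not be carried over by renaming alone.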


thank you for your help
