What is the difference between GRCh37 and hs37? And hg19?
3.2 years ago
juanfdelahoz ▴ 50

Hi! I've been struggling with the naming conventions of human reference genomes...

I know hg19 and GRCh37 are the same, but use different names for each chromosome.

I know b37 is only the 25 longest sequences from GRCh37 (1-22,X,Y,MT)

I know we are now on GRCh38 (or hg38) and should be using that one.

However, for some reason, researchers in human genomes still use hg19...

Now, I found a reference called hs37 and I don't understand where it comes from. And there's not a single place where all this mess is explained. And all Heng Li says is: "If you map reads to GRCh37 or hg19, use hs37-1kg" : |

Other organisms have smaller communities and their genomes are better standardized, but humans... omg!

Thanks!

juanfdelahoz not looking for grammar correction, but can you change "hg37" to "hs37" in title and tags?

The title originally has hs37 that I changed to hg37. I've changed it back now.

This is also an insightful piece from Heng Li:

http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

3.2 years ago
GenoMax 107k

While some of this is confusing for someone just starting out, there is order to the seemingly arcane nomenclature.

GRCh38/hg38 is the current release of the human genome. You should indeed be using this since it has been around for ~5 years at this point. You can find the data for it at NCBI's GRCh38 site.

GRCh37/hg37 is synonymous with hg19. You can find the information about this release at NCBI's GRCh37 site.

hs37 is a special genome reference prepared for the 1000 Genomes Project by this method. You can find that data here.

Ultimately GENCODE is the project responsible for managing human/mouse genome data. They provide the authoritative genome data that is used by everyone including NCBI/UCSC/Ensembl.

I recall there was an extensive discussion on differences between GRCh37 and hg19 somewhere. Pierre was involved, I think.

Probably this is the sequence archive for hs ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/ and ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/

Ultimately GENCODE is the organization responsible for managing human/mouse genome data. They provide the authoritative genome data that is used by everyone including NCBI/UCSC/Ensembl.

I believe you mean the Genome Reference Consortium manages the human and mouse genome data. GENCODE is an annotation group at EBI and is not part of the GRC, although the EBI is a member.

Project is a better designation for GENCODE. Correction made above. GRC releases genome builds while annotation is produced by GENCODE project members.

Why are hg17, hg18, hg19 followed by hg38 and not "hg20" as one would expect?

hg19 is equivalent to GRCh37. I recall reading somewhere that they decided to unify the version numbers for hg and GRCh conventions, and so now it is hg38/GRCh38.

They should have gone one step further and unified the references as well!

There is only one reference sequence. There are annotations that come from different sources.

With graph-based assemblies coming in the near future, reference sequences will gain a new complexity.

The hg names were created by UCSC and reflect the versions that were included in their browser. The correct names for the assemblies were designated by the creators and were always NCBI36, GRCh37 etc. When GRCh38 came out, UCSC agreed that their system of changing the assembly names was confusing, and decided to go with the correct numbering, but ultimately stuck with their hg prefixes.

I was under the impression there were slight differences. If it's just a different naming convention, are GRCh38 and hg38 interchangeable?

3.2 years ago
nikos.psonis ▴ 40

This is what I have found so far. Please correct me if I am wrong.

GRCh37 w/o patches includes the primary assembly (22 autosomal, X, Y, and non-chromosomal supercontigs) and alternate scaffolds, but not a reference mitogenome. Non-chromosomal supercontigs are the unlocalized and unplaced scaffolds.

The rCRS reference mitogenome in GRCh37 was included only after patch 2 (GRCh37.p2). This patch also included some fix and novel patches.

UCSC hg19 = GRCh37 w/o patches + African Yoruba mitogenome (not rCRS). Also UCSC hg19 has: Different naming conventions (e.g. chromosome X: chrX in UCSC vs. X in GRC). Different coordinate system (Start numbering a chromosome from 1 in UCSC vs. 0 in GRC).

Note also that Ion Torrent uses an hg19 with a replaced mitogenome (rCRS instead of the Yoruba sequence).
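The naming difference described above can be handled mechanically. Here is a minimal Python sketch, assuming only the 25 primary sequences need renaming; the helper names `ucsc_to_grc` and `grc_to_ucsc` are hypothetical. Scaffold names are deliberately rejected, because they do not follow a simple prefix rule. Renaming chrM to MT only changes the label: as noted above, the hg19 chrM sequence is not rCRS, so mitochondrial coordinates are not interchangeable.

```python
# Primary chromosome names in GRC/b37 style.
PRIMARY = [str(i) for i in range(1, 23)] + ["X", "Y"]

# Explicit lookup tables (UCSC style <-> GRC/b37 style).
UCSC_TO_GRC = {"chr" + name: name for name in PRIMARY}
UCSC_TO_GRC["chrM"] = "MT"  # label only; the sequences differ in hg19 vs b37
GRC_TO_UCSC = {grc: ucsc for ucsc, grc in UCSC_TO_GRC.items()}


def ucsc_to_grc(name):
    """Map a UCSC-style primary name (chr1, chrX, chrM) to GRC style (1, X, MT)."""
    if name not in UCSC_TO_GRC:
        raise ValueError("not a primary UCSC chromosome name: " + name)
    return UCSC_TO_GRC[name]


def grc_to_ucsc(name):
    """Map a GRC-style primary name (1, X, MT) to UCSC style (chr1, chrX, chrM)."""
    if name not in GRC_TO_UCSC:
        raise ValueError("not a primary GRC chromosome name: " + name)
    return GRC_TO_UCSC[name]
```

For example, `ucsc_to_grc("chrX")` gives `"X"` and `grc_to_ucsc("MT")` gives `"chrM"`, while a scaffold name raises a `ValueError` instead of being silently mangled.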

The b37 is hs37-1kg. It does not include only the "25 longest sequences from GRCh37 (1-22,X,Y,MT)"; it is a 1000 Genomes convention that includes:
- The 24 "relatively complete" chromosomal sequences (named "1" to "22", "X" and "Y") downloaded individually from ENSEMBL.
- The GRCh37.p2 (rCRS) mitochondrial sequence (named "MT") downloaded from MITOMAP or NCBI.
- The unlocalized sequences, which were named after their accession numbers, such as "GL000191.1", "GL000194.1", etc.
- The unplaced sequences, which were named after their accession numbers, such as "GL000211.1", "GL000241.1", etc.
Only the alternate loci were not included in the b37 dataset.

hs37d5 (known also as b37 + decoy) was released by The 1000 Genomes Project (Phase II), which introduced additional sequence (BAC/fosmid clones, HuRef contigs, Epstein-Barr Virus genome) to the b37 reference to help reduce false positives for mapping. Note that this one uses the primary assembly of GRCh37.p4 (not the one of GRCh37 w/o patches).

As for hs37 (without -1kg), I think it is generated only by bwakit in BWA, and according to their manual it corresponds to b37+EBV (Epstein-Barr Virus genome). The EBV genome is also found in hs37d5 and GRCh38; it is included because it is used in molecular biology for transformations and because it naturally infects B cells in ~90% of the world population.
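If you have a FASTA index (`.fai`, as written by `samtools faidx`) for an unknown GRCh37-era reference, the differences listed above suggest a rough way to guess which flavor it is. The following Python sketch is a heuristic, not an authoritative test. It assumes that chromosome 1 of GRCh37 is 249,250,621 bp, that the hg19 chrM is 16,571 bp (rCRS is 16,569 bp), and that the decoy and EBV contigs are named `hs37d5` and `NC_007605`; check those contig names against your own files. The function names are hypothetical.

```python
GRCH37_CHR1_LEN = 249250621  # length of chromosome 1 in GRCh37/hg19
HG19_CHRM_LEN = 16571        # hg19 chrM (Yoruba sequence); rCRS is 16569


def parse_fai(lines):
    """Parse tab-separated .fai lines into {contig_name: length}."""
    contigs = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            contigs[fields[0]] = int(fields[1])
    return contigs


def guess_flavor(contigs):
    """Heuristically label a GRCh37-era reference from its contig names/lengths."""
    chr1_len = contigs.get("chr1", contigs.get("1"))
    if chr1_len != GRCH37_CHR1_LEN:
        return "not GRCh37-based (or chromosome 1 missing)"
    if "chr1" in contigs:
        # UCSC-style names: plain hg19 carries the 16571 bp chrM.
        if contigs.get("chrM") == HG19_CHRM_LEN:
            return "hg19"
        return "hg19-like with a different chrM"
    # GRC-style names: check the most specific marker first, since
    # hs37d5 also contains the EBV contig (assumed name NC_007605).
    if "hs37d5" in contigs:
        return "hs37d5"
    if "NC_007605" in contigs:
        return "hs37"
    if "MT" in contigs:
        return "b37"
    return "GRCh37-style names without MT"


example = parse_fai(["1\t249250621\t52\t60\t61\n", "MT\t16569\t100\t60\t61\n"])
print(guess_flavor(example))
```

The order of the checks matters: the decoy contig is tested before the EBV contig because hs37d5 contains both.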

There is no hg37.

Different coordinate system (Start numbering a chromosome from 1 in UCSC vs. 0 in GRC).

Do you mind explaining this more since I'm reading that UCSC uses 0-based internally and 1-based for its genome browser? Are 0-based/1-based coordinate systems inherent to the reference file, or just the methods of identifying regions in a reference? For instance, if I download chr1 of hg19, couldn't positions be referred to w/ either coordinate system?

$ head references_hg19_chr1.fa
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
...


Great post by the way - very helpful :)
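On the coordinate question: a FASTA file is just a string of bases, so the 0-based/1-based distinction lives in the formats and tools that point into the sequence, not in the reference file itself. BED intervals are 0-based and half-open, while VCF, SAM and the positions shown in the UCSC browser are 1-based and closed. A minimal Python sketch of the conversion follows; the function names and the toy sequence are made up for illustration.

```python
def one_based_to_zero_based(start, end):
    """1-based closed [start, end] -> 0-based half-open [start - 1, end)."""
    return start - 1, end


def zero_based_to_one_based(start, end):
    """0-based half-open [start, end) -> 1-based closed [start + 1, end]."""
    return start + 1, end


def fetch_one_based(seq, start, end):
    """Return the bases at 1-based closed positions start..end of seq."""
    zero_start, zero_end = one_based_to_zero_based(start, end)
    return seq[zero_start:zero_end]  # Python slices are 0-based, half-open


toy = "ACGTNNACGT"  # toy stand-in for a chromosome sequence
# Both conventions address the same three bases of the same sequence.
assert fetch_one_based(toy, 2, 4) == toy[1:4] == "CGT"
```

Either convention can therefore be used with the same downloaded chr1; what matters is knowing which one a given file format or tool expects.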