Tutorial: Which human reference genome should I use?
gravatar for finswimmer
6 days ago by
finswimmer6.1k wrote:

"Which human reference genome should I use?" is one of the top questions arising on biostars. On the first look the answer to this seems to be quiet simple: Take the most recent major version from the primary source, which is GRCh38 and you can obtain it on the website of the Genome Reference Consortium.

But before you blindly go on and download the genome file let's clear the following questions:

  1. What are reasons to take an older release?
  2. What sequence information are included in the reference fasta?
  3. What about the latest minor version?
  4. Which steps are neccessary to prepare the reference fasta for alignment?

1. What are reasons to take an older release?

Before starting check the databases you plan to use in your experiment. Are the data available for the latest assembly? How much effort do I need to put in to move those data to the latest version if they are not available yet?

2. What sequence information are included in the reference fasta?

The files for the latest major release are located here: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38

The sequences are in GCA_000001405.15_GRCh38_genomic.fna.gz

This file contains the following sequence information:

  1. chromosomes 1-22, X, Y
  2. sequences that can be asigned to one chromosome, but not to an exact position or orientation (unlocalized sequences)
  3. sequences that cannot be assigned to any chromosome (unplaced sequences)
  4. the mitochondrial genome
  5. sequences that provides an alternate representation of a locus (alernate locus)

(1)-(3) build together the so-called Primary Assembly.

3. What about the latest minor version?

You may realize that there are several minor versions indicate by .p<version-number>. The latest of these is GRCh38.p12. So why I did not link to this version?

In a minor version the sequences included in the major one stays untouched. Instead there were sequences added called patches (That's where the p comes from). These patches could be fixes to existing sequences and will be incorporate into the primary assembly in the next major release. Or they represent new sequences, which will be called "alternate loci" in the next major release.

More about patches you can read in this FAQ.

Unless we have very good reasons for it, we only want to use the sequences of the primary assembly and the mitochondrial genome for alignment. Why? Read the next chapter.

4. Which steps are neccessary to prepare the reference fasta for alignment?

The most things I describe in this part are inspired by a blog post by Heng Li (the guy who gave us well known tools like bwa and samtools).

The reference sequence I've linked above is not ready-to-use for an alignment, due to this three things:

  1. It includes alternate representation of loci. These are highly similar to those in the primary assembly. That's why most (if not all) aligners will not know where to place our reads and will give them a very low mapping quality (which is bad for variant calling).
  2. On the chromosome Y there is a pseudo-autosomal region (PAR), which is essentially a copy of a region located on chromosome X. Again our aligners will not know where to place the reads.
  3. The sequences have GenBank Accession Numbers like CM000663.2 as names instead of 1.

The ambigous mapping described in (1) is (2) is also valid for the patch sequences. This is why we don't need to take the latest minor version for alignment.

So what we have to do is:

  1. Build a subset of the sequences provided by GRC
  2. Mask the PAR on chromosome Y
  3. Rename the sequences

To fulfill these taskes we need some more files

Build a subset of the sequences provided by GRC

After downloading the reference sequence unpack it and index with samtools faidx:

$ samtools faidx GCA_000001405.15_GRCh38_genomic.fna

From GCA_000001405.15_GRCh38_assembly_report.txt we need to extract the names for the primary assemblies and the mitochondrial genome. A lot of bioinformatic tools expact that sequences are ordered. So we have to sort the current names by the sequence names we want to use later.

$ sort -k1,1V GCA_000001405.15_GRCh38_assembly_report.txt|awk -v FS="\t" '$8 == "Primary Assembly" || $8 == "non-nuclear" {print $5}' > subset_ids.txt

With this list we can build the subset of the sequences:

$ samtools faidx GCA_000001405.15_GRCh38_genomic.fna -r subset_ids.txt -o GRCh38_subset.fa

Mask the PAR on chromosome Y

(EDIT: I had to rewrite this paragraph. In the former version I told to us the par_align.gff directly for masking. This was wrong, as this files contains the PAR on the X chromosome.)

The par_align.gffcontains the information where the PAR is located on the X chromosome and where its equivalent is located on the Y chromosome. Open the file and look for the Target keyword in each line. The values are the position on the Y chromosome we like to mask. To do this, we must create a bed file with theses values (Caution: gff has 1-based positions, bed has 0-based). In our case it's looking like this:

$ cat parY.bed
CM000686.2  10000   2781479
CM000686.2  56887902    57217415

For the lazy ones here a one-liner to get the bedfile ;)

$ sed -E 's/.*Target=([^;]+).*/\1/g' par_align.gff|awk -v OFS="\t" '$0 !~ "^#" {print $1, $2-1, $3}'  > parY.bed

The masking can be done with bedtools:

$ bedtools maskfasta -fi GRCh38_subset.fa -bed parY.bed -fo GRCh38_subset_masked.fa

Rename the sequences

Currently the header of each sequence looks like this:


What we like to have is this:

>1 CM000663.2 NC_000001.11 chr1

Everything until the first whitespace will be the id. What comes after that is just additional information.

$ awk -v FS="\t" 'NR==FNR {header[">"$5] = ">"$1" "$5" "$7" "$10; next} $0 ~ "^>" {$0 = header[$0]}1' GCA_000001405.15_GRCh38_assembly_report.txt GRCh38_subset_masked.fa > GRCh38_alignment.fa


Now GRCh38_alignment.fa is ready for use in alignment. I would recommend index this file with samtools faidx, as there are a lot of tools out there which need it. Also you need to index it in a way that is described for the aligner you like use. E.g. for bwa type: bwa index GRCh38_alignment.fa

fin swimmer

ADD COMMENTlink modified 4 days ago • written 6 days ago by finswimmer6.1k

Thanks for the hard work finswimmer !

Would have be cool to have some words about GRC genome, Ensembl genome, NCBI genome... And talk about associated annotations which could be a bit different. Also chromsome names could be different (chr1, 1, NC_000001.11...)

Could I be wrong but, is not it dangerous to delete non-primary sequences from reference genome if you are interested in specific base information, like variant discovery. Because even if a sequence is a patch, it exists. Let's say you sequenced this patch sequence of chr6, if you remove non-primary sequences from your reference genome the read will map on the chr6 itself, which could lead to misscall some variants in the patch region.

Never heard that the sequence on chrY could be a problem, do you have some documentation related to this and what to do with this location ?

ADD REPLYlink modified 6 days ago • written 6 days ago by Bastien Hervé2.2k

Hello Bastien Hervé ,

I've done some comparison between the reference genome from GRC and ensembl. When doing such a comparison these are the points one should consider:

  1. Which sequences are included?
  2. How are the sequence named?
  3. Are any regions hard- (or soft) masked?

The latest reference sequences from ensembl are available here: ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/dna/

The README describes what's the difference between the several files. I choose Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

First thing one should do, is to uncompress and index the file.

$ gunzip -c Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz > Homo_sapiens.GRCh38.dna.primary_assembly.fa
$ samtools index Homo_sapiens.GRCh38.dna.primary_assembly.fa

If you now take a look into the index file, you will find out that

  • It includes the sequences for chromosome 1-22, X, Y, MT, named without the prefix chr, ordered in lexicographically order.
  • These sequences ar followed by unplaced and unlocalized sequences named with their genebank accession number.

To check, wether there are sequence changes between the sequences provided by GRC und ensembl, I first sorted the sequences by sequence length and run than diff on them. The results are:

  • All bases in the ensembl file have capitial letters, whereas in the GRC file repeative sequences are soft-masked (meaning have lowercase).
  • The PAR on chromsome Y is hard-masked (meaning have N's)
  • Semi-ambiguous IUB codes were converted to “N” (these are very few, so this is a very minor issue)

In summary: The Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz provided by ensembl fulfill all criteria for a reference sequence for alignment I described in my initial post.

fin swimmer

ADD REPLYlink modified 1 day ago • written 1 day ago by finswimmer6.1k

Thanks finswimmer for the add on of the Ensembl reference genome.

The result of the diff command is interesting but i'm still not convinced by the mask of the PAR of the Y chromosome. If you want to do a counting on the PAR of the X chromosome for a male patient your results will be wrong. The real reads from the PAR of the Y chromosome will be force to align on the PAR of the X chromosome, which can lead to false positive counts

Also, I'm interested in your opinion about my second point (deleting non-primary sequences for SNP discovery) :)

ADD REPLYlink modified 1 day ago • written 1 day ago by Bastien Hervé2.2k


I don't forget your other questions. I'm working on it to provide the best possible answer ;)

ADD REPLYlink written 1 day ago by finswimmer6.1k

Why one shouldn't include ALT_Loci or Patch sequences in alignment reference?

As a rule of thumb: If you never heard about ALT_Loci or patch sequences, than it is very likely you don't need them :)

ALT_Loci sequences describes alternative sequence version for a specific region. There seems to be no clear definition, when it becomes an alternative (and get it's own identification number) and when it is just a variant and is therefor just reported in a variant database.

The sequences are often padded by long stretches of bases that are identical with the primary assembly. This is a problem when using a mapper/aligner that is not ALT_Loci aware. Reads will get a very low mapping quality as they can map to the primary assembly and the ALT_Loci. Saying this you are unable to do variant calling for this regions, as reads that can map to multiple position are usualy dropped.

There are ALT aware aligner out there, e.g. bwa. I've found a tutorial here. In the special case, you are interested in a loci, where you know there ALT sequences, one can use this. In the most cases this is not manageable, as there are no mainstream tools (that I be aware) for variant calling that be aware of the alternate loci.

All the things above are valid for patches as well. Here the differences to the primary assembly can be much smaller.

fin swimmer

ADD REPLYlink modified 5 hours ago • written 5 hours ago by finswimmer6.1k

Some typo :)

Part 3) The latest of these ist

Part4) Therefor most

ADD REPLYlink written 6 days ago by Bastien Hervé2.2k

Fixed :)

Concerning your questions: I'm short in time now. Will answers later (if no one else does ;) ) For the PAR you could start reading on wikipedia in the meantime.

ADD REPLYlink written 6 days ago by finswimmer6.1k

The "you could start reading on wikipedia" sound like "search on google yourself" to me :(

I don't really need information about the region, I'm just wondering why you mask this part and not all the duplicated regions of the human genome ?

ADD REPLYlink written 6 days ago by Bastien Hervé2.2k

Hello Bastien Hervé ,

I'm very sorry if it sounded that harsh. It wasn't meant this way. I really like your comments.

I especially mask this part because it's quite huge and it included genes (that are also of clinical interests). And in case of a female sample you can be sure that all reads have it's origin from chromosome X. So a low mapping quality because of the duplicate region on chromosome Y would be false.

fin swimmer

ADD REPLYlink modified 6 days ago • written 6 days ago by finswimmer6.1k

Maybe I'm diverging and nitpicking...

Let's imagine you got a female patient XY without SRY gene, it is very rare OK, but could append (here we are talking about gender on biostars :D)... I don't know if a karyotype is created before the acceptation of a patient. If we are not talking about human, I don't know if this event is frequent in mouse by exemple.

For full female patient OK you can mask this part of the Y (if we do not care about genetic anomaly). But for both gender or full man patient pool, if you mask the Y part, reads will map to the X PAR locus whereas half the reads could belongs to X and the other half to Y. If you don't want these reads you have to mask both location (PAR on X and Y).

Anyhow if I remember well, aligners do not output reads mapped on multiple positions, which is the case for reads in PAR locus.

ADD REPLYlink modified 6 days ago • written 6 days ago by Bastien Hervé2.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2200 users visited in the last hour