Question: Looking for human reference genome with ancestral SNP alleles
22 months ago by
Scott80 wrote:

Hi. I am looking for reference human genome fasta files (preferably hg38) where the SNP alleles contributing to the reference sequence are always the ancestral alleles, not the "reference" alleles. Any insight into how reference alleles are assigned or became the reference alleles would also be appreciated.

My other option is to edit the SNP loci in the reference genome to the ancestral alleles fetched from the Ensembl variation database, but I thought I would check whether the FASTA files I am looking for already exist. I know an hg38 with SNPs coded by the IUPAC ambiguity codes exists. I also came across a "common ancestor" (presumably between chimps and humans) build, but this is not exactly what I need.
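For illustration, the editing approach mentioned above could look roughly like the sketch below. It is a minimal sketch under stated assumptions, not an established tool: the function name `apply_ancestral_alleles`, the in-memory dict of sequences, and the `(chrom, pos, ref, ancestral)` record layout are all hypothetical, and positions are assumed to be 1-based. Loading the FASTA and fetching ancestral alleles from Ensembl are left out.

```python
def apply_ancestral_alleles(sequences, snps):
    """Substitute ancestral alleles at single-base SNP loci.

    sequences: dict mapping chromosome name -> sequence string (assumed layout)
    snps: iterable of (chrom, pos, ref, ancestral) tuples, pos is 1-based (assumed)
    Returns (edited_sequences, skipped_records).
    """
    # Work on lists of characters so single bases can be replaced in place.
    edited = {name: list(seq) for name, seq in sequences.items()}
    skipped = []
    for chrom, pos, ref, ancestral in snps:
        seq = edited.get(chrom)
        valid = (
            seq is not None
            and 1 <= pos <= len(seq)
            and len(ref) == 1
            and len(ancestral) == 1
            and ancestral.upper() in "ACGT"
            # Only edit when the sequence really carries the stated reference allele.
            and seq[pos - 1].upper() == ref.upper()
        )
        if not valid:
            skipped.append((chrom, pos, ref, ancestral))
            continue
        # Keep the original case so soft-masked (lowercase) regions stay soft-masked.
        was_lower = seq[pos - 1].islower()
        seq[pos - 1] = ancestral.lower() if was_lower else ancestral.upper()
    return {name: "".join(chars) for name, chars in edited.items()}, skipped


# Toy usage with made-up data:
toy_genome = {"chr1": "ACGTacgt"}
toy_snps = [("chr1", 2, "C", "T"), ("chr1", 5, "A", "G"), ("chr1", 3, "A", "C")]
new_genome, skipped = apply_ancestral_alleles(toy_genome, toy_snps)
```

Records whose stated reference allele does not match the sequence, or whose ancestral allele is unknown or not a single base, are returned in `skipped` rather than applied, so they can be inspected separately.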


ensembl snp hg38 assembly genome • 929 views
22 months ago by
Emily_Ensembl21k wrote:

I don't know of a file like the one you describe, but I can tell you where the reference alleles came from. The reference alleles are whatever was in the individual who was sequenced to give that bit of genome. To make the reference genome, a bunch of people had their DNA extracted, cut into chunks and cloned into BACs (bacterial artificial chromosomes). The BACs were grown up in bacterial cultures and sequenced using Sanger sequencing. The BACs were then tiled together to create chromosomes, using sequence overlap and known gene positions. This was done with minimal overlap, so only one BAC is used to give the genome sequence at any given position. This means that every part of the genome is somebody's real genomic sequence. The reference allele at any position, therefore, is whatever that person happened to have. This is why some reference alleles are rare or private alleles: the real person whose genome they came from had rare or private alleles.


Thanks, Emily!

This is roughly what I thought. I just wasn't sure if any work had gone into individual reference allele re-assignment based on SNP data in more recent builds such as hg38.

Reply written 22 months ago by Scott80

It has been improved in GRCh38, but it's still not perfect. At various points in GRCh38 you will see very small contigs (the bits of the BACs that were included in the genome) in the middle of a larger contig. This is where the GRC used 1000 Genomes data to identify that the reference allele was rare/private, so they replaced part of the old contig with a new one that had the more common allele. This means that the reference allele at those loci is flipped compared to GRCh37. They did it all manually, so they didn't complete it, and there are still many loci where the reference is rare/private.
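As a rough illustration of what "flipped" means in practice, the sketch below lists variants whose reference allele differs between two builds. It is only a sketch under assumptions: the function name `find_flipped_reference_alleles`, the dicts keyed by variant ID, and the variant IDs in the example are hypothetical, and both inputs are assumed to report alleles on the same strand.

```python
def find_flipped_reference_alleles(ref_b37, ref_b38):
    """Return sorted variant IDs whose reference allele differs between builds.

    ref_b37, ref_b38: dicts mapping variant ID -> reference allele (assumed layout).
    Variants missing from either build are ignored.
    """
    shared_ids = set(ref_b37) & set(ref_b38)
    return sorted(
        variant_id
        for variant_id in shared_ids
        # Compare case-insensitively so soft-masked bases do not count as flips.
        if ref_b37[variant_id].upper() != ref_b38[variant_id].upper()
    )


# Toy usage with made-up variant IDs and alleles:
b37 = {"var1": "A", "var2": "C", "var3": "G"}
b38 = {"var1": "a", "var2": "T", "var4": "G"}
flipped = find_flipped_reference_alleles(b37, b38)
```

Genotype data coded relative to the reference allele would need its allele coding checked at any locus reported here before moving between builds.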

Reply written 22 months ago by Emily_Ensembl21k

Sounds good. Thanks for the details. Appreciate it!

Reply written 22 months ago by Scott80
15 months ago by
kevin.esoh0 wrote:

Very helpful thread! Thanks for the detailed response. It has been particularly challenging working with genotype data in b37 coordinates due to this rare/ancestral allele assignment.
