Looking for human reference genome with ancestral SNP alleles
5.4 years ago
Scott ▴ 110

Hi. I am looking for human reference genome FASTA files (preferably hg38) in which the base at every SNP locus is the ancestral allele rather than the "reference" allele. Any insight into how reference alleles are assigned, or how they came to be the reference alleles, would also be appreciated.

My other option is to edit the SNP loci in the reference genome to the ancestral alleles fetched from the Ensembl variation database, but I thought I would check whether the FASTA files I am looking for already exist. I know an hg38 with SNPs coded by the IUPAC ambiguity codes exists. I also came across a "common ancestor" build (presumably of chimpanzees and humans), but this is not exactly what I need.
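For anyone taking the "edit the SNP loci" route, here is a minimal sketch of the substitution step in plain Python. It assumes the ancestral alleles have already been exported into `(chrom, pos, ref, ancestral)` tuples with 1-based positions; the function name, the tuple layout and the toy sequence are all hypothetical, and this is not an Ensembl API. Loci are skipped when the stated reference allele does not match the sequence or when the ancestral allele is not a single A/C/G/T base.

```python
VALID_BASES = {"A", "C", "G", "T"}


def apply_ancestral_alleles(sequences, snps):
    """Substitute ancestral alleles into reference sequences.

    sequences: dict of chromosome name -> sequence string
    snps: iterable of (chrom, pos, ref, ancestral), pos is 1-based
          (hypothetical layout, an assumption of this sketch)
    Returns (edited_sequences, skipped), where skipped lists the loci
    that were left untouched and why.
    """
    edited = {chrom: list(seq) for chrom, seq in sequences.items()}
    skipped = []
    for chrom, pos, ref, ancestral in snps:
        bases = edited.get(chrom)
        idx = pos - 1  # convert 1-based position to 0-based index
        if bases is None or not 0 <= idx < len(bases):
            skipped.append((chrom, pos, "position not in sequence"))
            continue
        current = bases[idx]
        # Sanity check: the SNP record must agree with the FASTA base.
        if current.upper() != ref.upper():
            skipped.append((chrom, pos, "reference allele mismatch"))
            continue
        # Only unambiguous single-base ancestral calls are applied.
        if ancestral.upper() not in VALID_BASES:
            skipped.append((chrom, pos, "no usable ancestral allele"))
            continue
        # Preserve soft-masking (lowercase) of the original base.
        bases[idx] = ancestral.lower() if current.islower() else ancestral.upper()
    return {chrom: "".join(b) for chrom, b in edited.items()}, skipped


# Toy data, not real genomic coordinates.
toy_reference = {"chr1": "ACGTacgt"}
toy_snps = [
    ("chr1", 2, "C", "T"),  # applied
    ("chr1", 6, "C", "G"),  # applied, stays lowercase
    ("chr1", 3, "A", "T"),  # skipped: sequence has G, not A
    ("chr1", 4, "T", "N"),  # skipped: ancestral allele unknown
]
edited, skipped = apply_ancestral_alleles(toy_reference, toy_snps)
print(edited["chr1"])
print(len(skipped), "loci skipped")
```

The reference-allele check matters because a coordinate or build mismatch would otherwise silently write bases into the wrong positions. Indels and multi-base variants are deliberately left out of this sketch.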

Thanks!

hg38 genome Assembly SNP Ensembl • 3.4k views
5.4 years ago
Emily 23k

I don't know of a file like the one you describe, but I can tell you where the reference alleles came from. The reference alleles are whatever was in the individual who was sequenced to give that bit of genome. To make the reference genome, a bunch of people had their DNA extracted, cut into chunks and cloned into BACs (bacterial artificial chromosomes). The BACs were grown up in bacterial cultures and sequenced using Sanger sequencing. The BACs were then tiled together to create chromosomes, using sequence overlap and known gene positions. This was done with minimal overlap, so only one BAC is used to give the genome sequence at any given position. This means that every part of the genome is somebody's real genomic sequence. The reference allele at any position, therefore, is whatever that person happened to have. This is why some reference alleles are rare or private alleles: the real person whose genome they came from had rare or private alleles.


Thanks, Emily!

This is roughly what I thought. I just wasn't sure if any work had gone into individual reference allele re-assignment based on SNP data in more recent builds such as hg38.


It has been improved in GRCh38, but it's still not perfect. At various points in GRCh38 you will see very small contigs (the bits of the BACs that were included in the genome) in the middle of a larger contig. This is where the GRC used 1000 Genomes data to identify that the reference allele was rare/private, so they replaced part of the old contig with a new one that had the more common allele. This means that the reference allele at those loci is flipped compared to GRCh37. They did it all manually, so the work is incomplete and there are still many loci where the reference is rare/private.
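To illustrate what a flipped reference allele means for genotype data moved between builds, here is a small sketch. It assumes a biallelic SNP, alleles reported on the same strand in both builds, and genotypes stored as counts of the non-reference allele (0, 1 or 2); the function name and the example alleles are hypothetical.

```python
def recode_for_new_reference(old_ref, old_alt, new_ref, alt_counts):
    """Recode alt-allele counts when the reference allele may have flipped.

    old_ref, old_alt: alleles in the old build (e.g. GRCh37)
    new_ref: reference allele at the same locus in the new build
    alt_counts: per-sample counts (0, 1, 2) of old_alt
    Returns (new_alt, recoded_counts).
    Assumes a biallelic SNP on the same strand in both builds.
    """
    if new_ref == old_ref:
        # Reference unchanged: counts carry over as they are.
        return old_alt, list(alt_counts)
    if new_ref == old_alt:
        # Reference flipped: the old reference is now the alternate allele,
        # so a sample with d copies of old_alt has 2 - d copies of old_ref.
        return old_ref, [2 - d for d in alt_counts]
    raise ValueError("new reference allele matches neither old allele")


# Toy example: the old build has ref A / alt G, the new build has ref G.
new_alt, counts = recode_for_new_reference("A", "G", "G", [0, 1, 2, 2])
print(new_alt, counts)
```

A locus where the new reference matches neither allele (for example because of a strand difference) raises an error here rather than being recoded silently.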


Sounds good. Thanks for the details. Appreciate it!

4.8 years ago
Esoh • 0

Very helpful thread! Thanks for the detailed response. It has been particularly challenging working with genotype data in b37 coordinates due to this rare/ancestral allele assignment.

2.5 years ago
ram.glez • 0

Hi Scott!

Did you ever find a file for the human ancestral alleles in hg38?

I can't seem to find one.

Cheers
