Question

Inferring Local Ethnicity of the Reference Genome

7.6 years ago

anovak ▴ 110

I'm doing a project on reference bias, where it would be convenient to be able to assign regions of the GRCh38 reference assembly to the 1000 Genomes superpopulation(s) that they most closely resemble. If I could do this, I would be able to look for mapping bias in each region towards the superpopulation(s) that the assembled sequence for the region was most representative of.

I've been told there is existing work out there that has gone through and made ethnicity or source population inferences about different parts of the reference assembly. Has anyone seen such a paper? I can't find anything on Google Scholar. I can find references saying that the majority of the assembly is from RP11, who we think is of African-American ancestry (and probably would have been placed in the AFR superpopulation, if "Americans of African Ancestry in Buffalo, New York" was a population covered by 1000 Genomes), but for the parts that aren't RP11 clones, I don't have any information.

Would I be better off using something like STRUCTURE to try and pull out actual shared haplotypes between the reference and the 1000 Genomes samples?

ethnicity inferrence population reference • 1.9k views

18 months ago by anovak ▴ 110

score 0 · Answer 1 · 2022-10-25

I no longer think this makes sense to try. People who aren't in any of the 1000 Genomes population categories don't properly belong in the superpopulation categories either, so any inferred painting of the assembly wouldn't mean much. The superpopulations also don't differ enough in allele frequencies to produce much mapping bias between them. Really, the whole problem makes the most sense at the variant level, since mapping bias is towards the allele represented in the reference and away from the unrepresented alleles. Instead of trying to paint the chromosome, you want to plot read support for reference and non-reference alleles as a function of indel length.
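
To make that last suggestion concrete, here is a minimal sketch of the tabulation behind such a plot. It assumes you have already extracted, for each heterozygous site, the reference allele, the alternate allele, and the number of reads supporting each; the tuple layout, the function names, and the example counts are all made up for illustration and don't come from any particular tool. With no mapping bias the reference fraction at heterozygous sites should sit near 0.5, so any drift away from 0.5 as indels get longer is the signal of interest.

```python
from collections import defaultdict


def signed_indel_length(ref, alt):
    """Positive for insertions, negative for deletions, 0 for same-length substitutions."""
    return len(alt) - len(ref)


def ref_fraction_by_indel_length(sites):
    """Pool read counts per signed indel length and return the reference fraction.

    sites: iterable of (ref_allele, alt_allele, ref_reads, alt_reads) tuples,
    assumed to come from heterozygous sites (a hypothetical input layout).
    """
    totals = defaultdict(lambda: [0, 0])  # length -> [ref reads, alt reads]
    for ref, alt, ref_reads, alt_reads in sites:
        bucket = totals[signed_indel_length(ref, alt)]
        bucket[0] += ref_reads
        bucket[1] += alt_reads
    return {
        length: ref_n / (ref_n + alt_n)
        for length, (ref_n, alt_n) in sorted(totals.items())
        if ref_n + alt_n > 0  # skip lengths with no read support at all
    }


# Made-up counts, for illustration only.
example_sites = [
    ("A", "G", 50, 50),     # SNV
    ("A", "ATT", 30, 20),   # 2 bp insertion
    ("ACGT", "A", 45, 15),  # 3 bp deletion
    ("A", "ATT", 30, 20),   # another 2 bp insertion
]

fractions = ref_fraction_by_indel_length(example_sites)
for length, fraction in fractions.items():
    print(f"indel length {length:+d}: reference fraction {fraction:.2f}")
```

The resulting dictionary can be handed to whatever plotting tool you prefer, with indel length on the x axis and reference fraction on the y axis. Pooling reads across sites before dividing keeps low-coverage sites from dominating a bin; averaging per-site fractions instead would be a different, equally defensible choice.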