Programming Challenge: Quickest Way To Determine The "Superpopulation" From A VCF?
8.8 years ago

Given an exome or targeted human VCF of one or more samples, I need a program to determine the "superpopulation" of each sample, as listed here:

ASN EUR AFR AMR SAN


The program should return a single three-letter code for each sample.

Submissions will be judged on speed using 10 randomly selected subsets of 1KG samples - you cannot count on any "crucial" regions being covered.

Each "miss" will incur a penalty that is effectively 50% of the best time in the next-best tier (for example, a single missed call will tack on half the entire time it took to call all 10 correctly).

So what am I allowed, if I cannot count on any specific region being there? How targeted could it be? Clearly some target regions will be uninformative...

Sometimes we receive targeted resequencing samples that are, for example, just a bunch of cardiac genes. I would still like to make a guess as to the superpopulation.

8.8 years ago

Actually, my previous comment is an attempt at an answer, so here it is (will try and delete the comment):

Well, in general I would expect it not always to be possible. I would take the 1000 Genomes SNP calls in your gene(s), do a PCA (colouring each sample by population), and see whether the superpopulations are evident in the PCA. If yes, it's very cheap to project your sample onto the same principal components and see where it lies compared with the 1000 Genomes populations. That's what I'd do, but I'm not an expert on that type of thing!
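To make the PCA idea a bit more concrete, here is a rough sketch in Python with NumPy. It is not a tested pipeline: it assumes the reference panel and the query have already been reduced to the same variants and encoded as alternate-allele counts (0/1/2), it uses a simple nearest-centroid rule in PC space (my choice for illustration, not something established above), and the data in `demo` is simulated toy data, not real 1000 Genomes calls. All function names are made up for this example.

```python
import numpy as np


def fit_pca(genotypes, n_components=2):
    """Fit PCA on a (samples x variants) matrix of alt-allele counts."""
    mean = genotypes.mean(axis=0)
    centered = genotypes - mean
    # Rows of vt are the principal axes of the centered reference panel.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def project(genotypes, mean, axes):
    """Project samples onto previously fitted principal axes."""
    return (genotypes - mean) @ axes.T


def nearest_centroid(ref_pcs, ref_labels, query_pcs):
    """Assign each query sample the label of the closest population centroid."""
    ref_labels = np.asarray(ref_labels)
    labels = sorted({str(lab) for lab in ref_labels})
    centroids = np.array(
        [ref_pcs[ref_labels == lab].mean(axis=0) for lab in labels]
    )
    # Pairwise Euclidean distances: (n_query, n_populations).
    dists = np.linalg.norm(query_pcs[:, None, :] - centroids[None, :, :], axis=2)
    return [labels[i] for i in dists.argmin(axis=1)]


def demo(seed=0, n_variants=200, n_per_pop=50):
    """Toy run on SIMULATED genotypes (hypothetical allele frequencies)."""
    rng = np.random.default_rng(seed)
    pops = ["AFR", "EUR", "ASN"]
    # One made-up allele-frequency vector per population.
    freqs = rng.uniform(0.05, 0.95, size=(len(pops), n_variants))

    # Simulated reference panel: n_per_pop samples per population.
    ref = np.vstack(
        [rng.binomial(2, f, size=(n_per_pop, n_variants)) for f in freqs]
    ).astype(float)
    ref_labels = np.repeat(pops, n_per_pop)

    # One simulated query sample per population, in the order of `pops`.
    query = np.vstack(
        [rng.binomial(2, f, size=(1, n_variants)) for f in freqs]
    ).astype(float)

    mean, axes = fit_pca(ref, n_components=2)
    ref_pcs = project(ref, mean, axes)
    query_pcs = project(query, mean, axes)
    return nearest_centroid(ref_pcs, ref_labels, query_pcs)


if __name__ == "__main__":
    print(demo())
```

With real data you would first check, as described above, whether the superpopulations actually separate in the reference PCA for your target regions; if they overlap, a nearest-centroid call like this would be little better than a guess.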

Automate and implement

You're quite right, you asked for a program, not a description of how to do it. I don't have time to do this now though, so I'll bow out of the rest of this discussion.