Programming Challenge: Quickest Way To Determine The "Superpopulation" From A VCF?
8.8 years ago

Given an exome or targeted human VCF of one or more samples, I need a program to determine the "superpopulation" of each sample, as listed here:

ASN EUR AFR AMR SAN


The program should return a single three-letter code for each sample.

Submissions will be judged on speed using 10 randomly selected subsets of 1KG samples - you cannot count on any "crucial" regions being covered.

Each "miss" will result in a penalty that is effectively 50% of the best time for the next-best tier (a single missed call will add half of the entire time it took to call all 10 correctly).


So what am I allowed, if I cannot count on any specific region being there? How targeted could it be? Clearly some target regions will be uninformative...


Sometimes we receive targeted resequencing samples that are, for example, just a bunch of cardiac genes. I would still like to make a guess as to the superpopulation.

8.8 years ago

Actually, my previous comment is an attempt at an answer, so here it is (will try and delete the comment):

Well, in general I would not expect this always to be possible. I would take the 1000 Genomes SNP calls in your gene(s), do a PCA (colouring each sample by population), and see whether the superpopulations are evident in the PCA. If they are, it's very cheap to do a quick PCA for your sample and see where it lies compared with the 1000G populations. That's what I'd do, but I'm not an expert on that type of thing!
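For what it's worth, here is a minimal Python sketch of that idea, not a tuned or benchmarked submission. It assumes the genotypes have already been pulled out of the VCFs into NumPy arrays of alternate-allele counts (0/1/2) over the same variants for the reference panel and the query samples; the VCF parsing itself is left out. The function names (`fit_pca`, `project`, `assign_superpopulation`) are made up for illustration, and the nearest-centroid step is just one simple way to turn "see where it lies" into a label.

```python
import numpy as np


def fit_pca(ref_genotypes, n_components=2):
    """Fit a PCA on the reference panel.

    ref_genotypes: (n_samples, n_variants) array of alt-allele counts (0/1/2).
    Returns the per-variant mean and the top principal axes.
    """
    mean = ref_genotypes.mean(axis=0)
    centered = ref_genotypes - mean
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def project(genotypes, mean, axes):
    """Project samples onto the principal axes learned from the reference."""
    return (genotypes - mean) @ axes.T


def assign_superpopulation(query, ref_genotypes, ref_labels, n_components=2):
    """Label each query sample with the superpopulation whose reference
    centroid is closest in PC space (hypothetical helper, nearest centroid)."""
    ref_labels = np.asarray(ref_labels)
    mean, axes = fit_pca(ref_genotypes, n_components)
    ref_pcs = project(ref_genotypes, mean, axes)
    query_pcs = project(np.atleast_2d(query), mean, axes)

    labels = sorted(set(ref_labels.tolist()))
    centroids = np.array(
        [ref_pcs[ref_labels == lab].mean(axis=0) for lab in labels]
    )
    # Euclidean distance from every query sample to every centroid.
    dists = np.linalg.norm(
        query_pcs[:, None, :] - centroids[None, :, :], axis=2
    )
    return [labels[i] for i in dists.argmin(axis=1)]


if __name__ == "__main__":
    # Simulated data only: random allele frequencies stand in for a real
    # 1000 Genomes panel restricted to the targeted regions.
    rng = np.random.default_rng(0)
    freqs = {p: rng.uniform(0.05, 0.95, 200) for p in ("AFR", "EUR", "ASN")}
    ref = np.vstack([rng.binomial(2, freqs[p], size=(30, 200)) for p in freqs])
    ref_labels = [p for p in freqs for _ in range(30)]
    sample = rng.binomial(2, freqs["EUR"], size=200)
    print(assign_superpopulation(sample, ref, ref_labels))
```

Whether this works on a small target (a handful of cardiac genes, say) depends on whether the reference samples actually separate in the PCA for those variants, so it is worth plotting the reference PCs first. More components than two may also be needed to separate five superpopulations.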


Automate and implement


You're quite right, you asked for a program, not a description of how to do it. I don't have time to do this now though, so I'll bow out of the rest of this discussion.