Question: plink PED file - randomly select allele at heterozygous site and convert to homozygous
gravatar for
2.1 years ago by
stolarek.ir550 wrote:

Hi all,

I am working with ancient DNA, and currently doing PCA analysis of my data. Everything seemed fine, till I did positive control with one low coverage sample, and it was placed somewhere completely out of common sense and its known origin.

I digged, that when working with data that has lots of missing SNPs, and generaly is of low coverage, I am supposed at heterozygous sites in my data set (for all individuals) in PED file (plink) to randomly select one of the alleles and make this site homozygous.

I see this point as my main deviation and possible explanation for what I see.

Has anyone already tackled this problem, or am I left with writing my own tool for this?

Kind regards,


genotype snp plink adna ped • 893 views
ADD COMMENTlink modified 15 months ago by zf1992lss0 • written 2.1 years ago by stolarek.ir550
gravatar for zf1992lss
15 months ago by
zf1992lss0 wrote:


Have your problem been solved? now I have the same problem. I am working with ancient DNA,too.Doing PCA analysis and structure analysis of my data (include 5 samples with low coverage),and they were placed somewhere out of common sense. Smartpca (ancient projection) could solved the problem ,but I do not know how to do .And how do you deal with the missing location(SNPs), directly delete them ? or change them to "zero" in PED file(PLINK)?

Please forgive my poor English and kind regards,


ADD COMMENTlink written 15 months ago by zf1992lss0

yea I wrote my own thing, gonna look for it and send it to you

2016-12-09 3:41 GMT+01:00 zf1992lss on Biostar

ADD REPLYlink written 15 months ago by stolarek.ir550
import random
import itertools
#infile = open("/data/workspace/PCA/human_origins/PUBLICATION_1/recode12/PCA_SMARTPCA_recode12_subset.ped","r")

#infile = open("/data/workspace/PCA/human_origins/PUBLICATION_1/IBD/test_keep_ind.ped","r")

#outfile = open("/data/workspace/PCA/human_origins/PUBLICATION_1/recode12/PCA_SMARTPCA_new_genotypes.ped","w")

check_11 = ('1','1')
check_22 = ('2','2')
def random_gen():
    choice_m = [check_11,check_22]
    res = random.choice(choice_m)
    return res

def parse_ped(ped_infile):
    check_12 = ('1','2')
    check_21 = ('2','1')

    for line in ped_infile:
        row = line.split()
        head = row[0:6]
        genotype_old = row[6:]
        genotype_f = row[6:]
        it = iter(genotype_f)
        genotype_t = zip(it, it)
        for n,i in enumerate(genotype_t):
            if i == check_12 or i == check_21:
        print genotype_t[0:10]
        genotype_t = list(itertools.chain(*genotype_t))
        print genotype_t[0:10]

        new_line = head+genotype_t
        outfile.write(" ".join(new_line)+"\n")


So this code works on ped file, when they are coded as 0,1,2. Mainly getting hets to homozygotes, doesn't really affect the results for PCA all that much, but for the formal statistics it is a desirable preprocessing step.

Most probably when your samples are completely nonsensical, it's either you have messed up the strandeness of the SNPs, so now nothing matches, or the low coverage causes sth. called "shrinkage problem". To work around shrinkage and have really good PCA, is to run the smartpca and projection using your high coverage reference samples, and when selecting samples to be projected, you select the low coverage ancient individuals (your samples) and also reference samples, for which you artifically reduce the number of SNPs <- this worked for my at least. For the missing SNPs, if they are only missing in your samples, cause the coverage sucks, well that's ok, you just need to run smartpca and do the projection. However if genotyping rate for those SNPs is generally low across your reference samples you are better off deleting them. And about ld prunning <- it turns out that it's better to have more SNPs, than to delete half of them, cause they are in LD.

ADD REPLYlink written 15 months ago by stolarek.ir550
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 960 users visited in the last hour