plink PED file - randomly select allele at heterozygous site and convert to homozygous
7.0 years ago
stolarek.ir ▴ 690

Hi all,

I am working with ancient DNA and am currently doing a PCA analysis of my data. Everything seemed fine until I ran a positive control with one low-coverage sample, and it was placed somewhere that made no sense given its known origin.

I dug around and found that when working with data that has many missing SNPs and is generally of low coverage, I am supposed to go through the heterozygous sites in my PED file (PLINK), for all individuals, randomly select one of the two alleles, and make each such site homozygous.

I see this as the main deviation in my pipeline and a possible explanation for what I see.

Has anyone already tackled this problem, or am I left with writing my own tool for it?

Kind regards,

me

6.1 years ago
zf1992lss ▴ 10

hi

Has your problem been solved? I now have the same problem. I am also working with ancient DNA, doing PCA and STRUCTURE analyses of my data (including 5 low-coverage samples), and the samples were placed somewhere that makes no sense. Smartpca (ancient projection) could solve the problem, but I do not know how to do it. Also, how do you deal with the missing locations (SNPs): delete them directly, or change them to zero in the PED file (PLINK)?

Please forgive my poor English and kind regards,

me
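For context on the missing-data question: PLINK codes a missing genotype in a PED file as 0 0, and SNPs with too many missing calls can be dropped with PLINK's --geno filter. As a rough sketch (the helper name and toy data below are made up for illustration), you can measure per-SNP missingness yourself before deciding whether to delete a site:

```python
def snp_missingness(ped_lines):
    # fraction of individuals with a missing call ("0") at each SNP
    counts = None
    n_samples = 0
    for line in ped_lines:
        fields = line.split()
        alleles = fields[6:]  # first 6 columns are sample metadata
        pairs = list(zip(alleles[::2], alleles[1::2]))
        if counts is None:
            counts = [0] * len(pairs)
        n_samples += 1
        for i, (a, b) in enumerate(pairs):
            if a == '0' or b == '0':
                counts[i] += 1
    return [c / n_samples for c in counts]

# toy PED: two individuals, two SNPs; IND2 is missing at SNP 2
ped = [
    "FAM1 IND1 0 0 1 -9 1 1 1 2",
    "FAM1 IND2 0 0 1 -9 2 2 0 0",
]
print(snp_missingness(ped))  # → [0.0, 0.5]
```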


Yeah, I wrote my own script for this; I'll look for it and send it to you.

2016-12-09 3:41 GMT+01:00 zf1992lss on Biostar mailer@biostars.org:

import random
import itertools

# Homozygous genotypes for a PED file recoded to 1/2 alleles (plink --recode12)
CHECK_11 = ('1', '1')
CHECK_22 = ('2', '2')

def random_gen():
    # randomly pick one of the two homozygous genotypes
    return random.choice([CHECK_11, CHECK_22])

def parse_ped(ped_infile, outfile):
    check_12 = ('1', '2')
    check_21 = ('2', '1')

    for line in ped_infile:
        row = line.split()
        genotype_f = row[6:]                  # genotype fields start at column 7
        it = iter(genotype_f)
        genotype_t = list(zip(it, it))        # pair the two alleles of each site
        for n, i in enumerate(genotype_t):
            if i == check_12 or i == check_21:
                genotype_t[n] = random_gen()  # heterozygote -> random homozygote
        # flatten the allele pairs back into single fields
        genotype_flat = list(itertools.chain(*genotype_t))
        new_line = row[:6] + genotype_flat
        outfile.write(" ".join(new_line) + "\n")

# adjust the paths to your own files, e.g.
# infile = open("/data/workspace/PCA/human_origins/PUBLICATION_1/recode12/PCA_SMARTPCA_recode12_subset.ped", "r")
infile = open("input.ped", "r")
outfile = open("output_random_homozygous.ped", "w")
parse_ped(infile, outfile)
infile.close()
outfile.close()


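A quick illustration of the core idea in the script above, as a standalone sketch (the function name and toy pairs are made up for the demo): heterozygous 1/2 pairs become random homozygotes, while homozygous and missing pairs pass through unchanged.

```python
import random

def randomize_het(pair):
    # hypothetical helper: replace a heterozygous 1/2 allele pair
    # with a randomly chosen homozygote; leave everything else alone
    if set(pair) == {'1', '2'}:
        return random.choice([('1', '1'), ('2', '2')])
    return pair

pairs = [('1', '2'), ('1', '1'), ('2', '1'), ('0', '0')]
print([randomize_het(p) for p in pairs])
```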
So this code works on PED files recoded to 1/2 alleles (e.g. plink --recode12). Converting heterozygotes to random homozygotes doesn't really affect the PCA results all that much, but for the formal statistics it is a desirable preprocessing step.

When your samples come out completely nonsensical, most probably you have either messed up the strandedness of the SNPs, so nothing matches anymore, or the low coverage is causing something called the "shrinkage problem". To work around shrinkage and get a really good PCA, run smartpca with projection using your high-coverage reference samples; when selecting the samples to be projected, include the low-coverage ancient individuals (your samples) and also some reference samples for which you artificially reduce the number of SNPs. This worked for me, at least. For SNPs that are missing only in your samples because the coverage is poor, that's OK; you just need to run smartpca and do the projection. However, if the genotyping rate for those SNPs is generally low across your reference samples, you are better off deleting them. As for LD pruning: it turns out it's better to keep more SNPs than to delete half of them just because they are in LD.
Most probably when your samples are completely nonsensical, it's either you have messed up the strandeness of the SNPs, so now nothing matches, or the low coverage causes sth. called "shrinkage problem". To work around shrinkage and have really good PCA, is to run the smartpca and projection using your high coverage reference samples, and when selecting samples to be projected, you select the low coverage ancient individuals (your samples) and also reference samples, for which you artifically reduce the number of SNPs <- this worked for my at least. For the missing SNPs, if they are only missing in your samples, cause the coverage sucks, well that's ok, you just need to run smartpca and do the projection. However if genotyping rate for those SNPs is generally low across your reference samples you are better off deleting them. And about ld prunning <- it turns out that it's better to have more SNPs, than to delete half of them, cause they are in LD.