The objective is to calculate a genetic risk score from the genotyping data, the effect allele, and the effect size. A total of 1.5 million SNPs are included in the bfile's .bim file, and they all have variant IDs of the form Chr:BP, e.g., 1:12345. However, after I run the command
/projects/bsi/gentools/bin/plink2 --bfile GenotypingData --score ScoreFile header sum --threads 6
I have the results:
FID IID PHENO CNT CNT2 SCORESUM
01 01 -9 3067556 2438692 9.411
02 02 -9 3067556 2440321 9.16466
03 03 -9 3067556 2440784 9.50342
04 04 -9 3067556 2443276 10.615
The question is: why is CNT so different from 1.5 million? From the PLINK documentation I understand that CNT is the number of nonmissing alleles used for scoring. The genotyping data has been QCed, so it seems unlikely that many alleles are missing.
What exactly is the problem here? CNT is about 1.5 million * 2 (the doubling is expected for diploid genomes).
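As a quick check of that arithmetic, the sketch below halves the CNT values copied from the output above. The variable names are illustrative only.

```python
# CNT values copied from the results shown in the question.
cnt = {"01": 3067556, "02": 3067556, "03": 3067556, "04": 3067556}

# Each diploid genotype contributes two alleles, so CNT / 2 estimates
# the number of SNPs that were actually scored for each sample.
snps_scored = {iid: n // 2 for iid, n in cnt.items()}

for iid, n in snps_scored.items():
    print(iid, n)
```

The result per sample is close to the 1.5 million SNPs expected in the .bim file.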
Thanks for your answer. If CNT is simply multiplied by 2, then it makes sense. However, CNT2 is much less than CNT, so does that mean a portion of the SNPs are not named?
As explained, CNT is about 2 x #SNP. But how can the discrepancy between CNT2 and either #SNP or 2 x #SNP be explained? I tested several subjects, and the missing rate of SNPs (SNPs from the naming list that are not observed) is very low. For example, of the 1.5 million SNPs of interest, only ~20 are not observed in the genotyping data of subject 1.
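One reading that fits these numbers, assuming CNT2 is the sum of named (effect) allele counts as the PLINK 1.9 .profile format describes it: CNT2 counts the copies of the scored allele a subject carries, not the number of SNPs. Under that reading CNT2 falls below CNT whenever the subject is not homozygous for the effect allele at every SNP, and CNT2 / CNT (roughly 0.8 here) is the subject's average effect-allele fraction. The toy sketch below uses made-up genotypes and a hypothetical `profile` function. It is not PLINK's code, and it simply skips missing genotypes, which may differ from PLINK's default handling.

```python
def profile(dosages, weights):
    """Toy allelic scoring.

    dosages: effect-allele count per SNP (0, 1, 2), or None if missing.
    weights: effect size per SNP.
    Returns (cnt, cnt2, scoresum) under the assumed definitions.
    """
    cnt = 0      # nonmissing alleles used for scoring (2 per genotype)
    cnt2 = 0     # copies of the named (effect) allele carried
    score = 0.0  # sum of dosage * effect size
    for d, w in zip(dosages, weights):
        if d is None:
            continue  # simplification: missing genotypes are skipped
        cnt += 2
        cnt2 += d
        score += d * w
    return cnt, cnt2, score


# Made-up data for one subject: five SNPs, one of them missing.
dosages = [2, 1, 0, None, 1]
weights = [0.1, -0.2, 0.3, 0.5, 0.4]

cnt, cnt2, score = profile(dosages, weights)
print(cnt, cnt2, round(score, 3))
```

In this toy case four SNPs are scored, so CNT is 8, while only 4 effect-allele copies are carried, so CNT2 is smaller even though almost nothing is missing.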