What does NA imply in a plink.raw file for genotype calling?
1
0
Entering edit mode
3.0 years ago

Hi, I used a filtered SNP vcf to run plink using the --recodeAD, --doubleid (my sample names have underscores) and --allow-extra-chr (the name of my chr was not being recognized). My plink.raw file has not only 0, 1, 2 but also many NAs in the samples, distributed randomly. I referred to the manual too, but cant figure out what it implies. It looks something like this:

IND1 0 0 1 0 0 0
IND2 0 0 NA NA 0 0
IND3 0 1 1 0 0 0


This is the relevant part in the manual:

**Using the --recodeAD option generates the file plink-recode.raw:
FID IID PAT MAT SEX PHENOTYPE snp1_2 snp1_HET snp2_G snp2_HET
1 1 0 0 1 1  0  0   2 0
1 2 0 0 2 1  NA NA  1 1
1 3 0 0 1 1  0  0   1 1
1 4 0 0 2 1  1  1   0 0
The column labels reflect the snp name (e.g. snp1) with the name of the minor allele appended (i.e. snp1_2 in the first instance, as 2 is the minor allele) for the additive component. The dominant component ( a dummy variable reflecting heterozygote state) is coded with the _HET suffix.
This file can be easily loaded into R: for example:
For example, for the first SNP, the individuals are coded 1/1, 0/0, 1/1 and 2/1. The additive count of the number of common (1) alleles is therefore: 2, NA, 2 and 1, which is reflected in the field snp1_2. The field snp1_HET is coded 1 for the fourth individual who is heterozygous -- this field can be used to model dominance effect of the allele.
The behavior of the --recodeA and --recodeAD commands can be changed with the --recode-allele command. This allows for the 0, 1, 2 count to reflect the number of a pre-specified allele type per SNP, rather than the number of the minor allele. This command takes as a single argument the name of a file that lists SNP name and allele to report, e.g. if the file recode.txt contained
snp1   1
snp2   A
then
would now report in the LOG file
Reading allele coding list from [ recode.txt ]
Read allele codes for 2 SNPs
FID IID PAT MAT SEX PHENOTYPE snp1_1 snp1_HET snp2_A snp2_HET
1 1 0 0 1 1   2  0   0 0
1 2 0 0 2 1   NA NA  1 1
1 3 0 0 1 1   2  0   1 1
1 4 0 0 2 1   1  1   2 0
If the SNP is monomorphic, by default the allele code out will be 0 and all individuals will have a count of 0 (or NA). If an allele is specified in --recode-allele that is not seen in the data, similarly all individuals will receive a 0 count (i.e. rather than an error being given).**


The ultimate aim is to generate a genotype heat map in R for the genotype (Het and hom) at each SNP position for all the individuals. I'm completely new to analysis. It would be great if I can get some help on this asap- whether it is correct and how it can be modified. and what the NAs mean. Thanks a lot.

SNP • 1.4k views
0
Entering edit mode
0
Entering edit mode
2.9 years ago
irlyn • 0

Although for you it might be too late to answer, but I had the similar problem with --recode option and I found out for my case that it was due to IMPUTE2 output that has probabilities for genotype. So converting oxford format into binary plink considers genotype probabilities uncertainly greater than 0.1 as missing genotype and the rest are treated as hard calls.

0
Entering edit mode

Yes, it is necessary to use plink 2.0 and --make-pgen/--pfile if you want to keep track of decimal values and export them with "--recode AD".