I am getting some raw LD data from plink using this line:
plink --bfile chr1file --recode A --chr 1 --from-bp 123456 --to-bp 987654 --maf 0.001 --out gene_locus_ld.txt
The output file, gene_locus_ld.txt.raw, has a header line laid out like this, for example:
FID IID PAT MAT SEX PHENOTYPE
1:123456:GC:G_GC 1:123454:T:TGTC_TGTC 1:12343:A:G_G 1:12345:A:G_G 1:1234:G:A_G
1:12345:G:T_T 1:50226471:G:A_A 1:123453:C:T_T 1:12341536:C:T_T
My question is: for each SNP ID like "1:12343:A:G_G", which of the three letters is the reference allele and which is the alternative allele? Is it the letters separated by ":" or the letters separated by "_"? So in this example, would I take A:G or G_G?
I have read about the .raw file in plink's documentation, but I'm not sure whether the answer is there and I'm just not getting it, since my variant IDs don't look like the rsID output they describe:
.raw (additive + dominant component file)
Produced by "--recode A" and "--recode AD", for use with R. This format cannot be loaded by PLINK.
A text file with a header line, and then one line per sample with V+6 (for "--recode A") or 2V+6 (for "--recode AD") fields, where V is the number of variants. The first six fields are:
FID        Family ID
IID        Within-family ID
PAT        Paternal within-family ID
MAT        Maternal within-family ID
SEX        Sex (1 = male, 2 = female, 0 = unknown)
PHENOTYPE  Main phenotype value
This is followed by one or two fields per variant:
<Variant ID>_<counted allele>  Allelic dosage (0/1/2/'NA' for diploid variants, 0/2/'NA' for haploid)
<Variant ID>_HET               Dominant component (1 = het, 0 otherwise). Requires "--recode AD".
If 'include-alt' was specified, the header line also names alternate allele codes in parentheses, e.g. 'rs5939319_G(/A)'.
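If I am reading the "<Variant ID>_<counted allele>" description correctly, only the part after the last "_" is the counted allele, and everything before it (colons included) is just the variant ID. To make that reading concrete, here is a minimal Python sketch of how I would split the header columns under that assumption. The helper name parse_raw_column is my own and not part of PLINK, and the sketch says nothing about which allele in the ID itself is the reference.

```python
import re

# Hypothetical helper (not part of PLINK). Assumes each variant column is named
# "<Variant ID>_<counted allele>", optionally followed by "(/<alt>)" when
# 'include-alt' was used, as described in the documentation quoted above.
_RAW_COLUMN = re.compile(
    r"^(?P<variant_id>.+)_(?P<counted>[^_()]+)(?:\(/(?P<alt>[^)]+)\))?$"
)

def parse_raw_column(name):
    """Split a .raw header column into (variant_id, counted_allele, alt_or_None)."""
    match = _RAW_COLUMN.match(name)
    if match is None:
        raise ValueError(f"not a <Variant ID>_<allele> column: {name!r}")
    # Note: with "--recode AD", the dominance columns end in "_HET", so "HET"
    # would come back in the counted-allele slot for those columns.
    return match.group("variant_id"), match.group("counted"), match.group("alt")

for column in ["1:12343:A:G_G", "1:123454:T:TGTC_TGTC", "rs5939319_G(/A)"]:
    print(parse_raw_column(column))
```

Under this reading, "1:12343:A:G_G" would split into the variant ID "1:12343:A:G" and the counted allele "G", but I would like confirmation that this is right.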