Question

How to interpret reference and alternative alleles from raw plink data?

2.3 years ago

genqs • 0

I am exporting raw genotype data for an LD analysis from plink with this command:

plink --bfile chr1file --recode A --chr 1 --from-bp 123456 --to-bp 987654 --maf 0.001 --out gene_locus_ld.txt

The output file gene_locus_ld.txt.raw has a header line that looks like this, for example:

FID IID PAT MAT SEX PHENOTYPE 
1:123456:GC:G_GC 1:123454:T:TGTC_TGTC 1:12343:A:G_G 1:12345:A:G_G 1:1234:G:A_G 
1:12345:G:T_T 1:50226471:G:A_A 1:123453:C:T_T 1:12341536:C:T_T

My question is: for each variant ID such as "1:12343:A:G_G", which of the three letters is the reference allele and which is the alternative allele? Should I read the letters separated by ":" or the letter after the "_"? In this example, would I take A:G or G_G?

I have read the description of the .raw file in plink's documentation, but I can't tell whether the answer is there, because my variant IDs don't look like the rsID-style IDs in their example:

.raw (additive + dominant component file)
Produced by "--recode A" and "--recode AD", for use with R. This format cannot be loaded by PLINK.

A text file with a header line, and then one line per sample with V+6 (for "--recode A") or 2V+6 (for "--recode AD") fields, where V is the number of variants. The first six fields are:
FID        Family ID
IID        Within-family ID
PAT        Paternal within-family ID
MAT        Maternal within-family ID
SEX        Sex (1 = male, 2 = female, 0 = unknown)
PHENOTYPE  Main phenotype value
This is followed by one or two fields per variant:
<Variant ID>_<counted allele> Allelic dosage (0/1/2/'NA' for diploid variants, 0/2/'NA' for haploid)
<Variant ID>_HET  Dominant component (1 = het, 0 otherwise). Requires "--recode AD".
If 'include-alt' was specified, the header line also names alternate allele codes in parentheses, e.g. 'rs5939319_G(/A)'.
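
Going by the format description above, each variant column is named <Variant ID>_<counted allele>: the part after the final underscore is the allele whose copies are counted in that column, and everything before it is the variant ID. Whether the two alleles inside the ID are in reference:alternative order depends on how the IDs were assigned in the source fileset, which the .raw file alone cannot confirm. A minimal Python sketch of splitting such a name, assuming the IDs follow a chrom:pos:allele:allele layout as in my header (parse_raw_column is a hypothetical helper, not part of plink):

```python
def parse_raw_column(name):
    """Split a .raw header field '<Variant ID>_<counted allele>' into parts.

    Intended for "--recode A" columns only; the "_HET" columns written by
    "--recode AD" are not handled here.
    """
    # The counted allele follows the LAST underscore. Split from the right
    # in case the variant ID itself contains underscores.
    variant_id, counted = name.rsplit("_", 1)

    # ASSUMPTION: the variant ID looks like chrom:pos:allele:allele, as in
    # the example header. plink does not enforce any ID layout.
    chrom, pos, first, second = variant_id.split(":")

    # The non-counted allele is whichever ID allele differs from the counted one.
    if counted == second:
        other = first
    elif counted == first:
        other = second
    else:
        other = None  # counted allele does not match either allele in the ID

    return {
        "variant_id": variant_id,
        "chrom": chrom,
        "pos": int(pos),
        "counted": counted,
        "other": other,
    }


info = parse_raw_column("1:12343:A:G_G")
print(info["variant_id"], info["counted"], info["other"])  # 1:12343:A:G G A
```

Under that assumption, a value of 2 in the "1:12343:A:G_G" column would mean two copies of G for that sample. The sketch does not say whether G is the reference or the alternative allele.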

plink GWAS LD genomics • 650 views

updated 2.3 years ago by Ram • written 2.3 years ago by genqs