Question

Plink: Proper file format for linkage disequilibrium

6.8 years ago

JourneyToAbyss ▴ 250

While I feel comfortable with parsing data, I am new to SNP data and having difficulty understanding the proper file formats to feed into plink. The SNP data comes from affy chips on human subjects.

My SNP data has the SNP-ID followed by a single numeric code.

rs3008282   1   1   0   1   1   0

I also have reference data for each SNP. The first column is the SNP-ID, the second column is the chromosome, the third column is the position, the fourth column (e.g. A) is the "reference" and the last column (e.g. T) is the "alternative."

rs3008282   8   174284  A   T

Question 1: I want to make sure I understand this data correctly. My interpretation is that the codes 1 and 0 correspond to the presence or absence of that particular SNP. For instance, the first individual carries the SNP, which has an "A" at position 174284 of chromosome 8, and the "T" merely represents the complementary strand. I am asking because other reference SNPs do not follow Chargaff's rule. For example:

rs17744517  8   182340  A   G

Thus I want to check my understanding in case I am misinterpreting this data.

Question 2: Harvard's plink v1.07 tutorial says the SNPs must be in biallelic form. In the cog-genomics v1.90 documentation, I cannot find any mention of that restriction. Additionally, I am having more difficulty interpreting the toy dataset than the one found on Harvard's site.

Question 2A: What is the difference between position and base-pair coordinate in the .map file?

Question 2B: I am assuming the toy dataset has 2 SNPs present. For individual 1, are these "0 0" (7th/8th columns) and "1 1" (9th/10th columns)?

1 1000000000 0 0 1 1 0 0 1 1

Question 3: If the data should be represented in a biallelic fashion, should I simply assume both alleles are the same (i.e. copy the value) or code it as 0 (i.e. a missing value)?

I hope none of these questions were too stupid.

Thank you =)

plink • 1.5k views

updated 6.8 years ago by Ram 45k • written 6.8 years ago by JourneyToAbyss ▴ 250

score 1 · Answer 1 · 2018-09-22

1. DNA has two strands, on which A/T and C/G are always paired. It's redundant to explicitly track both strands, so the convention is to report just the base on the 5' -> 3' strand. So 'A' really means "A on the 5' -> 3' strand, T on the opposite strand", and 'G' really means "G on the 5' -> 3' strand, C on the opposite strand". The "reference" and "alternative" columns are therefore two different alleles seen at that position, not the two strands, which is why a pair like A/G is perfectly valid.
2. (A) The plink 1.07 documentation and the plink 1.90 documentation are describing the same thing here. (B) That's correct. "0 0" represents a missing genotype call. "1 1" is actually unrepresentative: you'll usually see something like "A A" or "A G" or "G G" here. This was a mistake on my part; I will change the toy dataset accordingly in the next plink 1.9 update.
3. You should only write "0 0" when you don't know the genotype. When you do know it, write something like "A A", "A G", or "G G". (You'll need to know what the affy-chip file formats are to translate their 1s and 0s correctly.)
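To make the column layout concrete, here is a minimal Python sketch (not part of plink itself) that splits one .ped line into its six leading columns and its per-SNP allele pairs. It assumes the standard layout of family ID, individual ID, paternal ID, maternal ID, sex, and phenotype, followed by two alleles per SNP, with "0 0" as the missing call. The names `PedRecord`, `parse_ped_line`, and `MISSING` are made up for illustration.

```python
from typing import NamedTuple

# Assumption: plink's default missing-genotype code, "0" for both alleles.
MISSING = ("0", "0")


class PedRecord(NamedTuple):
    """One row of a .ped file, assuming the standard six leading columns."""
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: str
    phenotype: str
    genotypes: list  # one (allele1, allele2) tuple per SNP, in .map order


def parse_ped_line(line: str) -> PedRecord:
    """Split a whitespace-delimited .ped line into metadata and allele pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns plus two alleles per SNP")
    alleles = fields[6:]
    # Columns 7/8 form the first SNP's genotype, columns 9/10 the second, etc.
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return PedRecord(*fields[:6], genotypes)


# The toy line from the question: two SNPs, the first one a missing call.
record = parse_ped_line("1 1000000000 0 0 1 1 0 0 1 1")
for snp_index, genotype in enumerate(record.genotypes, start=1):
    status = "missing" if genotype == MISSING else "called"
    print(f"SNP {snp_index}: {' '.join(genotype)} ({status})")
```

With real data the second pair would normally be letters such as "A G" rather than "1 1", and the same parser handles that unchanged because it treats alleles as plain strings.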