While I feel comfortable with parsing data, I am new to SNP data and am having difficulty understanding the proper file formats to feed into PLINK. The SNP data comes from Affymetrix chips on human subjects.
My SNP data has the SNP-ID followed by a single numeric code for each individual.
rs3008282 1 1 0 1 1 0
I also have reference data for each SNP. The first column is the SNP-ID, the second column is the chromosome, the third column is the position, the fourth column (e.g. A) is the "reference" and the last column (e.g. T) is the "alternative."
rs3008282 8 174284 A T
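To make the two layouts concrete, here is a minimal Python sketch of how the example rows could be read and paired. The function names and dictionary keys are made up for illustration, and reading each numeric code as one individual is my assumption, not something the data files state.

```python
def parse_genotype_line(line):
    """Split a row such as 'rs3008282 1 1 0 1 1 0' into (snp_id, codes).

    Assumption: each code after the SNP-ID belongs to one individual.
    """
    fields = line.split()
    return fields[0], [int(code) for code in fields[1:]]


def parse_reference_line(line):
    """Split a row such as 'rs3008282 8 174284 A T' into named fields."""
    snp_id, chrom, pos, ref, alt = line.split()
    return {"snp_id": snp_id, "chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}


snp_id, codes = parse_genotype_line("rs3008282 1 1 0 1 1 0")
reference = parse_reference_line("rs3008282 8 174284 A T")

# The two rows are paired on the SNP-ID in the first column.
assert snp_id == reference["snp_id"]
print(snp_id, reference["chrom"], reference["pos"], reference["ref"], reference["alt"], codes)
```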
Question 1: I want to make sure my understanding of this data is correct. My interpretation is that the codes 1 and 0 correspond to the presence or absence of that particular SNP. For instance, the first individual carries the SNP, which has an "A" at position 174284 of chromosome 8, and the "T" merely represents the complementary strand. I am asking because other reference SNPs do not follow Chargaff's rule. For example:
rs17744517 8 182340 A G
Thus I want to check my understanding in case I am misinterpreting this data.
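The mismatch can be reproduced with a small check. This is only a sketch: `is_complementary_pair` is a name invented here, and it tests nothing more than whether the two letters form a Watson-Crick pair. It does not settle what the last two columns actually mean.

```python
# Watson-Crick base pairing: A with T, C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_complementary_pair(ref, alt):
    """Return True if alt is the complementary base of ref."""
    return COMPLEMENT.get(ref.upper()) == alt.upper()


reference_rows = [
    "rs3008282 8 174284 A T",
    "rs17744517 8 182340 A G",
]

for row in reference_rows:
    snp_id, _chrom, _pos, ref, alt = row.split()
    print(snp_id, ref, alt, is_complementary_pair(ref, alt))
```

The first row passes the check and the second does not, which is why I doubt the "complementary strand" reading.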
Question 2: Harvard's PLINK v1.07 tutorial says the SNPs must be in biallelic form. In the cog-genomics PLINK v1.90 documentation, I am not finding any mention of that restriction. Additionally, I am having more difficulty interpreting its toy dataset than the one found on Harvard's site.
Question 2A: What is the difference between position and base-pair coordinate in the .map file?
Question 2B: I am assuming the toy dataset has 2 SNPs present. For individual 1, would these be 0 0 (7th/8th columns) and 1 1 (9th/10th columns)?
1 1000000000 0 0 1 1 0 0 1 1
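Here is a minimal Python sketch of the reading I have in mind for that line. It assumes the line follows the usual .ped layout of six leading columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) followed by two allele columns per SNP. The function and field names are my own, not anything PLINK provides.

```python
# Assumed .ped layout: six leading columns, then one pair of allele columns per SNP.
PED_LEADING_FIELDS = [
    "family_id", "individual_id", "paternal_id", "maternal_id", "sex", "phenotype",
]


def parse_ped_line(line):
    """Split a .ped row into its six leading fields and a list of allele pairs."""
    fields = line.split()
    header = dict(zip(PED_LEADING_FIELDS, fields[:6]))
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected an even number of allele columns")
    # Group the remaining columns two at a time: (7th, 8th), (9th, 10th), ...
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return header, genotypes


header, genotypes = parse_ped_line("1 1000000000 0 0 1 1 0 0 1 1")
print(header["individual_id"], genotypes)
```

Under that assumption the line yields two allele pairs for individual 1000000000, which is where my guess of 2 SNPs comes from.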
Question 3: If the data should be represented in biallelic form, should I simply assume both alleles are the same (i.e. copy the value) or code the second allele as 0 (i.e. a missing value)?
I hope none of these questions were too stupid.
Thank you =)