Question: Plink: Proper file format for linkage disequilibrium
7 months ago, JourneyToAbyss wrote:

While I feel comfortable with parsing data, I am new to SNP data and having difficulty understanding the proper file formats to feed into plink. The SNP data comes from affy chips on human subjects.

My SNP data has the SNP-ID followed by a single numeric code.

rs3008282   1   1   0   1   1   0

I also have reference data for each SNP. The first column is the SNP-ID, the second column is the chromosome, the third column is the position, the fourth column (e.g. A) is the "reference" and the last column (e.g. T) is the "alternative."

rs3008282   8   174284  A   T

Question 1: I want to make sure I understand this data correctly. My interpretation is that the codes 1 and 0 correspond to the presence or absence of that particular SNP. For instance, the first individual carries the SNP, which has an "A" at position 174284 of chromosome 8. The "T" merely represents the complementary strand. I am asking because there are other reference SNPs that do not follow Chargaff's rule. For example:

rs17744517  8   182340  A   G

Thus I want to check my understanding in case I am misinterpreting this data.

Question 2: Harvard's plink v1.07 tutorial says the SNPs must be in biallelic form. For the cog-genomics v1.90 documentation, I cannot find any mention of that restriction. Additionally, I am having more difficulty interpreting its toy dataset than the one found on Harvard's site.

Question 2A: What is the difference between position and base-pair coordinate in the .map file?

Question 2B: I am assuming the toy-dataset has 2 SNPs present. For individual 1, this will be (0 0 or 7th/8th column) and (1 1 or 9th/10th column)?

1 1000000000 0 0 1 1 0 0 1 1

Question 3: If the data should be represented in biallelic form, should I simply assume both alleles are the same (e.g. copy the value) or code it 0 (e.g. missing value)?

I hope none of these questions were too stupid.

Thank you =)

7 months ago, chrchang523 (United States) wrote:
  1. DNA has two strands, where A/T and C/G are always paired. It's redundant to explicitly track both strands, so the convention is to report just the base on the 5' -> 3' strand. So 'A' really means "A on 5' -> 3' strand, T on the opposite strand", and 'G' really means "G on 5' -> 3' strand, C on the opposite strand".

  2. (A) The plink 1.07 documentation and plink 1.90 documentation are describing the same thing here. (B) That's correct. "0 0" represents a missing genotype call. "1 1" is actually unrepresentative: you'll usually see something like "A A" or "A G" or "G G" here. This was a mistake on my part; I will change the toy dataset accordingly in the next plink 1.9 update.

  3. You should only write "0 0" when you don't know the genotype. When you do know it, write something like "A A", "A G", or "G G". (You'll need to know what the affy-chip file formats are to translate its 1s and 0s correctly.)
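
Points 1 and 2(B) can be checked with a small Python sketch. It is only an illustration, not part of plink: the function names (`is_strand_pair`, `parse_ped_line`, `is_missing`) are made up here, and the only format assumption is the standard .ped layout of six leading columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) followed by two allele columns per SNP. It deliberately does not try to translate the affy 1/0 codes, since that depends on the chip's file format.

```python
# Watson-Crick complement: the base found on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_pair(base_a, base_b):
    """Return True if base_b is just the opposite-strand partner of base_a."""
    return COMPLEMENT[base_a] == base_b


def parse_ped_line(line):
    """Split one whitespace-delimited .ped line into its six leading
    fields and a list of (allele1, allele2) genotype pairs."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("a .ped line needs at least six leading columns")
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two allele columns per SNP")
    return {
        "family_id": fields[0],
        "individual_id": fields[1],
        "paternal_id": fields[2],
        "maternal_id": fields[3],
        "sex": fields[4],
        "phenotype": fields[5],
        # Pair up columns 7/8, 9/10, ... into one genotype per SNP.
        "genotypes": list(zip(alleles[0::2], alleles[1::2])),
    }


def is_missing(genotype):
    """A genotype written as "0 0" is a missing call."""
    return genotype == ("0", "0")


if __name__ == "__main__":
    # The toy line from the question: two SNPs for one individual.
    record = parse_ped_line("1 1000000000 0 0 1 1 0 0 1 1")
    for index, genotype in enumerate(record["genotypes"], start=1):
        status = "missing" if is_missing(genotype) else "called"
        print(f"SNP {index}: {' '.join(genotype)} ({status})")

    # A/T could be explained by strand pairing alone; A/G cannot.
    print(is_strand_pair("A", "T"), is_strand_pair("A", "G"))
```

Run on the toy line, this reports the first SNP (columns 7/8) as missing and the second (columns 9/10) as called, matching the reading in Question 2B. The last line shows why rs17744517 (A, G) cannot be a strand-pairing artifact: G is not the complement of A.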
