SNP Genotype Data
13.1 years ago
Haluk ▴ 190

Hi,

I want to cluster HapMap project data using EIGENSTRAT. Currently, I am having difficulty creating the genotype file. The EIGENSTRAT manual says: "The genotype file contains 1 line per SNP. Each line contains 1 character per individual: 0 means zero copies of reference allele. 1 means one copy of reference allele. 2 means two copies of reference allele. 9 means missing data." The following is one row of my huge data set.

rs4475691 C/T chr1 836671 CT CC CC CC CC CT CC TT CC CC CC NN (and so on...)

1st column: SNP ID
2nd column: alleles
3rd column: chromosome
4th column: position

The rest are the patients' genotypes. I know NN means missing data and should be encoded as 9 according to the EIGENSTRAT format, but I am not sure about CT, CC and TT.

Any help would be greatly appreciated.

snp genotyping • 8.6k views
13.1 years ago
Genotepes ▴ 950

Hi,

I'm not sure exactly what you need. Do you need code to convert this into EIGENSTRAT format?

If the question is how to recode CC, TT and CT, then you choose one allele as the reference. You could choose the most frequent one, for instance, or, here, take the first allele listed in your line (I think; I do not know what format this is).

CC = 0 CT = 1 TT = 2

Christian


No, I am not looking for code. I don't understand the idea behind encoding genotypes as 0, 1 or 2. For instance, why did you assign 1 to genotype CT?


Yes, CT is set to 1.

Basically, 0, 1 and 2 are the number of copies of the non-reference allele in the genotype (the reference is chosen arbitrarily; it could be the other allele). The idea is to create a "quantitative" trait for each SNP and apply a PCA-based analysis.

Hanif: you are right and I am sadly wrong. It was a "typo" in the sense that C is the reference allele.

Sorry about that; I am going to vote -1 on my message. On the other hand, for the PCA and clustering here, I'd tend to say that the order is not so important, but better be straight and do things

Christian
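
To illustrate why the choice of reference allele matters little for a PCA-style analysis, here is a minimal sketch in plain Python. It assumes a toy genotype matrix (made-up values, not HapMap data) and only mean-centres each SNP; it leaves out the per-SNP scaling and missing-data handling that a real tool would apply. The helper names are hypothetical. Swapping the reference allele turns each count g into 2 - g, which only flips the sign of the centred values, so the individual-by-individual matrix that the PCA works on is unchanged.

```python
def center(snp):
    """Subtract the SNP's mean allele count from every individual."""
    mean = sum(snp) / len(snp)
    return [g - mean for g in snp]


def sample_covariance(genotypes):
    """Individuals x individuals matrix, summed over mean-centred SNPs."""
    centred = [center(snp) for snp in genotypes]
    n = len(genotypes[0])
    return [[sum(snp[i] * snp[j] for snp in centred) for j in range(n)]
            for i in range(n)]


# Toy data (assumption): 3 SNPs (rows) x 4 individuals (columns).
genotypes = [[1, 2, 2, 0],
             [0, 1, 2, 2],
             [2, 2, 1, 0]]

# Swap the reference allele of the first SNP only: g -> 2 - g.
flipped = [[2 - g for g in snp] if k == 0 else list(snp)
           for k, snp in enumerate(genotypes)]

a = sample_covariance(genotypes)
b = sample_covariance(flipped)
same = all(abs(x - y) < 1e-12 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
print(same)  # True
```

The result is the same whichever SNPs are flipped, which is why an inconsistent but fixed choice per SNP does not change the clustering. It still pays to follow the format's definition so that the file means what other tools expect.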


Actually, since it's a C/T SNP: CC = 2, CT = 1, TT = 0, NN = 9

13.1 years ago

The answer from genotepes is fine. Hanif's comment is also OK, but we don't really know from the limited info which is the true reference allele and which is the derived.

In the case of EIGENSTRAT, any heterozygous genotype will be coded as 1 because it has one copy of the reference allele and one copy of the derived allele.
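
As a minimal sketch of that encoding in Python: the function names below are hypothetical, and the code assumes that the first allele in the "C/T" column is the reference (which, as noted in this thread, is not guaranteed) and that the row has exactly four metadata columns, as in the example row from the question. Real HapMap files may have more leading columns, so `n_meta_columns` would need adjusting.

```python
def encode_genotype(genotype, ref_allele):
    """Count copies of ref_allele in a two-letter genotype; '9' if missing."""
    if len(genotype) != 2 or any(base not in "ACGT" for base in genotype):
        return "9"  # e.g. NN
    return str(genotype.count(ref_allele))


def hapmap_line_to_eigenstrat(line, n_meta_columns=4):
    """Turn one SNP row into one EIGENSTRAT genotype line.

    Assumption: the first allele listed in the second column is the reference.
    """
    fields = line.split()
    ref_allele = fields[1].split("/")[0]
    return "".join(encode_genotype(g, ref_allele)
                   for g in fields[n_meta_columns:])


row = "rs4475691 C/T chr1 836671 CT CC CC CC CC CT CC TT CC CC CC NN"
print(hapmap_line_to_eigenstrat(row))  # 122221202229
```

With C as the reference, this gives CC = 2, CT = 1, TT = 0 and NN = 9, one character per individual and no separators, as the manual excerpt in the question describes.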


Actually I checked on UCSC and found C is .. But again I am not sure all the formats will be reference/alternative.

In most of the imputation-oriented formats I have come across, as far as I remember, that is the case.


Right, it should be that way. Sometimes, though, the alleles are simply listed alphabetically.

7.4 years ago
farid110ir • 0

Hello. This may be a simple question, but how should I identify the genotypes of individuals? I am doing an association study, for which I have obtained the sequence of my gene of interest for each individual. I have done the SNP analysis, but I don't know the next step for genotyping. Could you please assist me?
