What is the relationship between PLINK .ped and .tped files?
6.8 years ago
haohanw ▴ 90

What is the relationship between PLINK .ped and .tped files? From what I have observed, it seems to be more complicated than a simple transpose.

For example, Section 4.1.1 of this manual gives the following example:

1     1     0     0     1     1     1     1     G     G
1     2     0     0     2     1     0     0     A     G
1     3     0     0     1     1     1     1     A     G
1     4     0     0     2     1     2     1     A     A


is transposed as

1     snp1     0     10001     1     1     0     0     1     1     2     1
1     snp2     0     20001     G     G     G     A     G     A     A     A
#                                          ^     ^     ^     ^


rather than what I expected:

1     snp1     0     10001     1     1     0     0     1     1     2     1
1     snp2     0     20001     G     G     A     G     A     G     A     A
#                                          ^     ^     ^     ^

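To make "simple transpose" concrete, here is a minimal Python sketch of a literal column-to-row transposition of the genotype columns. It only illustrates the expectation above and is not how PLINK works internally; the helper name `naive_transpose` and the assumption of six leading metadata columns (family, individual, father, mother, sex, phenotype) are mine.

```python
# Hypothetical helper: a literal PED -> TPED genotype transpose (NOT PLINK's code).
def naive_transpose(ped_rows, n_meta=6):
    """Return one flat allele list per SNP, keeping each individual's allele order."""
    # Drop the leading metadata columns; what remains is two alleles per SNP.
    genos = [row[n_meta:] for row in ped_rows]
    n_snps = len(genos[0]) // 2
    tped = []
    for s in range(n_snps):
        line = []
        for g in genos:
            # Copy the allele pair for SNP s exactly as it appears in the .ped row.
            line.extend(g[2 * s: 2 * s + 2])
        tped.append(line)
    return tped

# The four .ped rows from the manual example quoted above.
ped = [
    "1 1 0 0 1 1 1 1 G G".split(),
    "1 2 0 0 2 1 0 0 A G".split(),
    "1 3 0 0 1 1 1 1 A G".split(),
    "1 4 0 0 2 1 2 1 A A".split(),
]

for alleles in naive_transpose(ped):
    print(" ".join(alleles))
# prints:
# 1 1 0 0 1 1 2 1
# G G A G A G A A
```

The second printed line keeps "A G" for individuals 2 and 3, whereas the manual's .tped shows "G A" for them.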

Why are the alleles reversed here?

I also don't think this reversal is guaranteed to happen: in the example in Section 3.4 of the same manual, it is hard to see any pattern in whether a genotype gets reversed or not.

(I am quite new to this area, and I hope the reason is not something so basic that it counts as common sense in this domain.)

plink SNP GWAS • 3.5k views
6.8 years ago

Interesting, I didn't know about that! Could it be that PLINK internally just sorts the alleles using some arbitrary rules?

I just ran a test with input alleles "G A", "A G" in various combinations with other SNPs and they always came out as "G A" in the transposed dataset.

Similarly, "G T"/"T G" always becomes "G T", "G C"/"C G" always becomes "G C", "A T"/"T A" always becomes "A T", and "A C"/"C A" always becomes "A C". The order can't be alphabetical, because "G A" would then come out as "A G".

The funny thing is, if I repeat the same thing using PLINK2, I get alphabetically sorted alleles: your example becomes G G A G A G A A (and my test-cases become alphabetically sorted, too). That makes me think that it's rather arbitrary and doesn't particularly matter.

Edit: I think it has to do with the way PLINK 1.07 stores genotypes as numbers - if you run

plink --file mytest --recode --transpose


you get the above inconsistent behaviour, but if you run

plink --file mytest --recode12 --transpose


so that all genotypes become numerically recoded, you'll always see "1 2" for all test cases, so the alleles within each genotype seem to be sorted numerically rather than alphabetically!
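Since these genotypes are unphased, the order of the two alleles within a pair should carry no information. A quick way to check that two .tped lines describe the same data is to compare them as unordered pairs. Here is a small Python sketch of that check; the helper name `genotypes` is made up for illustration, and the two allele strings are the ones from the question.

```python
def genotypes(alleles):
    """Group a flat allele list into unordered genotypes (sorted 2-tuples)."""
    return [tuple(sorted(alleles[i:i + 2])) for i in range(0, len(alleles), 2)]

# snp2 as shown in the manual's .tped versus the literal transpose of the .ped.
manual_tped = "G G G A G A A A".split()
literal_transpose = "G G A G A G A A".split()

# Ignoring within-genotype allele order, the two lines are identical.
print(genotypes(manual_tped) == genotypes(literal_transpose))  # True
```

So whichever ordering rule a given PLINK version applies, the transposed file should still encode the same genotypes.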