Problems with converting data
7.1 years ago


I've been trying to solve this problem for a month now, so I thought it'd be time to ask for some help.

I've got a dataset that looks like this (anonymized with x):

ID           ID-87xxxxx  ID-88xxxxx  ID-87xxxxx  ID-96xxxxx
IndividualA  2           1           2           0
IndividualB  1           1           1           1
IndividualC  0           2           2           0
IndividualD  0           0           0           1
IndividualE  1           1           1           2
IndividualF  1           1           2           1
IndividualG  2           0           1           0
IndividualH  1           1           0           1

The values 0, 1 and 2 denote zygosity. Each column represents a marker. For any marker, an individual's genotype is coded as the number of copies of the second allele, meaning:

        0: homozygote for the first allele
        1: heterozygote
        2: homozygote for the second allele
        5: Unknown

I have 55k+ SNPs, and several thousand individuals (with their own unique 14 character long code).

My questions:

  • What is the name of this type of data? (Is it allele count?)
  • How do I convert this kind of data into something else? I am going to use NeEstimator, Structure and other software, and none of them accepts this format. It would be great to convert it to a data type I can use to further convert it to what I need (I know GENEPOP does this well)
  • Is there any program that makes use of this format?

Thank you for reading, and for any help you may provide. I have tried looking for answers to these questions for a long time now.

SNP Conversion allele count zygosity • 1.8k views

What is the name of this type of data? (Is it allele count?) How do I convert this kind of data into something else?

See Roslin Bioinformatics - Law's Laws:

"The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats."


Haha, I see I'm not the only one who's had to struggle with this. Thanks for the laugh though.

7.0 years ago
willgilks ▴ 360

Hi Roymand,

I'd tentatively call that a genotype matrix, and say that it's in dosage format, in that the value in each cell indicates how many copies of the alternate allele each individual has. PLINK 1.9 can read dosage format files as described here and can transform between different genotype data formats.

You could sed-convert 0 to 0/0, 1 to 0/1, 2 to 1/1, and 5 to ./., which is the makings of a VCF-format file, though maybe not so useful. If you go on to replace the '/' with a tab, then you have the makings of a PLINK ped (pedigree) file, which is fairly universal. You might need to specify the alleles at some point.
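
As a rough illustration of that substitution, here is a minimal Python sketch rather than a tested pipeline. It assumes whitespace-separated rows with the individual's ID in the first field and 5 as the only missing-data code, as in the question. The names `DOSAGE_TO_VCF`, `DOSAGE_TO_PED` and `convert_row` are made up for this example. One caveat: PLINK reads allele `0` as missing by default, so for the ped-style output the two alleles are coded `1` and `2` instead of `0` and `1`.

```python
# Map dosage codes to VCF-style genotype strings.
# Assumption: 5 is the only missing-data code, as in the question.
DOSAGE_TO_VCF = {"0": "0/0", "1": "0/1", "2": "1/1", "5": "./."}

# Map dosage codes to ped-style allele pairs.
# PLINK treats allele 0 as missing by default, so the first allele is
# coded 1 and the second allele 2 (an arbitrary choice for this sketch).
DOSAGE_TO_PED = {"0": "1 1", "1": "1 2", "2": "2 2", "5": "0 0"}


def convert_row(line, mapping):
    """Convert one whitespace-separated matrix row; the first field is the ID."""
    fields = line.split()
    sample_id, codes = fields[0], fields[1:]
    # A KeyError here means the row contains a code other than 0, 1, 2 or 5.
    return sample_id, [mapping[code] for code in codes]


# First data row from the question, converted both ways.
sample_id, genotypes = convert_row("IndividualA 2 1 2 0", DOSAGE_TO_VCF)
print(sample_id, "\t".join(genotypes))

sample_id, pairs = convert_row("IndividualA 2 1 2 0", DOSAGE_TO_PED)
print(sample_id, " ".join(pairs))
```

A real ped file also needs the leading family, parent, sex and phenotype columns plus a matching map file, and a real VCF needs a header and per-site columns with samples in columns rather than rows, so this only covers the genotype recoding step.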

Structure and NeEstimator seem to have fairly custom but simple data input formats, so maybe it's just a matter of playing around with bash to get your data right. R has good basic and packaged functions for matrix calculations and population genetics.
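
For example, Structure can take diploid data as two consecutive rows per individual with integer allele codes. The sketch below shows one way the recoding might look in Python. It is written under assumptions that should be checked against the Structure documentation and your own parameter file: missing data is written as -9, the alleles are coded 1 and 2, and no population or flag columns are included. The function name `structure_rows` and the input row (which contains a 5 to show the missing case) are made up for illustration.

```python
MISSING = "-9"  # assumption: must match the MISSING value in your Structure settings

# Assumption: first allele coded 1, second allele coded 2.
DOSAGE_TO_ALLELES = {
    "0": ("1", "1"),          # homozygote for the first allele
    "1": ("1", "2"),          # heterozygote
    "2": ("2", "2"),          # homozygote for the second allele
    "5": (MISSING, MISSING),  # unknown genotype
}


def structure_rows(line):
    """Turn one dosage row into two rows, one allele per locus on each row."""
    fields = line.split()
    sample_id = fields[0]
    pairs = [DOSAGE_TO_ALLELES[code] for code in fields[1:]]
    first = " ".join([sample_id] + [a for a, _ in pairs])
    second = " ".join([sample_id] + [b for _, b in pairs])
    return first, second


# Hypothetical row with one missing genotype in the last column.
for row in structure_rows("IndividualG 2 0 1 5"):
    print(row)
```

With 55k+ SNPs the same function works row by row over the whole file, so the matrix never has to be held in memory at once.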


Thank you very much! I'll play around with plink and see what I can do. Have a nice day!

