Question

Problems with converting data

7.2 years ago

Roymond_olsen • 0

Hello,

I've been trying to solve this problem for a month now, so I thought it'd be time to ask for some help.

I've got a dataset that looks like this (anonymized with x):

ID           ID-87xxxxx  ID-88xxxxx  ID-87xxxxx  ID-96xxxxx
IndividualA  2           1           2           0
IndividualB  1           1           1           1
IndividualC  0           2           2           0
IndividualD  0           0           0           1
IndividualE  1           1           1           2
IndividualF  1           1           2           1
IndividualG  2           0           1           0
IndividualH  1           1           0           1

The 0, 1 and 2 depict zygosity. Each column represents a marker. For any marker, an individual's genotype is coded as the number of copies of the second allele, meaning:

        0: homozygote for the first allele
        1: heterozygote
        2: homozygote for the second allele
        5: Unknown

I have 55k+ SNPs and several thousand individuals (each with its own unique 14-character code).

My questions:

What is the name of this type of data? (Is it allele count?)
How do I convert this kind of data into something else? I am going to use NeEstimator, Structure and other software, and none of them accept this format. It would be great to convert it to an intermediate format that I can then convert to whatever I need (I know GENEPOP does this well).
Is there any program that makes use of this format?

Thank you for reading, and for any help you may provide. I have tried looking for answers to these questions for a long time now.

SNP Conversion allele count zygosity • 1.8k views

ADD COMMENT • link updated 7.2 years ago by willgilks ▴ 360 • written 7.2 years ago by Roymond_olsen • 0

What is the name of this type of data? (Is it allele count?) How do I convert this kind of data into something else?

see Roslin Bioinformatics - Law's Laws : http://bioinformatics.roslin.ac.uk/lawslaws/

"The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats."

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 161k

Haha, I see I'm not the only one who's had to struggle with this. Thanks for the laugh though.

ADD REPLY • link 7.2 years ago by Roymond_olsen • 0

score 1 · Answer 1 · 2017-02-12

Hi Roymond,

I'd tentatively call that a genotype matrix in dosage format, in that the value in each cell indicates how many copies of the alternate allele each individual has. Plink 1.9 can read dosage format files, as described here: https://www.cog-genomics.org/plink2/assoc#dosage, and can transform between different genotype data formats.

You could sed-convert 0 to 0/0, 1 to 0/1, 2 to 1/1, and 5 to ./., which gives you the makings of a vcf-format file, though maybe not so useful. If you go on to replace the '/' with a tab, then you have the makings of a plink ped (pedigree) file, which is fairly universal. You will need to specify the alleles at some point, since plink ped files treat 0 as the missing-genotype code by default.
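As an alternative to sed, here is a minimal Python sketch of the same recoding, going straight to ped-style lines. It rests on a few assumptions that are not in the original data: alleles are labelled 1 (first allele) and 2 (second allele), 0 is used for missing calls, the family ID simply reuses the individual ID, and parents, sex and phenotype are filled with "unknown" placeholders. The function names and the marker names M1..M4 are made up for illustration.

```python
# Map each dosage code to a pair of allele labels.
# Assumption: "1" = first allele, "2" = second allele, "0" = missing call.
DOSAGE_TO_ALLELES = {
    "0": ("1", "1"),  # homozygote for the first allele
    "1": ("1", "2"),  # heterozygote
    "2": ("2", "2"),  # homozygote for the second allele
    "5": ("0", "0"),  # unknown genotype
}


def parse_matrix(text):
    """Parse a whitespace-delimited genotype matrix into (markers, rows)."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    markers = lines[0][1:]  # header row: first field is the "ID" label
    rows = []
    for fields in lines[1:]:
        sample, codes = fields[0], fields[1:]
        if len(codes) != len(markers):
            raise ValueError(
                f"{sample}: expected {len(markers)} genotypes, got {len(codes)}"
            )
        rows.append((sample, codes))
    return markers, rows


def to_ped_line(sample, codes):
    """Build one ped-style line: six leading columns, then two alleles per marker."""
    # Columns: family ID, individual ID, father, mother, sex, phenotype.
    # All but the IDs are placeholders meaning "unknown" (an assumption).
    fields = [sample, sample, "0", "0", "0", "-9"]
    for code in codes:
        fields.extend(DOSAGE_TO_ALLELES[code])  # KeyError on an unexpected code
    return " ".join(fields)


example = """ID M1 M2 M3 M4
IndividualA 2 1 2 0
IndividualB 1 1 5 1
"""

markers, rows = parse_matrix(example)
ped_lines = [to_ped_line(sample, codes) for sample, codes in rows]
print("\n".join(ped_lines))
```

A ped file also needs a companion map file listing the markers in the same order, which this sketch does not write. With 55k+ markers and several thousand individuals you would probably want to stream the input line by line rather than hold it all in memory.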

Structure and NeEstimator seem to have fairly custom but simple data input formats, so maybe it's just a matter of playing around with bash to get your data right. R has good built-in and packaged functions for matrix calculations and population genetics.
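To give an idea of how little code such a custom layout might take, here is a rough Python sketch that writes each individual as two rows, one allele per row per marker, with -9 for missing data. That layout and the -9 value are assumptions on my part about what a Structure-style input could look like, so check them against the program's documentation before relying on this. The allele labels 1 and 2 and the function name are also just illustrative.

```python
MISSING = "-9"  # assumed missing-data value; check the target program's docs

# Dosage code -> (allele on first row, allele on second row).
# Assumption: alleles are labelled "1" (first) and "2" (second).
DOSAGE_TO_ALLELES = {
    "0": ("1", "1"),
    "1": ("1", "2"),
    "2": ("2", "2"),
    "5": (MISSING, MISSING),
}


def to_two_row_block(sample, codes):
    """Return two lines for one diploid individual, one allele per marker per line."""
    first = [sample]
    second = [sample]
    for code in codes:
        allele_a, allele_b = DOSAGE_TO_ALLELES[code]
        first.append(allele_a)
        second.append(allele_b)
    return [" ".join(first), " ".join(second)]


block = to_two_row_block("IndividualA", ["2", "1", "5", "0"])
print("\n".join(block))
```

The phase of heterozygotes is arbitrary here (the first allele always lands on the first row), which should not matter for unphased SNP data.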