Question: Problems with converting data
Roymond_olsen wrote (2.2 years ago):

Hello,

I've been trying to solve this problem for a month now, so I thought it'd be time to ask for some help.

I've got a dataset that looks like this (anonymized with x):

ID           ID-87xxxxx  ID-88xxxxx  ID-87xxxxx  ID-96xxxxx
IndividualA  2           1           2           0
IndividualB  1           1           1           1
IndividualC  0           2           2           0
IndividualD  0           0           0           1
IndividualE  1           1           1           2
IndividualF  1           1           2           1
IndividualG  2           0           1           0
IndividualH  1           1           0           1

The 0, 1 and 2 depict zygosity. Each column represents a marker. For any marker, an individual's genotype is coded as the count of copies of the second allele, meaning:

        0: homozygote for the first allele
        1: heterozygote
        2: homozygote for the second allele
        5: Unknown

I have 55k+ SNPs, and several thousand individuals (with their own unique 14 character long code).

My questions:

  • What is the name of this type of data? (Is it allele count?)
  • How do I convert this kind of data into something else? I am going to use NeEstimator, Structure and other software, and none of them accepts this format. It would be great to convert it to a data type I can use to further convert it to what I need (I know GENEPOP does this well)
  • Is there any program that makes use of this format?

Thank you for reading, and for any help you may provide. I have tried looking for answers to these questions for a long time now.

modified 2.2 years ago by willgilks • written 2.2 years ago by Roymond_olsen

What is the name of this type of data? (Is it allele count?) How do I convert this kind of data into something else?

see Roslin Bioinformatics - Law's Laws : http://bioinformatics.roslin.ac.uk/lawslaws/

"The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats."

modified 2.2 years ago • written 2.2 years ago by Pierre Lindenbaum

Haha, I see I'm not the only one who's had to struggle with this. Thanks for the laugh though.

written 2.2 years ago by Roymond_olsen

willgilks (United Kingdom) wrote (2.2 years ago):

Hi Roymond,

I'd tentatively call that a genotype matrix, and say that it's in dosage format, in that the value in each cell indicates how many copies of the alternate allele each individual has. PLINK 1.9 can read dosage-format files, as described here: https://www.cog-genomics.org/plink2/assoc#dosage, and can transform between different genotype data formats.

You could sed-convert 0 to 0/0, 1 to 0/1, 2 to 1/1, and 5 to ./., which gives you the makings of a VCF-format file, though maybe not so useful. If you go on to replace the '/' with a tab, then you have the makings of a PLINK ped (pedigree) file, which is fairly universal. You might need to specify the alleles at some point.
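
For anyone who prefers a script to sed, here is a minimal Python sketch of the substitution described above. It is an illustration under stated assumptions, not a tested pipeline: the family ID `FAM1` is a placeholder, and the ped alleles are written as 1 and 2 rather than 0 and 1, because PLINK reads 0 as the missing-allele code in ped files by default. A matching .map file with marker positions would still be needed before PLINK could load the result.

```python
# Map each dosage code to a VCF-style genotype string.
# 5 is the "unknown" code from the question.
DOSAGE_TO_GT = {"0": "0/0", "1": "0/1", "2": "1/1", "5": "./."}

# PED-style allele pairs. Assumption: alleles are written as 1 (first allele)
# and 2 (second allele), because PLINK reads 0 as missing in .ped files.
DOSAGE_TO_PED = {"0": "1 1", "1": "1 2", "2": "2 2", "5": "0 0"}


def convert_row(line, mapping):
    """Return (individual_id, [converted genotypes]) for one whitespace-separated row."""
    fields = line.split()
    return fields[0], [mapping[code] for code in fields[1:]]


def to_ped_line(line, family_id="FAM1"):
    """Build one .ped line: six leading columns, then two alleles per marker.

    family_id is a placeholder; parents, sex and phenotype are set to
    PLINK's conventional 'missing' values (0, 0, 0, -9).
    """
    ind, genos = convert_row(line, DOSAGE_TO_PED)
    return " ".join([family_id, ind, "0", "0", "0", "-9"] + genos)


if __name__ == "__main__":
    row = "IndividualA 2 1 2 0"
    print(convert_row(row, DOSAGE_TO_GT))
    print(to_ped_line(row))
```

An unexpected code (anything other than 0, 1, 2 or 5) raises a `KeyError`, which is probably what you want with 55k+ markers: better to stop than to write a silently wrong genotype.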

Structure and NeEstimator seem to have fairly custom but simple data input formats, so maybe it's just a matter of playing around with bash to get your data right. R has good basic and packaged functions for matrix calculations and population genetics.
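
Since GENEPOP was mentioned in the question as a useful intermediate, the same idea can be sketched for that format: a title line, one locus name per line, a `Pop` line, then one line per individual with the ID, a comma, and a four-digit genotype per locus. The sketch below assumes two-digit allele codes (01 and 02, with 0000 for missing) and puts every individual in a single population; the marker names `M1` to `M4` are made up for the example. Check each program's manual for the exact variant it accepts.

```python
# Two-digit allele codes: 01 = first allele, 02 = second allele, 00 = missing.
DOSAGE_TO_GENEPOP = {"0": "0101", "1": "0102", "2": "0202", "5": "0000"}


def matrix_to_genepop(lines, title="Converted from dosage matrix"):
    """Convert a whitespace-separated dosage matrix (header row first) to GENEPOP text.

    All individuals are placed in a single population, which is an
    assumption; real data would need one 'Pop' block per population.
    """
    rows = [ln.split() for ln in lines if ln.strip()]  # skip blank lines
    header, data = rows[0], rows[1:]
    out = [title]
    out.extend(header[1:])  # one locus name per line
    out.append("Pop")
    for fields in data:
        genos = " ".join(DOSAGE_TO_GENEPOP[c] for c in fields[1:])
        out.append(f"{fields[0]} , {genos}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    example = [
        "ID M1 M2 M3 M4",  # hypothetical marker names
        "IndividualA 2 1 2 0",
        "IndividualD 0 0 0 1",
    ]
    print(matrix_to_genepop(example), end="")
```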

written 2.2 years ago by willgilks

Thank you very much! I'll play around with plink and see what I can do. Have a nice day!

written 2.2 years ago by Roymond_olsen