It would be great if anybody here could help me understand AWClust's input data format. It is described in the manual, but I don't fully understand it.
This is the description:
The first row in the SNP file contains names or IDs of the individuals in the dataset separated by white space. Each subsequent row represents a single SNP and the different alleles each individual has for that SNP, also separated by white space. The SNP information is encoded as numeric values (i.e. 0, 1, or 2) to represent the number of variant SNP alleles in genotypes (i.e. 0 implies that there are no SNP variants in the genotype, 1 for heterozygotes and 2 for homozygotes for SNP variants), and -1 is used to represent missing values.
They also give a sample snippet of the data:
CEU1 CEU2 CEU3 CEU4 CEU5
1 0 1 0 1
1 1 2 0 2
-1 1 0 1 2
1 1 2 1 1
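To make my reading of the description concrete, here is a minimal Python sketch that parses the sample, assuming the snippet is one header row of five IDs followed by four SNP rows of five values each. The function name `parse_snp_matrix` is my own and is not part of AWClust.

```python
# Sample from the manual, laid out as I understand it:
# one header row of individual IDs, then one row per SNP.
sample = """CEU1 CEU2 CEU3 CEU4 CEU5
1 0 1 0 1
1 1 2 0 2
-1 1 0 1 2
1 1 2 1 1"""


def parse_snp_matrix(text):
    """Return (ids, rows); rows[i][j] is the code for individual j at SNP i."""
    lines = [line.split() for line in text.strip().splitlines()]
    ids = lines[0]
    rows = [[int(value) for value in line] for line in lines[1:]]
    for row in rows:
        # Every SNP row should have exactly one value per individual.
        if len(row) != len(ids):
            raise ValueError("row length does not match number of individuals")
    return ids, rows


ids, rows = parse_snp_matrix(sample)
print(len(ids), "individuals,", len(rows), "SNPs")
print("CEU1 at the third SNP:", rows[2][0])  # -1 marks a missing genotype
```

If this reading is right, each column belongs to one individual and each row to one SNP, so the number of rows after the header equals the number of SNPs.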
I get the encoding logic, but I am unsure about each row representing a single SNP. Does that mean I would get something like the example below? Suppose I have a PED file like this:
FAM001 1 0 0 1 2 A A G G A C
FAM002 2 0 0 1 2 A A A G 0 0
AWClust's input file format would be:
2 2 (for AA)
2 2 (for AA)
2 1 (for GG)
1 2 (for AG)
2 0 (for AC)
Do we need another row for the missing genotype (0 0) as well?
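For comparison, here is a rough Python sketch of an alternative reading, in which the output has one row per SNP and one column per individual. It rests on assumptions that the manual excerpt does not confirm: the "variant" allele is taken to be the least frequent observed allele at each SNP (ties broken alphabetically), a SNP with only one observed allele is coded 0 for everyone, and the column header is built from the family and individual IDs. The function name `ped_to_matrix` is hypothetical.

```python
from collections import Counter

ped_text = """FAM001 1 0 0 1 2 A A G G A C
FAM002 2 0 0 1 2 A A A G 0 0"""


def ped_to_matrix(text):
    """Convert PED text to (ids, rows) with one row per SNP."""
    ids, genotypes = [], []
    for line in text.strip().splitlines():
        fields = line.split()
        # Columns 1-6 are family ID, individual ID, father, mother, sex, phenotype.
        ids.append(f"{fields[0]}_{fields[1]}")
        alleles = fields[6:]
        # Pair up consecutive alleles: one (a1, a2) tuple per SNP.
        genotypes.append(list(zip(alleles[0::2], alleles[1::2])))

    rows = []
    for snp in range(len(genotypes[0])):
        counts = Counter(a for g in genotypes for a in g[snp] if a != "0")
        # ASSUMPTION: the variant allele is the least frequent observed allele.
        ranked = sorted(counts, key=lambda a: (counts[a], a))
        variant = ranked[0] if len(ranked) > 1 else None
        row = []
        for g in genotypes:
            a1, a2 = g[snp]
            if a1 == "0" or a2 == "0":
                row.append(-1)  # missing genotype
            else:
                row.append(int(a1 == variant) + int(a2 == variant))
        rows.append(row)
    return ids, rows


ids, rows = ped_to_matrix(ped_text)
print(" ".join(ids))
for row in rows:
    print(" ".join(str(value) for value in row))
```

Under these assumptions the two-individual, three-SNP PED above yields three data rows of two values each, and the 0 0 genotype becomes -1 inside the third row rather than a row of its own.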
I would really appreciate it if anybody could explain this to me and point me to a known tool for converting a PED file to this format.