Question

Understanding AWClust's SNP file data format and converting a PED file to it

8.8 years ago

aritra90 ▴ 70

Hi,

It would be great if anybody here could help me understand AWClust's input data format. It is described in the manual, but I don't completely understand it.

This is the description:

The first row in the SNP file contains names or IDs of the individuals in the dataset separated by white space. Each subsequent row represents a single SNP and the different alleles each individual has for that SNP, also separated by white space. The SNP information is encoded as numeric values (i.e. 0, 1, or 2) to represent the number of variant SNP alleles in genotypes (i.e. 0 implies that there are no SNP variants in the genotype, 1 for heterozygotes and 2 for homozygotes for SNP variants), and -1 is used to represent missing values.

They also give a sample snippet of the data:

CEU1 CEU2 CEU3 CEU4 CEU5
1 0 1 0 1
1 1 2 0 2
-1 1 0 1 2
1 1 2 1 1
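To check my reading of the description, here is a minimal Python sketch that parses text in this layout: a header row of IDs, then one row per SNP holding one code per individual. The function name `parse_snp_file` is my own and is not part of AWClust; the only things taken from the manual are the whitespace separation and the codes 0, 1, 2 and -1.

```python
def parse_snp_file(text):
    """Parse SNP text: first row = individual IDs, each later row = one SNP."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    ids, rows = lines[0], lines[1:]
    genotypes = {name: [] for name in ids}
    for row_number, row in enumerate(rows, start=1):
        if len(row) != len(ids):
            raise ValueError(f"SNP row {row_number} has {len(row)} values, expected {len(ids)}")
        for name, value in zip(ids, row):
            code = int(value)
            # Only the four codes named in the manual are accepted.
            if code not in (-1, 0, 1, 2):
                raise ValueError(f"unexpected genotype code {code} in SNP row {row_number}")
            genotypes[name].append(code)
    return genotypes


sample = """CEU1 CEU2 CEU3 CEU4 CEU5
1 0 1 0 1
1 1 2 0 2
-1 1 0 1 2
1 1 2 1 1"""

parsed = parse_snp_file(sample)
print(parsed["CEU1"])  # [1, 1, -1, 1]
```

Read this way, each column is one individual and each row is one SNP, so CEU1 is missing a call at the third SNP.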

I get the encoding logic, but I am unsure what it means for each row to represent a single SNP. Does that mean I would get something like the output below? Suppose I have a PED file like this:

FAM001  1  0 0  1  2  A A  G G  A C
FAM002  2  0 0  1  2  A A  A G  0 0

Would the AWClust input file then be:

FAM001 FAM002
2 2 (for AA)
2 2 (for AA)
2 1 (for GG)
1 2 (for AG)
2 0 (for AC)

Do we need another row for the missing genotype 0 0 as well?

I would really appreciate it if anybody could explain this to me and point me to a known tool, if there is one, for converting a PED file to this format.
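For reference, here is a rough Python sketch of how such a conversion might look. It rests on several assumptions that are mine and not from the AWClust manual: the PED file has the standard six leading columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) followed by two allele columns per SNP; `0` marks a missing allele; each SNP is biallelic; the "variant" allele is the less frequent one among the individuals in the file (ties broken alphabetically); and the family ID is used as the column label, as in my example. The function name `ped_to_snp_rows` is hypothetical.

```python
from collections import Counter

MISSING = "0"  # PED convention for a missing allele


def ped_to_snp_rows(ped_lines):
    """Return (ids, rows), where each row holds one SNP's codes for all individuals."""
    ids, per_individual = [], []
    for line in ped_lines:
        fields = line.split()
        if not fields:
            continue
        ids.append(fields[0])          # assumption: label columns by family ID
        alleles = fields[6:]           # skip the six leading PED columns
        if len(alleles) % 2:
            raise ValueError(f"odd number of allele columns for {fields[0]}")
        per_individual.append(list(zip(alleles[0::2], alleles[1::2])))

    n_snps = len(per_individual[0])
    if any(len(ind) != n_snps for ind in per_individual):
        raise ValueError("individuals have different numbers of SNPs")

    rows = []
    for s in range(n_snps):
        calls = [ind[s] for ind in per_individual]
        counts = Counter(a for pair in calls for a in pair if a != MISSING)
        # Assumption: the less frequent allele is the variant; none if monomorphic.
        variant = min(counts, key=lambda a: (counts[a], a)) if len(counts) > 1 else None
        row = []
        for a1, a2 in calls:
            if MISSING in (a1, a2):
                row.append(-1)         # missing genotype
            else:
                row.append(int(a1 == variant) + int(a2 == variant))
        rows.append(row)
    return ids, rows


ped = [
    "FAM001  1  0 0  1  2  A A  G G  A C",
    "FAM002  2  0 0  1  2  A A  A G  0 0",
]
ids, rows = ped_to_snp_rows(ped)
print(" ".join(ids))
for row in rows:
    print(" ".join(str(code) for code in row))
```

Under these assumptions the two-line example has three SNPs (A A / A A, G G / A G, A C / 0 0), so the output has three SNP rows below the header, and the 0 0 genotype becomes -1 within its SNP's row instead of getting a row of its own. Which allele counts as the variant depends on the dataset and on what AWClust expects, so that choice would need checking against the manual.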

Thanks

plink allele SNP ped • 2.7k views

updated 18 months ago by Ram 43k • written 8.8 years ago by aritra90 ▴ 70