Genetic PCA from poolseq genotype file
4.1 years ago
AP ▴ 100

Hello,

I have a sync file generated with PoPoolation2 that looks like this:

Contig    Position  Ref    Pool1           Pool2           Pool3           Pool4
SCAFOLD1    11722   A   330:0:0:0:0:0   315:0:0:0:0:0   334:0:0:0:0:0   111:0:0:0:0:0
SCAFOLD1    11723   T   0:330:0:0:0:0   0:316:0:0:0:0   0:334:0:0:0:0   0:111:0:0:0:0
SCAFOLD1    11725   T   0:327:0:0:0:0   0:314:0:0:0:0   0:329:0:0:0:0   0:111:0:0:0:0
SCAFOLD1    11726   A   330:0:0:0:0:0   314:0:0:0:0:0   332:0:0:0:0:0   111:0:0:0:0:0


Each cell contains the allele counts for one pool at that position, in the order A:T:C:G:N:del (e.g. 330:0:0:0:0:0 means 330 reads supporting A and none for anything else).
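To make the cell layout concrete, here is a minimal Python sketch that splits one cell into named counts. The field order A:T:C:G:N:del is assumed from the PoPoolation2 sync format, and the helper name `parse_sync_cell` is made up for illustration.

```python
# Field order assumed from the PoPoolation2 sync format: A:T:C:G:N:del
SYNC_FIELDS = ("A", "T", "C", "G", "N", "del")


def parse_sync_cell(cell):
    """Turn one sync cell such as '330:0:0:0:0:0' into a dict of counts."""
    counts = [int(x) for x in cell.split(":")]
    if len(counts) != len(SYNC_FIELDS):
        raise ValueError(f"expected 6 counts, got {len(counts)}: {cell!r}")
    return dict(zip(SYNC_FIELDS, counts))


# Pool2 at position 11723 in the excerpt above
print(parse_sync_cell("0:316:0:0:0:0"))
# -> {'A': 0, 'T': 316, 'C': 0, 'G': 0, 'N': 0, 'del': 0}
```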

I would like to run a genetic PCA on this dataset, just as one would on a 012 file extracted with VCFtools. I guess one could convert the sync file to a single value per cell by summing the counts of all non-reference alleles, and work from that.
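The conversion described above could be sketched roughly as follows. This is only an illustration: the function name `non_ref_counts` is hypothetical, N and deletion counts are ignored by assumption, and the second example line is invented so that there is a variant to show.

```python
def non_ref_counts(sync_line):
    """Return (contig, position, [non-reference count per pool]) for one sync line."""
    contig, pos, ref, *cells = sync_line.split()
    # Position of the reference base within the A:T:C:G part of each cell.
    # Raises ValueError if the reference is not one of A, T, C, G (e.g. N).
    ref_index = "ATCG".index(ref.upper())
    result = []
    for cell in cells:
        counts = [int(x) for x in cell.split(":")][:4]  # keep A, T, C, G only
        result.append(sum(counts) - counts[ref_index])
    return contig, int(pos), result


# A line from the excerpt: every pool is fixed for the reference allele
print(non_ref_counts("SCAFOLD1 11722 A 330:0:0:0:0:0 315:0:0:0:0:0 334:0:0:0:0:0 111:0:0:0:0:0"))
# -> ('SCAFOLD1', 11722, [0, 0, 0, 0])

# An invented line with some non-reference reads
print(non_ref_counts("SCAFOLD1 11730 A 300:30:0:0:0:0 310:4:0:0:0:0"))
# -> ('SCAFOLD1', 11730, [30, 4])
```

Raw counts scale with coverage, which differs between pools in the excerpt (about 330 reads in Pool1 versus 111 in Pool4), so dividing each count by the pool's total depth to get a frequency may be the safer input for a PCA.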

Does anybody have experience with that? Any opinion/comment would be very helpful.

Thanks!

Hi, did you find out how to perform the PCA? I also obtained a sync file using PoPoolation2 and a VCF using GATK, and I have been trying to run a PCA on either file, but with no success yet. Thank you,

Natalia

I managed it by following your method. Thanks a million!

2.8 years ago
AP ▴ 100

Hi Natalia,

Yes, I did manage to run a PCA using the sync file. The way I did it was to first calculate the frequency of the minor (or major) allele in each pool for every SNP. Then I ran a PCA in R using prcomp, with pools as rows and SNPs as columns. Instead of the frequency, you can also just use the total count of the minor or major allele. You can also do the same on a 012 file.
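For anyone who wants to see the steps spelled out, here is a rough Python/NumPy sketch of the same idea. It is not the R code I used: NumPy's SVD stands in for prcomp (which, by default, centres each column without scaling it and also works through an SVD). The function names and the three SNPs are invented for illustration, and every pool is assumed to have non-zero coverage at every SNP.

```python
import numpy as np


def major_allele_freq(cells):
    """Frequency, in each pool, of the allele that is most common across all pools."""
    # Keep only the A, T, C, G counts of each cell (N and deletions are dropped).
    counts = np.array([[int(x) for x in c.split(":")][:4] for c in cells], dtype=float)
    major = counts.sum(axis=0).argmax()
    # Assumes every pool has at least one read at this SNP.
    return counts[:, major] / counts.sum(axis=1)


def pca(freqs):
    """PCA of a pools-by-SNPs matrix: centre each SNP column, then take the SVD."""
    centred = freqs - freqs.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    scores = u * s                      # coordinates of each pool on the PCs
    explained = s**2 / np.sum(s**2)     # proportion of variance per PC
    return scores, explained


# Invented sync cells: 3 SNPs (rows) x 4 pools (columns)
snps = [
    ["300:30:0:0:0:0", "310:5:0:0:0:0", "200:130:0:0:0:0", "100:11:0:0:0:0"],
    ["0:250:80:0:0:0", "0:300:14:0:0:0", "0:150:180:0:0:0", "0:90:21:0:0:0"],
    ["10:0:0:320:0:0", "150:0:0:160:0:0", "20:0:0:310:0:0", "60:0:0:51:0:0"],
]

# Transpose so that pools are rows and SNPs are columns, as for prcomp
freqs = np.array([major_allele_freq(cells) for cells in snps]).T
scores, explained = pca(freqs)
print(scores[:, :2])   # PC1 and PC2 for each pool
print(explained)
```

Sites that are fixed in all pools, like the four lines in the question, contribute nothing after centring, so they can be filtered out beforehand.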

Hope that helps! AP