Question: PCA using HapMap population data
3.9 years ago by
eze.anokian wrote:

Hi there,

This is my first post at BioStars. I am new to bioinformatics and have a problem that should be easy to solve, but I cannot sort it out, so I would be glad if you could help me with it.

The question is simple. I have a VCF file with genotype data for many samples. The rows are SNPs and the columns are the standard VCF fields (#CHROM, ..., INFO) followed by the sample IDs. I would like to filter out the non-European samples based on these genotype data, using HapMap as a reference. I was told I had to do a PCA. I have tried several tools for this (Shellfish, Beagle, SNPRelate), but I could not solve the problem. With SNPRelate I could run the PCA, but it only clusters my unlabeled samples, and I need to associate them with HapMap populations (CEU, YRI, JPT, CHB). Shellfish, on the other hand, fails at run time with an uninformative error:

Exception: command gtool -P --ped shellfish-temp-15479/146134504516.ped --map shellfish-temp-15479/ --og shellfish-temp-15479/146134504516.gen --os shellfish-temp-15479/146134504516.sample --discrete_phenotype 0 >> shellfish.log exited with code 256 (1)

And in file shellfish.log:

Note: No phenotypes present.
--recode to plink.ped + ... done.
Unknown parameter: 0

What steps and tools would you recommend? I can use any tool you think is suitable for this.

Sorry for this, it may be an easy problem, but I have spent 2 days trying several tools.

Many thanks.

genotype pca vcf hapmap • 3.2k views
3.9 years ago by
Floris Brenk wrote:


I don't have experience with those tools, but I usually use plink.

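First convert your VCF file to PLINK format, using for example vcftools. This writes text .ped/.map files, so also convert them to binary format (.bed/.bim/.fam), because the commands below use --bfile:

./vcftools --vcf input_data.vcf --plink --out output_in_plink
plink --file output_in_plink --make-bed --out output_in_plink --noweb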

Then download the HapMap data in PLINK format from the PLINK website.

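Then extract SNP lists from both datasets and filter each dataset on the other's list, so that only the overlapping SNPs remain (--write-snplist writes the list to <out>.snplist):

plink --bfile fileA --write-snplist --out list1 --noweb
plink --bfile fileB --extract list1.snplist --noweb --out fileB_filtered --make-bed
plink --bfile fileB_filtered --write-snplist --out list2 --noweb
plink --bfile fileA --extract list2.snplist --noweb --out fileA_filtered --make-bed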
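
If you would rather build the shared SNP list yourself, here is a minimal sketch that intersects the variant IDs (column 2) of the two .bim files with awk. The function name shared_snps is made up for illustration, and it assumes both datasets use the same SNP identifiers (e.g. rsIDs) and that the first file is not empty.

```shell
# shared_snps BIM_A BIM_B
# Print the SNP IDs (column 2 of a PLINK .bim file) that occur in both files.
# shared_snps is a hypothetical helper name, not a plink command.
shared_snps() {
    awk 'NR == FNR { seen[$2] = 1; next }   # first file: remember every SNP ID
         ($2 in seen) { print $2 }          # second file: print IDs seen before
        ' "$1" "$2"
}

# Example (file names follow the placeholders used in this answer):
#   shared_snps fileA.bim fileB.bim > shared.snplist
```

The resulting shared.snplist has one SNP ID per line, so it can be passed to --extract for both datasets.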

Then merge the files and make an MDS plot:

plink --bfile fileA_filtered --bmerge fileB_filtered.bed fileB_filtered.bim fileB_filtered.fam --noweb --out merged --make-bed

plink --bfile merged --cluster --mds-plot 2 --noweb --out mds

This can easily be plotted using R or even Excel, and then you can see which samples cluster with which ethnic background.
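
Before plotting, it helps to attach a population label to each row of the MDS output. A minimal sketch with awk follows. It assumes the plink .mds layout (header line, then FID IID SOL C1 C2) and a two-column file of HapMap sample IDs and populations that you would have to prepare yourself. Both the function name label_mds and the file name hapmap_pops.txt are hypothetical.

```shell
# label_mds POP_FILE MDS_FILE
# POP_FILE: assumed two-column file "IID POP" for the HapMap samples.
# MDS_FILE: plink MDS output with header FID IID SOL C1 C2.
# Prints IID, label, C1, C2 (tab-separated); samples absent from
# POP_FILE are labelled STUDY.
label_mds() {
    awk 'NR == FNR { pop[$1] = $2; next }   # first file: store the labels
         FNR == 1  { next }                 # second file: skip the header
         { label = ($2 in pop) ? pop[$2] : "STUDY"
           printf "%s\t%s\t%s\t%s\n", $2, label, $4, $5 }
        ' "$1" "$2"
}

# Example: label_mds hapmap_pops.txt mds.mds > mds_labelled.tsv
```

Colouring the points by the label column then shows which STUDY samples fall inside the CEU cluster.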
