Tutorial: Produce PCA for 1000 Genomes Phase III in VCF format
gravatar for Kevin Blighe
16 days ago by
Kevin Blighe1.3k
Republic of Ireland
Kevin Blighe1.3k wrote:

My own process for producing a PCA from the 1000 Genomes Phase III is below. In R, you'll have to sort out the plot colours yourself. The PED file contains sample ID to ethnicity mappings.

The protocol changes if you want to merge your own data to the 1000 Genomes. I've done this many times and have developed my own predictive ethnic model (>99% sensitivity).

NB - requires at least plink 1.9

Download the files as VCF.gz (and tab-indices)

mkdir /YourDir/1000Genomes/ ;

cd /YourDir/1000Genomes/ ;

prefix="http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5/chr" ;

suffix=".1kg.phase3.v5.vcf.gz" ;

for chr in {1..22} X; do
    wget $prefix$chr$suffix $prefix$chr$suffix.tbi ;

Download 1000 Genomes PED file

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped ;

Download the GRCh37 1kg reference

cd /ReferenceMaterial/1000Genomes/ ;

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz ;

gunzip human_g1k_v37.fasta.gz ;

samtools faidx human_g1k_v37.fasta ;

Convert the 1000 Genomes files to BCF

  • Ensure that indels are left-aligned and that multi-allelic calls are split (1st command)
  • Check that the ref alleles match the ref genome FASTA (2nd command)
  • Finally set the ID column to be equal to it's current value, i.e., rs ID (3rd command)

ID can also be set to something unique like chr:pos:pos:ref:alt with -I +'%CHROM:%POS:%POS:%REF:%ALT'*

 for chr in {1..22} X; do

        bcftools norm -Ou -m-any /YourDir/1000Genomes/chr$chr.1kg.phase3.v5.vcf.gz | bcftools norm -Ou -f /ReferenceMaterial/1000Genomes/human_g1k_v37.fasta | bcftools annotate -Ob -I +'%ID' > /YourDir/1000Genomes/chr$chr.1kg.phase3.v5.bcf ;

        bcftools index /YourDir/1000Genomes/chr$chr.1kg.phase3.v5.bcf ;

Convert the BCF files to PLINK format

cd /YourDir/1000Genomes/ ;

for chr in {1..22} X; do
    plink --noweb --bcf /YourDir/1000Genomes/chr$chr.1kg.phase3.v5.bcf --keep-allele-order --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out /YourDir/1000Genomes/chr$chr.1kg.phase3.v5 ;

Exclude variants not on the coding strand

NB - This step is only for microarray studies where the probes may only target one strand or the other (sense or non-sense)

Identify and remove any duplicate IDs (should not be an issue with using just the 1000 Genomes)

for chr in {1..22} X; do
    plink --noweb --bfile /YourDir/1000Genomes/chr$chr.1kg.phase3.v5 --list-duplicate-vars ;

    cut -f4 plink.dupvar | cut -f1 -d" " > tmpDups ;

    plink --noweb --bfile /YourDir/1000Genomes/chr$chr.1kg.phase3.v5 --exclude tmpDups --make-bed --out /YourDir/1000Genomes/DupsRemoved/chr$chr.1kg.phase3.v5 ;

    rm tmpDups ;

Prune variants from each chromosome

--maf 0.10, only retain SNPs with MAF greater than 10%
--indep [window size] [step size/variant count)] [Variance inflation factor (VIF) threshold]

e.g. indep 50 5 1.5, Generates a list of markers in approx. linkage equilibrium - takes 50 SNPs at a time and then shifts by 5 for the window. VIF (1/(1-r^2)) is the cut-off for linkage disequilibrium

for chr in {1..22} X; do
    plink --noweb --bfile /YourDir/1000Genomes/DupsRemoved/chr$chr.1kg.phase3.v5 --maf 0.20 --indep 50 5 1.5 --out /YourDir/1000Genomes/MAF/chr$chr.1kg.phase3.v5 ;

    plink --noweb --bfile /YourDir/1000Genomes/DupsRemoved/chr$chr.1kg.phase3.v5 --extract /YourDir/1000Genomes/MAF/chr$chr.1kg.phase3.v5.prune.in --make-bed --out /YourDir/1000Genomes/MAF/chr$chr.1kg.phase3.v5 ;

Get a list of all PLINK files

cd /YourDir/ ;

find /YourDir/ -name "*.bim" | grep -e "MAF" > /YourDir/ForMerge.list ;

sed -i 's/.bim//g' /YourDir/ForMerge.list ;

Merge all projects into a single PLINK file

plink --merge-list /YourDir/ForMerge.list --out /YourDir/Merge ;

Perform MDS (10 dimensions) and PCA

plink --bfile /YourDir/Merge --cluster --mds-plot 10

plink --bfile /YourDir/Merge --pca

Generate plots in R


options(scipen=100, digits=3)

#Read in the eigenvectors
eigen <- data.frame(read.table("plink.eigenvec", header=FALSE, skip=0, sep=" "))
rownames(eigen) <- eigen[,2]
eigen <- eigen[,3:ncol(eigen)]


#Determine the proportion of variance of each component
proportionvariances <- ((apply(eigen, 1, sd)^2) / (sum(apply(eigen, 1, sd)^2)))*100

plot(eigen[,1], eigen[,2])

legend("topleft", bty="n", cex=1.5, title="", c("African","Hispanic","East Asian","Caucasian","South Asian"), fill=c("yellow","forestgreen","grey","royalblue","black"))

Sample image

[need to manage colours and layout yourself] Screen

ADD COMMENTlink modified 11 days ago by alerodriguez40 • written 16 days ago by Kevin Blighe1.3k

Looks nice! Thanks for posting. Could I ask you to please include a PCA figure directly in your post?

ADD REPLYlink written 15 days ago by Kamil1.6k

Thanks for the suggestion. I have now added a sample figure plotting component 1 versus 2 and component 1 versus 3. The user will have to organise the colours and plot layout themselves!

ADD REPLYlink written 15 days ago by Kevin Blighe1.3k
gravatar for alerodriguez
11 days ago by
alerodriguez40 wrote:

Can you please show, how to match the sample IDs on a vcf file to IDs on a bed file?

bed 55 29 19 40 58 28 42 45
vcf 42_42 58_58 40_40 45_45 55_55 19_19 28_28 29_29


ADD COMMENTlink modified 11 days ago • written 11 days ago by alerodriguez40

That is a great question.

When you convert the VCF or BCF file to a PLINK BED file, PLINK does not always maintain the same ordering of samples as they appear in your VCF/BCF. It is good practice to always control the ordering using the --indiv-sort file parameter. You will also require a sample order file, which is just at list of the sample IDs as you would like them to be ordered:

plink --noweb --bcf MyData.bcf --keep-allele-order --indiv-sort file MySampleOrder.list --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out MyData ;

...where MySampleOrder.list is just a list of sample IDs in your order of preference:


Further information here: https://www.cog-genomics.org/plink/1.9/data#indiv_sort

ADD REPLYlink modified 10 days ago • written 10 days ago by Kevin Blighe1.3k

Thank you so much!! I ended up doing this in R, and it was a lot of work. Will follow your recommendation in the future.

ADD REPLYlink written 9 days ago by alerodriguez40

Great - glad to have helped. The next great thing to try would be to merge your own sample data (in VCF) and overlay the 1000 Genomes with them. By doing this, you an infer the ethnicity of your own data.

ADD REPLYlink written 9 days ago by Kevin Blighe1.3k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 821 users visited in the last hour