Tutorial: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format
gravatar for Kevin Blighe
10 months ago by
Kevin Blighe24k
Republic of Ireland
Kevin Blighe24k wrote:

My own process for producing a PCA bi-plot from the 1000 Genomes Phase III is below. In R, you'll have to sort out the plot colours yourself. The PED file contains sample ID to ethnicity mappings.

The protocol changes if you want to merge your own data to the 1000 Genomes. I've done this many times and have developed my own predictive ethnic model (>99% sensitivity).

NB - requires at least plink 1.9

Download the files as VCF.gz (and tab-indices)

mkdir /YourDir/1000Genomes/ ;

cd /YourDir/1000Genomes/ ;

prefix="http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5/chr" ;

suffix=".1kg.phase3.v5.vcf.gz" ;

for chr in {1..22} X; do
    wget $prefix$chr$suffix $prefix$chr$suffix.tbi ;

Download 1000 Genomes PED file

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped ;

Download the GRCh37 1kg reference

cd /ReferenceMaterial/1000Genomes/ ;

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz ;

gunzip human_g1k_v37.fasta.gz ;

samtools faidx human_g1k_v37.fasta ;

Convert the 1000 Genomes files to BCF

  • Ensure that indels are left-aligned and that multi-allelic calls are split (1st command)
  • Check that the ref alleles match the ref genome FASTA (2nd command)
  • Finally set the ID column to be equal to it's current value, i.e., rs ID (3rd command)

ID can also be set to something unique like chr:pos:pos:ref:alt with -I +'%CHROM:%POS:%POS:%REF:%ALT'*

 for chr in {1..22} X; do

        bcftools norm -Ou -m-any /YourDir/1000Genomes/chr$chr.1kg.phase3.v5.vcf.gz | bcftools norm -Ou --check-ref w -f /ReferenceMaterial/1000Genomes/human_g1k_v37.fasta | bcftools annotate -Ob -I +'%ID' > /YourDir/1000Genomes/chr$chr.1kg.phase3.v5.bcf ;

        bcftools index /YourDir/1000Genomes/chr$chr.1kg.phase3.v5.bcf ;

Convert the BCF files to PLINK format

cd /YourDir/1000Genomes/ ;

for chr in {1..22} X; do
    plink --noweb --bcf /YourDir/1000Genomes/chr$chr.1kg.phase3.v5.bcf --keep-allele-order --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out /YourDir/1000Genomes/chr$chr.1kg.phase3.v5 ;

Exclude variants not on the coding strand

NB - This step is only for microarray studies where the probes may only target one strand or the other (sense or non-sense)

Identify and remove any duplicate IDs (should not be an issue with using just the 1000 Genomes)

for chr in {1..22} X; do
    plink --noweb --bfile /YourDir/1000Genomes/chr$chr.1kg.phase3.v5 --list-duplicate-vars ;

    cut -f4 plink.dupvar | cut -f1 -d" " > tmpDups ;

    plink --noweb --bfile /YourDir/1000Genomes/chr$chr.1kg.phase3.v5 --exclude tmpDups --make-bed --out /YourDir/1000Genomes/DupsRemoved/chr$chr.1kg.phase3.v5 ;

    rm tmpDups ;

Prune variants from each chromosome

--maf 0.10, only retain SNPs with MAF greater than 10%
--indep [window size] [step size/variant count)] [Variance inflation factor (VIF) threshold]

e.g. indep 50 5 1.5, Generates a list of markers in approx. linkage equilibrium - takes 50 SNPs at a time and then shifts by 5 for the window. VIF (1/(1-r^2)) is the cut-off for linkage disequilibrium

for chr in {1..22} X; do
    plink --noweb --bfile /YourDir/1000Genomes/DupsRemoved/chr$chr.1kg.phase3.v5 --maf 0.20 --indep 50 5 1.5 --out /YourDir/1000Genomes/MAF/chr$chr.1kg.phase3.v5 ;

    plink --noweb --bfile /YourDir/1000Genomes/DupsRemoved/chr$chr.1kg.phase3.v5 --extract /YourDir/1000Genomes/MAF/chr$chr.1kg.phase3.v5.prune.in --make-bed --out /YourDir/1000Genomes/MAF/chr$chr.1kg.phase3.v5 ;

Get a list of all PLINK files

cd /YourDir/ ;

find /YourDir/ -name "*.bim" | grep -e "MAF" > /YourDir/ForMerge.list ;

sed -i 's/.bim//g' /YourDir/ForMerge.list ;

Merge all projects into a single PLINK file

plink --merge-list /YourDir/ForMerge.list --out /YourDir/Merge ;

Perform MDS (10 dimensions) and PCA

plink --bfile /YourDir/Merge --cluster --mds-plot 10

plink --bfile /YourDir/Merge --pca

Generate plots in R


options(scipen=100, digits=3)

#Read in the eigenvectors
eigen <- data.frame(read.table("plink.eigenvec", header=FALSE, skip=0, sep=" "))
rownames(eigen) <- eigen[,2]
eigen <- eigen[,3:ncol(eigen)]


#Determine the proportion of variance of each component
proportionvariances <- ((apply(eigen, 1, sd)^2) / (sum(apply(eigen, 1, sd)^2)))*100

plot(eigen[,1], eigen[,2])

legend("topleft", bty="n", cex=1.5, title="", c("African","Hispanic","East Asian","Caucasian","South Asian"), fill=c("yellow","forestgreen","grey","royalblue","black"))

Sample image

[need to manage colours and layout yourself] Screen

plink pca tutorial 1000 genomes vcf • 2.9k views
ADD COMMENTlink modified 4 days ago • written 10 months ago by Kevin Blighe24k

Can you please show, how to match the sample IDs on a vcf file to IDs on a bed file?

bed 55 29 19 40 58 28 42 45
vcf 42_42 58_58 40_40 45_45 55_55 19_19 28_28 29_29


ADD REPLYlink modified 10 months ago • written 10 months ago by DataFanatic100

That is a great question.

When you convert the VCF or BCF file to a PLINK BED file, PLINK does not always maintain the same ordering of samples as they appear in your VCF/BCF. It is good practice to always control the ordering using the --indiv-sort file parameter. You will also require a sample order file, which is just at list of the sample IDs as you would like them to be ordered:

plink --noweb --bcf MyData.bcf --keep-allele-order --indiv-sort file MySampleOrder.list --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out MyData ;

...where MySampleOrder.list is just a list of sample IDs in your order of preference:


Further information here: https://www.cog-genomics.org/plink/1.9/data#indiv_sort

ADD REPLYlink modified 10 months ago • written 10 months ago by Kevin Blighe24k

Thank you so much!! I ended up doing this in R, and it was a lot of work. Will follow your recommendation in the future.

ADD REPLYlink written 10 months ago by DataFanatic100

Great - glad to have helped. The next great thing to try would be to merge your own sample data (in VCF) and overlay the 1000 Genomes with them. By doing this, you an infer the ethnicity of your own data.

ADD REPLYlink written 10 months ago by Kevin Blighe24k

Looks nice! Thanks for posting. Could I ask you to please include a PCA figure directly in your post?

ADD REPLYlink written 10 months ago by Kamil1.8k

Thanks for the suggestion. I have now added a sample figure plotting component 1 versus 2 and component 1 versus 3. The user will have to organise the colours and plot layout themselves!

ADD REPLYlink written 10 months ago by Kevin Blighe24k

Hi, May I ask please, why is it necessary to filter by --maf 0.10, only retain SNPs with MAF greater than 10%? If you could point me to a good literary source I'd be grateful. I'm working with GTEx data and need to have PCs to be used as covariates in linear models. Thanks!

ADD REPLYlink written 6 months ago by DataFanatic100

If we just retain variants with low MAF, then every population (and sample) will be different because, by their very nature, low frequency variants are only present in a low number of samples. The algorithm 'fails' to find adequate relationships and then every sample is segregated from every other sample.

We want variants that are 'common' enough such that they are present across all groups, and we are interested in seeing how their genotypes change. Thus, the algorithm can find relationships and the different population groups become grouped together.

ADD REPLYlink written 6 months ago by Kevin Blighe24k

This was very helpful, thank you very much!

ADD REPLYlink written 6 months ago by DataFanatic100

How should one make use of the proportionvariances variable here? It seems that each sample has a corresponding value how can one summarize/interpret this result? Is it possible to plot? I have 200 samples.

ADD REPLYlink written 6 months ago by DataFanatic100

Thank you for the tutorial and it is really helpful. I would like to merge my data (in plink format) with the 1000 Genomes data and plot PCA. Should I merge the data after the plink format generated for 1000G? I would be grateful for your advice on this.

ADD REPLYlink written 3 months ago by lw53710

Hello, you're welcome. Previously, I have merged them after they have been converted to PLINK format and also after I had done some 'pruning' of variants independently on each dataset. I do not see why merging them prior to PLINK would make much difference; however, there would be differences when performing the pruning steps due to the allele frequency cut-offs.

It's certainly not a standardised pipeline, so, there are various ways of doing it.

ADD REPLYlink written 3 months ago by Kevin Blighe24k

Thank you very much for your advice.

ADD REPLYlink written 3 months ago by lw53710

Thanks for the great tutorial, it was very helpful!

Since I have microarray data (Illumina) that I need to analyze: could you give any more information how you would approach the step Exclude variants not on the coding strand?

ADD REPLYlink written 12 weeks ago by Alex S0

I see! In that case, what you need to do is the following:

  1. download the probe information for your microarray (you'll find it on the vendor's website, usually in TSV format)
  2. obtain a list (as a single column text file) of probes that are genotyped on the coding strand (+)
  3. use this list to filter include only these probes in your dataset. This can be done with the --extract PLINK command line parameter (see more info Here)
  4. proceed as normal from there


ADD REPLYlink written 12 weeks ago by Kevin Blighe24k

Thanks for your quick reply!

I would have one more question if you would humor me - since I am still kind of coping with the basics, i'm afraid it's a little basic.

Why do you constrain the data only to the coding strand? In Illumina GSA this would be entries with "+" in the 'RefStrand' column in the Manifest file? There is also the "IlmnStrand" column, which has TOP, BOT, PLUS, MINUS as values. As far as I understand it, the Illumina annotations for TOP, BOT strand (https://www.illumina.com/documents/products/technotes/technote_topbot.pdf) the RefStrand together with TOP, BOT indicate only with respect to which strand the SNP is given. Wouldn't I be throwing out more than half the SNPs in this way?

ADD REPLYlink written 12 weeks ago by Alex S0

I don't know of the specifics of the illumina annotation, i.e., which exact column it is - sorry. Each vendor produces different files. The reason why we do this is purely to match with the 1000 Genomes data, whose variants are all reported on the 'forward' strand. For your typical GWAS, you would not obviously throw away half of your dataset.

I put 'forward' in apostrophes here because there is confusion, generally, about how to interpret forward in relation to coding and non-coding.

Here is what it actually says on the 1000 Genomes pages:

All the variants in both our VCF files and on the browser are always reported on the forward strand.

Note the discrepancies, though: https://www.cell.com/trends/genetics/pdf/S0168-9525(12)00070-4.pdf

A useful idea is to try it with the non-coding strand variants included, and I'm sure that the results will be entirely unexpected when you view your PCA bi-plot

ADD REPLYlink written 12 weeks ago by Kevin Blighe24k

Ha, the paper is an amazing reference for this topic, thanks for your help!

By the way, I am using Snakemake to adapt/implement your tutorial above - I don't think you are, but if you're doing these workflows without using any workflow tool, check it out: http://snakemake.readthedocs.io/en/stable/tutorial/basics.html

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by Alex S0

Thanks - I'm a computer scientist and biologist who has transitioned into bioinformatics. I have not used Snakemake but understand that it's popular!

ADD REPLYlink written 12 weeks ago by Kevin Blighe24k

Hi again Kevin, I have yet another question: On the final merge step (using data from the 1000g GRCH38 liftover release) I encountered an error due to multiallelic sites

Error: 7650 variants with 3+ alleles present

I tried

  • filtering to biallelic SNPs only in bcftools before anything in plink (using bcftools view -m2 -M2 -v snps file.vcf.gz)
  • Trying flip as explained in the plink manual
  • Trying to exclude those sites also as explained in the plink manual

Only the final step worked to get me a merge.

Has this ever occurred to you? Any idea why the bcf filtering could leave residual multiallelic?

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Alex S0

Hello again. Yes, this usually occurs. I probably should have added a note to the tutorial but I believe that I was at the limit in terms of characters per single post (6,000).

From my experience, the final step in your list is also the one that works. You could split these multi-allelic sites into multiple records with bcftools norm -m-any, but then you'd have another issue of having duplicate records, provided you used dbSNP rs IDs as the ID field. Note also that the same rs ID can refer to different sites (yes, this is true and dbSNP is aware of it).

A 'better' approach is to set the IF field in your VCF to a truly unique value by first splitting multi-allelic sites and then using bcftools annotate to set a unique value (to the ID field).

ADD REPLYlink written 11 weeks ago by Kevin Blighe24k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 477 users visited in the last hour