Tutorial: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2
28
gravatar for Kevin Blighe
14 months ago by
Kevin Blighe51k
Kevin Blighe51k wrote:

Note1 - Previous version: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format (old)

Note2 - this data is for hg19 / GRCh37


The tutorial has been updated based on the 1000 Genomes Phase III imputed genotypes. The original tutorial was performed on non-imputed data held at the University of Washington, which is no longer accessible.

Other changes:

  • tutorial now entirely streamlined - all commands, including in R, are now included
  • duplicate variants are now removed with BCFtools, not PLINK (previous Step 6 removed)
  • now only performs PCA (originally, MDS was also performed but never used)
  • no longer using chrX variants (only autosomal variants)
  • new Step 3, indicating how to download the 1000 Genomes GRCh37 reference build

Program requirements:

  • plink > v1.9
  • BCFtools (tested on v 1.3)

Disk space requirements:

  • downloaded data (VCF.gz and tab-indices), ~ 15.5 GB
  • converted BCF files and their indices, ~14 GB
  • binary PLINK files, ~53 GB
  • pruned PLINK binary files, ~ <1 Gb

1, Download the files as VCF.gz (and tab-indices)

prefix="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr" ;

suffix=".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz" ;

for chr in {1..22}; do
    wget "${prefix}""${chr}""${suffix}" "${prefix}""${chr}""${suffix}".tbi ;
done

2, Download 1000 Genomes PED file

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped ;

3, Download the GRCh37 / hg19 reference genome

wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz ;

wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai ;

gunzip human_g1k_v37.fasta.gz ;

4, Convert the 1000 Genomes files to BCF

  • Ensure that multi-allelic calls are split and that indels are left-aligned compared to reference genome (1st pipe)
  • Sets the ID field to a unique value: CHROM:POS:REF:ALT (2nd pipe)
  • Removes duplicates (3rd pipe)

-I +'%CHROM:%POS:%REF:%ALT' means that unset IDs will be set to CHROM:POS:REF:ALT

-x ID -I +'%CHROM:%POS:%REF:%ALT' first erases the current ID and then sets it to CHROM:POS:REF:ALT

for chr in {1..22}; do
    bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \
    ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \

    bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \

    bcftools norm -Ob --rm-dup both \
    > ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;

    bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;
done

5, Convert the BCF files to PLINK format

for chr in {1..22}; do
    plink --noweb --bcf ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf \
    --keep-allele-order --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed \
    --out ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes ;
done

6, Exclude variants not on the coding strand

NB - This step is only for microarray studies where the probes may only target one strand or the other (sense or non-sense)

7, Prune variants from each chromosome

--maf 0.10, only retain SNPs with MAF greater than 10%
--indep [window size] [step size/variant count)] [Variance inflation factor (VIF) threshold]

e.g. indep 50 5 1.5, Generates a list of markers in approx. linkage equilibrium - takes 50 SNPs at a time and then shifts by 5 for the window. VIF (1/(1-r^2)) is the cut-off for linkage disequilibrium

mkdir Pruned ;

for chr in {1..22}; do
    plink --noweb --bfile ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes \
    --maf 0.10 --indep 50 5 1.5 \
    --out Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes ;

    plink --noweb --bfile ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes \
    --extract Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.prune.in --make-bed \
    --out Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes ;
done

8, Get a list of all PLINK files

find . -name "*.bim" | grep -e "Pruned" > ForMerge.list ;

sed -i 's/.bim//g' ForMerge.list ;

9, Merge all projects into a single PLINK file

plink --merge-list ForMerge.list --out Merge ;

10, Perform PCA

plink --bfile Merge --pca

11, Generate plots in R

R

options(scipen=100, digits=3)

# read in the eigenvectors, produced in PLINK
eigenvec <- read.table("plink.eigenvec", header = FALSE, skip=0, sep = " ")
rownames(eigenvec) <- eigenvec[,2]
eigenvec <- eigenvec[,3:ncol(eigenvec)]
colnames(eigenvec) <- paste("Principal Component ", c(1:20), sep = "")

# read in the PED data
PED <- read.table("20130606_g1k.ped", header = TRUE, skip = 0, sep = "\t")
PED <- PED[which(PED$Individual.ID %in% rownames(eigenvec)), ]
PED <- PED[match(rownames(eigenvec), PED$Individual.ID),]
all(PED$Individual.ID == rownames(eigenvec)) == TRUE
[1] TRUE

# set colours
require("RColorBrewer")

# from: http://www.internationalgenome.org/category/population/
PED$Population <- factor(PED$Population, levels=c(
        "ACB","ASW","ESN","GWD","LWK","MSL","YRI",
        "CLM","MXL","PEL","PUR",
        "CDX","CHB","CHS","JPT","KHV",
        "CEU","FIN","GBR","IBS","TSI",
        "BEB","GIH","ITU","PJL","STU"))

col <- colorRampPalette(c(
        "yellow","yellow","yellow","yellow","yellow","yellow","yellow",
        "forestgreen","forestgreen","forestgreen","forestgreen",
        "grey","grey","grey","grey","grey",
        "royalblue","royalblue","royalblue","royalblue","royalblue",
        "black","black","black","black","black"))(length(unique(PED$Population)))[factor(PED$Population)]

# generate PCA bi-plots
project.pca <- eigenvec
summary(project.pca)


par(mar=c(5,5,5,5), cex=2.0, cex.main=7, cex.axis=2.75, cex.lab=2.75, mfrow=c(1,2))

plot(project.pca[,1], project.pca[,2], type="n", main="A", adj=0.5, xlab="First component", ylab="Second component", font=2, font.lab=2)
points(project.pca[,1], project.pca[,2], col=col, pch=20, cex=2.25)
legend("bottomright", bty="n", cex=3.0, title="", c("African","Hispanic","East-Asian","Caucasian","South Asian"), fill=c("yellow","forestgreen","grey","royalblue","black"))

plot(project.pca[,1], project.pca[,3], type="n", main="B", adj=0.5, xlab="First component", ylab="Third component", font=2, font.lab=2)
points(project.pca[,1], project.pca[,3], col=col, pch=20, cex=2.25)

biplot

Kevin

ADD COMMENTlink modified 5 weeks ago • written 14 months ago by Kevin Blighe51k

Hi Kelvin, thank you very much for your tutorial. It helps me a lot to plot the PCA-bi plot of 1kg Phrase III samples.

My question is, if I would like to plot my own sample's PCA with 1kg sample's PCA (My samples were genotyped by SNP array. There were around 700k SNPs. I want to check whether my samples came from certain ethical group), shall I directly combine the eigenvec calculated from my sample and eigenvec got from your step 10 and then plot? Can the two eigen vectors comparable?

I ask this question because in my understanding, PC is the linear combination of independent variables. In my sample, PCs are the linear combination of the pruned genotyped SNPs whereas in 1kg samples, their PCs are the linear combination of other pruned (and much more) SNPs. I doubt whether I can directly compare the PCs. If I can't compare my sample's PCs with 1kg's in this way, how can I compare the ethical group of my sample to the reference 1kg samples?

Thank you very much for your possible reply in advance.

ADD REPLYlink written 13 months ago by Sunshine n Rain20

Hey, my apologies for I seem to have missed this. It is better to combine them within PLINK and then perform PCA on the combined dataset. I have done this many times and it usually works, in the sense that sample ethnicities line up with the expected group from 1000 Genomes.

ADD REPLYlink written 12 months ago by Kevin Blighe51k

Hi Kevin,

Does the following command do a liftover of 1000 genomes to Hg19. I need to do QC on 1000 genomes data before using for LD clumping and as I haven't done this before, I'm a bit unsure. I followed your steps until this point and then proceeded to convert to plink, removed SNPs based on call rate and HWE and will then proceed to use them for LD clumping

for chr in {1..22}; do bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \ ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \

bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' |

bcftools norm -Ob --rm-dup both \
> ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;

bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;

done

Thanks

ADD REPLYlink modified 12 months ago • written 12 months ago by rkl0

Hey, I am not sure what you mean by 'liftover' in this context (?). If you want to do LD clumping, then you just need to complete steps 1-5. This will take ~1 day, though. The imputed 1000 Genomes data is 'quite' dense.

ADD REPLYlink written 12 months ago by Kevin Blighe51k

The commands that you have pasted into your comment do the following:

  • bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta - checks that the REF variants in the 1000 Genomes VCF files corresponds to the base in the reference genome, human_g1k_v37.fasta.
  • bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' - annotates the ID field in the 1000 Genomes VCF files to be, for example, chr1:12345645:A:T
  • bcftools norm -Ob --rm-dup both - the 1000 Genomes data contains duplicate variants, which have to be removed for PLINK
ADD REPLYlink written 12 months ago by Kevin Blighe51k

Hi Kelvin, thank you very much for your tutorial. It really helped me a lot.

I am trying to check that the 89 individuals I am working with are of European origin by comparing them with 1000 Genomes dataset. However, the data I have for my samples was obtained from targeted sequencing of 1,293 regions. My main concern is whether I have to extract only SNPs falling inside my targeted regions from the 1000 Genomes dataset before merging with my 89 samples or, whether I should use all the information from 1000 Genomes.

Thank you for your possible reply in advance.

ADD REPLYlink written 11 months ago by mpinsach0

Hey, I previously achieved this by building a predictive model from the 1000 Genomes Data and then applying it to my own data, which was also targeted sequencing. See here: A: How to predict individual ethnicity information by using hapmap data

The other way to do it is filter the datasets so that they have the same starting list of variants, and then proceed with the tutorial from part 7. For this to work, it would help if your variants are all labeled by rs ID or as chr:bp:ref:var

ADD REPLYlink written 11 months ago by Kevin Blighe51k

Thanks so much for this tutorial, Kevin !!

ADD REPLYlink written 9 months ago by Mr Locuace90

¡De nada!, Mr Locuace

ADD REPLYlink written 9 months ago by Kevin Blighe51k

Hey Kelvin,

How to plot sub-population ("ACB","ASW","ESN","GWD","LWK","MSL","YRI", "CLM","MXL","PEL","PUR", "CDX","CHB","CHS","JPT","KHV", "CEU","FIN","GBR","IBS","TSI", "BEB","GIH","ITU","PJL","STU") in pca plot instead of "African","Hispanic","East-Asian","Caucasian","South Asian"?

ADD REPLYlink written 6 months ago by Abdul Rafay Khan1.1k
1

Hey, you will simply have to modify the input colour vector. You will have to take great care to ensure that the colours line up to the data that is being plotted.

ADD REPLYlink written 6 months ago by Kevin Blighe51k

Hi Kevin, Just a small typo, it's actually:

gunzip human_g1k_v37.fasta.gz ;

ADD REPLYlink written 12 weeks ago by Raony Guimarães1.1k

Thanks! / Obrigado!

ADD REPLYlink written 11 weeks ago by Kevin Blighe51k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1843 users visited in the last hour