Tutorial:Produce PCA bi-plot for 1000 Genomes Phase III - Version 2
0
70
Entering edit mode
2.9 years ago

Note1 - Previous version: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format (old)

Note2 - this data is for hg19 / GRCh37

Note3 - GRCh38 data is available HERE

The tutorial has been updated based on the 1000 Genomes Phase III imputed genotypes. The original tutorial was performed on non-imputed data held at the University of Washington, which is no longer accessible.

Other changes:

• tutorial now entirely streamlined - all commands, including in R, are now included
• duplicate variants are now removed with BCFtools, not PLINK (previous Step 6 removed)
• now only performs PCA (originally, MDS was also performed but never used)
• no longer using chrX variants (only autosomal variants)
• new Step 3, indicating how to download the 1000 Genomes GRCh37 reference build

Program requirements:

• plink > v1.9
• BCFtools (tested on v 1.3)

Disk space requirements:

• downloaded data (VCF.gz and tab-indices), ~ 15.5 GB
• converted BCF files and their indices, ~14 GB
• binary PLINK files, ~53 GB
• pruned PLINK binary files, ~ <1 Gb

# 1, Download the files as VCF.gz (and tab-indices)

prefix="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr" ;

suffix=".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz" ;

for chr in {1..22}; do
wget "${prefix}""${chr}""${suffix}" "${prefix}""${chr}""${suffix}".tbi ;
done


# 2, Download 1000 Genomes PED file

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped ;


# 3, Download the GRCh37 / hg19 reference genome

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz ;

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai ;

gunzip human_g1k_v37.fasta.gz ;


NB - if wget is not working, try curl:

curl ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz -O human_g1k_v37.fasta.gz


# 4, Convert the 1000 Genomes files to BCF

• Ensure that multi-allelic calls are split and that indels are left-aligned compared to reference genome (1st pipe)
• Sets the ID field to a unique value: CHROM:POS:REF:ALT (2nd pipe)
• Removes duplicates (3rd pipe)

-I +'%CHROM:%POS:%REF:%ALT' means that unset IDs will be set to CHROM:POS:REF:ALT

-x ID -I +'%CHROM:%POS:%REF:%ALT' first erases the current ID and then sets it to CHROM:POS:REF:ALT

for chr in {1..22}; do
bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \
ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz | \ bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \ bcftools norm -Ob --rm-dup both \ > ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.bcf ;

bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.bcf ; done  # 5, Convert the BCF files to PLINK format for chr in {1..22}; do plink --noweb \ --bcf ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.bcf \
--keep-allele-order \
--vcf-idspace-to _ \
--const-fid \
--allow-extra-chr 0 \
--split-x b37 no-fail \
--make-bed \
--out ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes ; done  # 6, Exclude variants not on the coding strand NB - This step is only for microarray studies where the probes may only target one strand or the other (sense or non-sense) # 7, Prune variants from each chromosome --maf 0.10, only retain SNPs with MAF greater than 10% --indep [window size] [step size/variant count)] [Variance inflation factor (VIF) threshold]  e.g. indep 50 5 1.5, Generates a list of markers in approx. linkage equilibrium - takes 50 SNPs at a time and then shifts by 5 for the window. VIF (1/(1-r^2)) is the cut-off for linkage disequilibrium mkdir Pruned ; for chr in {1..22}; do plink --noweb \ --bfile ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes \
--maf 0.10 --indep 50 5 1.5 \
--out Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes ; plink --noweb \ --bfile ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes \
--extract Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.prune.in \ --make-bed \ --out Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes ;
done


# 8, Get a list of all PLINK files

find . -name "*.bim" | grep -e "Pruned" > ForMerge.list ;

sed -i 's/.bim//g' ForMerge.list ;


# 9, Merge all projects into a single PLINK file

plink --merge-list ForMerge.list --out Merge ;


## NB - if you have your own data that you want to merge with 1000 Genomes

Process the 1000 Genomes data as per this tutorial from Steps 1-9. In this way, you will have already identified the population-specific variants / markers. Then, after Step 9, do

• find common variants between your dataset and the merged 1000 Genomes dataset (and filter both for these common variants)
• merge the 1000 Genomes data with your own data
• proceed to Step 10

Depending on its size, your own dataset may be divided by chromosome; so, you may have to do some pre-processing before aligning to 1000 Genomes. Either way, the population specific markers will be defined by just the 1000 Genomes dataset (Step 7). If your dataset is microarray, you'll have to pre-filter it for coding (plus / +) strand variants (Step 6).

# 10, Perform PCA

plink --bfile Merge --pca


# 11, Generate plots in R

R

options(scipen=100, digits=3)

# read in the eigenvectors, produced in PLINK
eigenvec <- read.table('plink.eigenvec', header = FALSE, skip=0, sep = ' ')
rownames(eigenvec) <- eigenvec[,2]
eigenvec <- eigenvec[,3:ncol(eigenvec)]
colnames(eigenvec) <- paste('Principal Component ', c(1:20), sep = '')

# read in the PED data
PED <- read.table('20130606_g1k.ped', header = TRUE, skip = 0, sep = '\t')
PED <- PED[which(PED$Individual.ID %in% rownames(eigenvec)), ] PED <- PED[match(rownames(eigenvec), PED$Individual.ID),]
all(PED$Individual.ID == rownames(eigenvec)) == TRUE [1] TRUE # set colours require('RColorBrewer') # from: http://www.internationalgenome.org/category/population/ PED$Population <- factor(PED$Population, levels=c( "ACB","ASW","ESN","GWD","LWK","MSL","YRI", "CLM","MXL","PEL","PUR", "CDX","CHB","CHS","JPT","KHV", "CEU","FIN","GBR","IBS","TSI", "BEB","GIH","ITU","PJL","STU")) col <- colorRampPalette(c( "yellow","yellow","yellow","yellow","yellow","yellow","yellow", "forestgreen","forestgreen","forestgreen","forestgreen", "grey","grey","grey","grey","grey", "royalblue","royalblue","royalblue","royalblue","royalblue", "black","black","black","black","black"))(length(unique(PED$Population)))[factor(PED$Population)] # generate PCA bi-plots project.pca <- eigenvec summary(project.pca) par(mar = c(5,5,5,5), cex = 2.0, cex.main = 7, cex.axis = 2.75, cex.lab = 2.75, mfrow = c(1,2)) plot(project.pca[,1], project.pca[,2], type = 'n', main = 'A', adj = 0.5, xlab = 'First component', ylab = 'Second component', font = 2, font.lab = 2) points(project.pca[,1], project.pca[,2], col = col, pch = 20, cex = 2.25) legend('bottomright', bty = 'n', cex = 3.0, title = '', c('Population 1', 'Population 2', 'Population 3', 'Population 4', 'Population 5'), fill = c('yellow', 'forestgreen', 'grey', 'royalblue', 'black')) plot(project.pca[,1], project.pca[,3], type="n", main="B", adj=0.5, xlab="First component", ylab="Third component", font=2, font.lab=2) points(project.pca[,1], project.pca[,3], col=col, pch=20, cex=2.25)  Kevin Tutorial 1000 genomes PCA PLINK • 19k views ADD COMMENT 0 Entering edit mode Hi Kelvin, thank you very much for your tutorial. It helps me a lot to plot the PCA-bi plot of 1kg Phrase III samples. My question is, if I would like to plot my own sample's PCA with 1kg sample's PCA (My samples were genotyped by SNP array. There were around 700k SNPs. I want to check whether my samples came from certain ethical group), shall I directly combine the eigenvec calculated from my sample and eigenvec got from your step 10 and then plot? Can the two eigen vectors comparable? I ask this question because in my understanding, PC is the linear combination of independent variables. In my sample, PCs are the linear combination of the pruned genotyped SNPs whereas in 1kg samples, their PCs are the linear combination of other pruned (and much more) SNPs. I doubt whether I can directly compare the PCs. If I can't compare my sample's PCs with 1kg's in this way, how can I compare the ethical group of my sample to the reference 1kg samples? Thank you very much for your possible reply in advance. ADD REPLY 0 Entering edit mode Hey, my apologies for I seem to have missed this. It is better to combine them within PLINK and then perform PCA on the combined dataset. I have done this many times and it usually works, in the sense that sample ethnicities line up with the expected group from 1000 Genomes. ADD REPLY 0 Entering edit mode Hi Kevin, Does the following command do a liftover of 1000 genomes to Hg19. I need to do QC on 1000 genomes data before using for LD clumping and as I haven't done this before, I'm a bit unsure. I followed your steps until this point and then proceeded to convert to plink, removed SNPs based on call rate and HWE and will then proceed to use them for LD clumping for chr in {1..22}; do bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \ ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \

bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' |

bcftools norm -Ob --rm-dup both \
> ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ; bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;


done

Thanks

0
Entering edit mode

Hey, I am not sure what you mean by 'liftover' in this context (?). If you want to do LD clumping, then you just need to complete steps 1-5. This will take ~1 day, though. The imputed 1000 Genomes data is 'quite' dense.

0
Entering edit mode

The commands that you have pasted into your comment do the following:

• bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta - checks that the REF variants in the 1000 Genomes VCF files corresponds to the base in the reference genome, human_g1k_v37.fasta.
• bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' - annotates the ID field in the 1000 Genomes VCF files to be, for example, chr1:12345645:A:T
• bcftools norm -Ob --rm-dup both - the 1000 Genomes data contains duplicate variants, which have to be removed for PLINK
0
Entering edit mode

Hi Kelvin, thank you very much for your tutorial. It really helped me a lot.

I am trying to check that the 89 individuals I am working with are of European origin by comparing them with 1000 Genomes dataset. However, the data I have for my samples was obtained from targeted sequencing of 1,293 regions. My main concern is whether I have to extract only SNPs falling inside my targeted regions from the 1000 Genomes dataset before merging with my 89 samples or, whether I should use all the information from 1000 Genomes.

0
Entering edit mode

Hey, I previously achieved this by building a predictive model from the 1000 Genomes Data and then applying it to my own data, which was also targeted sequencing. See here: A: How to predict individual ethnicity information by using hapmap data

The other way to do it is filter the datasets so that they have the same starting list of variants, and then proceed with the tutorial from part 7. For this to work, it would help if your variants are all labeled by rs ID or as chr:bp:ref:var

0
Entering edit mode

Thanks so much for this tutorial, Kevin !!

0
Entering edit mode

¡De nada!, Mr Locuace

0
Entering edit mode

Hey Kelvin,

How to plot sub-population ("ACB","ASW","ESN","GWD","LWK","MSL","YRI", "CLM","MXL","PEL","PUR", "CDX","CHB","CHS","JPT","KHV", "CEU","FIN","GBR","IBS","TSI", "BEB","GIH","ITU","PJL","STU") in pca plot instead of "African","Hispanic","East-Asian","Caucasian","South Asian"?

1
Entering edit mode

Hey, you will simply have to modify the input colour vector. You will have to take great care to ensure that the colours line up to the data that is being plotted.

0
Entering edit mode

Hi Kevin, Just a small typo, it's actually:

gunzip human_g1k_v37.fasta.gz ;

0
Entering edit mode

0
Entering edit mode

thanks alot for this tutorial i have a vcf data file (my data)and i want to do pca how do i do that? i apologise for a naive question but i am a beginner and badly struck

1
Entering edit mode

With your own data, your first step should be to import it to PLINK format, and, from there, you can perform PCA immediately. In order to obtain 'population informative' variants, though, you will have to do some of the filtering steps as I do in this tutorial. Basically, you can follow the same tutorial as here.

If you want to integrate your data with the 1000 Genomes data, then it can be a bit more difficult.

0
Entering edit mode

Thanks alot kevin I dont have any fasta file just a vcf (freebayes) and i want to perform pca to check out similarity of my sample with other ethnic groups do i need to have a fasta file of my data?

1
Entering edit mode

Hello. No, you can start with the VCF. Your first task should be to convert this VCF to PLINK format and ensure that you have retained sample information in the PLINK files. Do you know how to do this? Are all of your samples from the same ethnic group?

0
Entering edit mode

Hello kevin I am using bcftools to normalize and annotate variants. link to my data a vcf of 10mb When i try to convert the data in bcf like the third step of this wonderful tutorial. I get an error the sequence chrM not found. The error message indicates that there are some variants on the mitochondrial chromosome but that this chromosome is either missing from your reference or annotated differently.(I am using the GRCh37 / hg19 fasta file).I have only one sample till now and the coming samples will have the same ethnicity. I want to perform PCA and see the similarity of my sample with the known ("African","Hispanic","East-Asian","Caucasian","South Asian"). can you please suggest me what to I apolosize for asking a naive question(I have just started out in this field) and my english is a so not very good so thanks alot for cooperating and helping me out.

0
Entering edit mode

Hello, you just need to remove the chrM variants from your input file. You can do this indirectly or directly in many different ways, such as: A: How to extract specific chromosome from vcf file

0
Entering edit mode

Hello kevin THANKS A LOT !!!

I remove the chroM varians from my data using the grep -w '^#|^#CHROM|^chr[1-22]' my.vcf > my_new.vcf (I hope it keeps the header and the 22 chr I need) as other methods were either giving index error or header error.

but now I get an error

Now I am getting an error that sequence "chr1" not found

I want to get the pca plot of the variants but i am getting badly struck and thanks a lot for helping me

0
Entering edit mode

Perhaps the header has been lost. Can you provide the output of

bcftools view -h my_new.vcf


?

For parsing VCFs, I usually use AWK. Using grep, I would have used something like this, I think:

grep -e "^chrM" -v my.vcf

0
Entering edit mode

Hello kevin i used an editor named as em editor to change the vcf file since the annotation in the fasta file and my vcf file was different i changed my vcf file accordingly chr1 became 1 chr 2 became 2 and so on and i removed the chrM from my data, by following this method might have preserved the header the result of bcftools view -h my_new.vcf .

0
Entering edit mode

It seems that chrM is no longer present in the file. However, your contig names are just 1, 2, 3, 4, but the BCFtools norm command is searching for 'chr1', 'chr2', etc. It is looking for these due to the fact that these are the contig names used in human_g1k_v37.fasta.

To be honest, the use of this part is not necessary: --check-ref w -f human_g1k_v37.fasta You can just exclude it and use:

bcftools norm -m-any data.vcf \
...


What are the contig names in the 1000 Genomes data that downloaded? - you can use the BCFtools view -h command again.

0
Entering edit mode

Hello

   bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \
data.vcf | \

bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \

bcftools norm -Ob --rm-dup both \
> data.bcf ;


result : chrM not found error After removing the chrM the same command gave :chr1 not found error Then i edited the vcf file changed the chr1 into 1 and so on I now get the error

I have tried everything i altered my vcf according to the fast but still i am not able to get the bcf file. Is there another way to do ethnicity regagrding pca?

THANKS ALOT ! I wish for your best health keep safe

0
Entering edit mode

So I have some (probably hispanic descent) tumor WGS samples in red dots on my PCA plot projected onto 1kg data like in this tutorial (btw thank you Kevin Blighe for pointing me to this tutorial), and I am wondering how closely can we expect unknown samples to follow any given 1kg clusters. My red dots, although we believe they are hispanic, seem to form a group all of their own, which could be because they all have the same cancer, but I would have thought they should cluster more closely with the AMR super-population, no?

Is there a way to verify how accurate my PC projection is?

0
Entering edit mode

Hey, looks like you at least got through the tutorial successfully - it can be difficult, particularly when overlaying your own samples. From my experience, and I have done this a good few times, one's own samples can align quite well with the 1KG samples, once you have done the preparatory work correctly.

The AMR (Ad Mixed American) group is clearly the most diverse in the cohort, but it is sampling from quite a diverse pool:

• MXL Mexican Ancestry from Los Angeles USA
• PUR Puerto Ricans from Puerto Rico
• CLM Colombians from Medellin, Colombia
• PEL Peruvians from Lima, Peru

[ https://www.internationalgenome.org/category/population/ ]

There's a potential that those (your) samples could be verging into the African group, too. I have also seen Brazilian samples group in that region, previously.

If I were you, I would check PC1 vs PC3, too, as the AMR group should separate out on its own plane along PC3.

It does genuinely work very well, though. I built a predictive model from the eigenvectors previously, and made predictions on samples of known and unknown ethnicity. The model had >99% sensitivity and specificity - A: How to predict individual ethnicity information by using hapmap data

0
Entering edit mode

hello i want to plot an pca just like u but with my data can you give me some advice on what to do?

0
Entering edit mode

Hi, I need to see examples of your data and understand what you currently have. Otherwise, I cannot do anything.

0
Entering edit mode

hello kevin i also want to merge my data(data of 10 persons) with the 1000 genome data and want to plot a pca like the one above.the categories will be the existing ethnicity 'Population 1', 'Population 2', 'Population 3', 'Population 4', 'Population 5' and a label "my data" the plot will contatin pca of 1000 genome data along with my points (my samples). similar in the manner the user omg what i am doing has done. what i have to do can you please advise me.

0
Entering edit mode

I think that you basically repeat the process (above) independently for each dataset. When you come to step 9, however, there will be a few different steps:

• Step 9a: merge the 1000 Genomes data as per the tutorial
• Step 9b: find common variants between your dataset and the merged 1000 Genomes dataset (and filter both for these common variants)
• Step 9c: merge the 1000 Genomes data with your own data
• Step 10: perform PCA
1
Entering edit mode

I have now added these extra steps to the main tutorial.

0
Entering edit mode

hello i get this error at step 4

0
Entering edit mode

Hello. Can you confirm that every step prior to Step 4 has completed successfully?

0
Entering edit mode

yes every thing else was successfully completed

0
Entering edit mode

Are you sure? - I see no proof of that in your screenshot. What is the output of

ls -l

0
Entering edit mode

hello kevin !

is there is some thing wrong with my computer ? i can also screen share with you or send u my data if i will help you to help me i have 10 individual vcf samples and want the pca plot with my data labeled as "our data" thanks alot kevin fr your support and help..i relly appreciate your efforts!! stay safe and stay healthy

0
Entering edit mode

It looks like your BCF files have not yet been created. You need to re-run Step 4 (above)

0
Entering edit mode

Hello kevin

every step prior to Step 4 has completed successfully as you can see when i run ls -l command all the files are successfully downloaded

When i run step 4 i get this error

0
Entering edit mode

If you run the commands exactly as they appear in the tutorial, there should not be a problem. Can you confirm the shell that you are using.

Also, next time, can you show the value of:

ls -l | head


...and also paste the full command that you are using. When I say 'paste', can you paste the text here.

0
Entering edit mode

thanks a lot Kevin for helping me

ls -l | head gives this output

kk@DESKTOP-0SEI0MU:/mnt/d/biols -l |head total 17711460 -r-xr-xr-x 1 kk kk 155869 Jun 21 20:53 20130606_g1k.ped -r-xr-xr-x 1 kk kk 1216886729 Jun 21 03:36 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -r-xr-xr-x 1 kk kk 224660 Jun 21 03:36 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi -r-xr-xr-x 1 kk kk 773788987 Jun 21 11:54 ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -r-xr-xr-x 1 kk kk 133086 Jun 21 11:54 ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi -r-xr-xr-x 1 kk kk 767084423 Jun 21 13:52 ALL.chr11.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -r-xr-xr-x 1 kk kk 133331 Jun 21 13:52 ALL.chr11.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi -r-xr-xr-x 1 kk kk 740805962 Jun 21 15:14 ALL.chr12.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -r-xr-xr-x 1 kk kk 133171 Jun 21 15:29 ALL.chr12.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi  ADD REPLY 0 Entering edit mode Hi! So, you still need to re-run Step 4. It has evidently not completed successfully due to the fact that there are no BCF files. ADD REPLY 0 Entering edit mode hello kevin i apologies for taking your time but i followed every thing you can see that all files were downloaded every thing worked till step 3 but command at step 4 was not able to execute you can check as i have attached screen shot please let me know what i can do. Thank alot kevin stay safe ADD REPLY 0 Entering edit mode I am in good health. I trust that you are, too. Espero que estés bien. If you run this, is there a different error message? bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \ ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \ bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \ bcftools norm -Ob --rm-dup both \ > ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;  ADD REPLY 0 Entering edit mode Was this resolved? ADD REPLY 0 Entering edit mode Hello 17bcs030, Please refrain from using instant messaging/SMS jargon such as "fr", "u", etc. on professional and scientific forums. Write in complete sentences and use professional language. ADD REPLY 0 Entering edit mode I will try not to use these jargons i will use complete sentences and use professional language only, My sincere apologies ADD REPLY 0 Entering edit mode Hi Kevin, thank you for your tutorial. I want to merge my own data with 1kg and plot PCA-bi plot, I have some questions for these: Q1:In "step7, Prune variants from each chromosome", why don't you exclude long high-LD regions? Isn't that necessary? Q2: 1）a. Prune 1kg and my own dataset respectively ; b. Merge the two datasets 2) a. Merge the two datasets; b. Prune them Is there a difference between the two strategies? Q3: Does 1kg data need to be quality control? e.g.: --geno --mind --maf Q4: I‘m not very clear about this command: --indep 50 5 1.5 , if I removed same individuals or variants from my data, Do I need to regenerate a *prune.in file ? Thank you for your time in helping me with these questions. ADD REPLY 0 Entering edit mode Hey, Q1:In "step7, Prune variants from each chromosome", why don't you exclude long high-LD regions? Isn't that necessary? You can do that. Actually, pruning for LD is performed via the --indep flag (see your question 4) Q2: 1）a. Prune 1kg and my own dataset respectively ; b. Merge the two datasets 2) a. Merge the two datasets; b. Prune them Is there a difference between the two strategies? Yes, there is a difference, and one will obtain different results depending on the order of these operations. I have not tested which order is preferential over the other. Q3: Does 1kg data need to be quality control? e.g.: --geno --mind --maf It is already of good quality from source. However, in this tutorial, I provide a normalisation step to split multi-allelic calls in step 4. Q4: I‘m not very clear about this command: --indep 50 5 1.5 , if I removed same individuals or variants from my data, Do I need to regenerate a *prune.in file ? Thank you for your time in helping me with these questions. This is pruning based on LD. More information here: https://www.cog-genomics.org/plink/1.9/ld It would be altered if samples were removed - yes. ADD REPLY 0 Entering edit mode Hi Kevin Blighe, thanks for this incredible tutorial! I was wondering: having obtained the plink.eigenval and plink.eigenvec files using plink's --pca function, is there an objective way of identifying how many population covariates should be included in downstream genetic analyses (e.g. generation of polygenic risk scores or eQTL work)? I am following this other post, but I am not sure I understand what's this objective way. Should I analyse PC 1 by 2, 2 by 3, 3 by 4, 4 by 5 and so on, until I find a normal distribution? Is there a "numeric" score per PC that could be useful to make this more objective (e.g., include only principal components as covariates if they are associated with eigenvalues > 1.5?). Any clarification on this would be much appreciated! Also, plink automatically calculates 20 PCs. If I ask plink to calculate 10 PCs, will the first few PCs be different between the two different runs (when I asked plink to calculate 10 vs when it calculated 20)? I'll be testing this later on so I can answer this one in a while - I am assuming that values will change. Finally, if I have a variable like "genotype chip name" (samples from different institutions were genotyped using a different SNP chips), is this something I can include as a factor in the calculation of my PCs? (Or is this something that will drive the PC values I obtain anyway, so no need to do this?) Muito obrigado, caríssimo!! And I am sorry if these questions are too basic. ADD REPLY 1 Entering edit mode Boa tarde! Some people just inspect the bi-plots and make a decision about PCs to include (as covariates) in that way. In my opinion, however, it is better to regress the PCs against population, and then retain the statistically significant PCs as covariates. For example: glm(population ~ PC1, ...) glm(population ~ PC2, ...) ... glm(population ~ PC[x], ...)  , or: lm(PC1 ~ population, ...) lm(PC2 ~ population, ...) ... lm(PC[x] ~ population, ...)  The values to use for the PCs should be in the eigenvec file (?). The PCs to retain are those that pass p<0.05. Also, plink automatically calculates 20 PCs. If I ask plink to calculate 10 PCs, will the first few PCs be different between the two different runs (when I asked plink to calculate 10 vs when it calculated 20)? I'll be testing this later on so I can answer this one in a while - I am assuming that values will change. I am not sure, but I I think that the first 10 will always be the same. Finally, if I have a variable like "genotype chip name" (samples from different institutions were genotyped using a different SNP chips), is this something I can include as a factor in the calculation of my PCs? (Or is this something that will drive the PC values I obtain anyway, so no need to do this?) This is likely a batch covariate, yes, and it may be altering the PCA calculations. The biggest worry is that the different chips / arrays may be genotyping different SNPs, or different strands (coding / non-coding; plus / minus) of the same SNP. Some upstream pre-processing (before PCA) should be performed to ensure alignment across the different chips, I think. ADD REPLY 0 Entering edit mode Hello Kevin, I´m sorry for bothering you. I´m trying to convert 1000 genomes vcf file (ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes) to plink format and merge it with my in-house dataset (already in plink format) so I can perform admixture analysis and maybe PCA also. I would need to follow all the steps you detailed here? What would be the problem if I just do the conversion of the vcf with plink --vcf ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes --out cromX1KG ? I would be very gratefull with your clarification!! Iriel ADD REPLY 0 Entering edit mode Hey, no problem. In this tutorial, I do not focus on any allosome (chrX or Y); however, the data is available at the FTP sites: Yes, you can follow some of the steps here in order to merge your data with the 1000 Genomes chrX data, primarily Steps 4 + 5, and then follow advice Here ADD REPLY 1 Entering edit mode Thanks for the answer! I will try to follow your instructions. Iriel ADD REPLY 0 Entering edit mode De nada! There is technically no right or wrong way in which to merge the datasets. The 'merge' can occur at different stages of processing. Also, you may not want to do, for example, Step 7, from my tutorial above. ADD REPLY 0 Entering edit mode Hi Kevin, I have tried steps 4 and 5 of your tutorial on chromosome X vcf from 1000 genomes, then I filtrate my in house dataset of chromosome X SNPs to only keep SNPs on the + strand (according to Illumina microarray files). I also found the common SNPs using the intersect function from R and extracted those common SNPs from each dataset. The problem comes out when I try to merge both datasets with plink plink --bfile chrXKG2_int --bmerge cromXplus_int.bed cromXplus_int.bim cromXplus_int.fam --make-bed --out cromXmerged  Plink throws an error saying that there are several variants with 3+ alleles present, that this could be due to strand inconsistency. I think I should do the same selection of + strand genotyped SNPs in 1000 genomes dataset, do you have any command to do that? Thanks in advance for your answer! ADD REPLY 1 Entering edit mode Yes, you would need to filter the 1000 Genomes dataset for these coding (+) strand variants too. I believe the PLINK flag to do this is --exclude ADD REPLY 0 Entering edit mode Hola kevin gracias por un tutorial tan maravilloso I have my few sample files and I want to merge them with the 1000 genome and want to perform the PCA. I followed your great tutorial but I get an error at step 4. I am using bcftools 1.3 and I have performed all steps 1,2,3. Here you can see I have downloaded all the 1000 genome files along with the reference fasta When I run the Step 4 I get this error When I run this:  bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \ ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \ bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \ bcftools norm -Ob --rm-dup both \ ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;  I get bcf for chr1 only that too incomplete(few bytes in size), if I replace chr1 with chr2 I again get the same segmentation error I am using a laptop with 16 GB of ram. I am badly struck. thanks a lot for your help Mantente seguro DIOS TE BENDIGA, REZO POR TU BUENA SALUD ADD REPLY 0 Entering edit mode Please do not paste screenshots of plain text content - it is counterproductive. Use GitHub gists to show us the actual text content instead. See this post on how to use GitHub gists: A: How to Use Biostars Part-3: Formatting Text and Using GitHub Gists ADD REPLY 0 Entering edit mode greetings Thanks alot for your valuable suggestion i will make sure to follow the method you have prescribed ADD REPLY 0 Entering edit mode Hi, you seem to be using two accounts: krishnakatyal5121 and 17bcs030 Please do not do this. Stick to one account. Send an email to the administrator (admin at biostars dot org) and request that your accounts be merged. ADD REPLY 0 Entering edit mode Greetings I have mailed the administrator,If I have caused any incoviniance I ask for apologies. Thanks alot , I request if you could help me with my problem.stay safe ADD REPLY 0 Entering edit mode Greetings, I am a complete beginner. I was trying to perform step 1, but got stuck. Get the following output where it says, NO SUCH FILE, this is for all chromosomes...Hope someone will explain what i have done wrong. Thanks Lax --2021-04-21 17:09:08-- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz => 'ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz' Resolving ftp.1000genomes.ebi.ac.uk (ftp.1000genomes.ebi.ac.uk)... 193.62.193.140 Connecting to ftp.1000genomes.ebi.ac.uk (ftp.1000genomes.ebi.ac.uk)|193.62.193.140|:21... connected. Logging in as anonymous ... Logged in! ==> SYST ... done. ==> PWD ... done. ==> TYPE I ... done. ==> CWD (1) /vol1/ftp/release/20130502 ... done. ==> SIZE ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz ... done. ==> PASV ... done. ==> RETR ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz ... **No such file** 'ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz'.  ADD REPLY 1 Entering edit mode Hi Lax, I just tried this and got the same error as you when typing in the tutorial commands as written. If you go to the ftp url from the prefix in your browser, (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/), you'll see that the file names are slightly different: version b instead of version a as written in the original tutorial (e.g. ...integrated_v5b.20130502...). When I updated the suffix text, the download worked for me. Rewritten step 1 with updated file name suffix: prefix="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr" ; suffix=".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz" ; for chr in {1..22}; do wget "{prefix}""${chr}""${suffix}" "${prefix}""${chr}""{suffix}".tbi ; done  Hope this helps! ADD REPLY 0 Entering edit mode Hi Kevin I have used the --make-pgen flag which give me .pgen, psam, and pvar, after that I used pca command and got .eigenvec and .eigenval. I want to draw a plot using R but the script here used PED. how can I plot in R for Psam, pvar and pgen? ADD REPLY 0 Entering edit mode Hi, --make-pgen just creates a binary dataset. Can you not recode this as plain text PED? Sorry, I am tired and will check back in the morning ADD REPLY 0 Entering edit mode > R Error: object 'R' not found > > options(scipen=100, digits=3) > > # read in the eigenvectors, produced in PLINK > eigenvec <- read.table('plink2.eigenvec', header = FALSE, skip=0, sep = ' ') > rownames(eigenvec) <- eigenvec[,2] Error in [.data.frame(eigenvec, , 2) : undefined columns selected > eigenvec <- eigenvec[,3:ncol(eigenvec)] Error in [.data.frame(eigenvec, , 3:ncol(eigenvec)) : undefined columns selected > colnames(eigenvec) <- paste('Principal Component ', c(1:20), sep = '') Error in names(x) <- value : 'names' attribute [20] must be the same length as the vector [1]  ADD REPLY 0 Entering edit mode It comes across as a bit rude to paste some errors and provide no commentary - sorry. By having R there, I am invoking the R environment for you, i.e., you type R at your command prompt in order to load R. Regarding the other errors, please show the first few lines of eigenvec ADD REPLY 0 Entering edit mode sorry, my understanding is very low in bioinformatics and I want to plot the PCA graph using Rstudio. i have used plink for the generation of pca. i have virus vcf file of SNPs which I converted to .psam, pvar, and pgen. after that i use PCA command and finally got two file eigenvec and eigenv. after that i put the script in Rstudio that you provide in the tutorial. (R options(scipen=100, digits=3) # read in the eigenvectors, produced in PLINK eigenvec <- read.table('plink.eigenvec', header = FALSE, skip=0, sep = ' ') rownames(eigenvec) <- eigenvec[,2] eigenvec <- eigenvec[,3:ncol(eigenvec)] colnames(eigenvec) <- paste('Principal Component ', c(1:20), sep = '')  I HAVE to try it with both typing the R and without typing the R in the command prompt still show error. ADD REPLY 0 Entering edit mode I only added the R there to indicate that you then need to move from the shell command prompt to the R environment. In your case, you just need to move to RStudio. ADD REPLY 0 Entering edit mode which library is need to plot PCA graph in R. ggplot, tydiverse ADD REPLY 0 Entering edit mode You just need base R functions. I am guessing that your input data is different, for some reason. How did you generate the plink.eigenvec file? - please show the first few lines from it. ADD REPLY 0 Entering edit mode below is the Ist lines of plink.eigenvec #IID PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 MH218731.1 0.0152208 0.00944359 0.0103379 -0.016281 -0.0104166 -0.0147989 -0.00897963 0.0352926 -0.00974202 -0.218918 MG572182.1 0.0377849 -0.0284037 0.000547742 0.00247451 0.00140346 0.0019857 0.0104293 -0.0535947 -0.0190876 -0.0980317 MG746041.1 0.047474 -0.0575027 -0.00575743 0.0124768 0.00614564 0.011023 0.00851316 -0.0434487 -0.0448973 -0.000333792 MG746042.1 0.047592 -0.0572615 -0.00527609 0.0126947 0.00621977 0.0110911 0.00909679 -0.041873 -0.0446051 0.000632583 MG746043.1 0.0473333 -0.0576048 -0.00583835 0.0118375 0.00520263 0.0110696 0.00909136 -0.0422423 -0.0435948 -0.000432457 MG746044.1 0.0473676 -0.0576595 -0.00571874 0.0119169 0.00565738 0.010653 0.00851585 -0.0424485 -0.0444913 -0.000936355 MG746045.1 0.0473575 -0.057696 -0.0058114 0.0120366 0.00569223 0.0107174 0.0087301 -0.0424384 -0.0439578 0.000366145 MH671553.1 0.0350993 -0.0216172 0.00691768 -0.00643424 -0.00312974 -0.0278186 -0.0265101 0.163726 -0.100952 0.0419335 MG746040.1 0.0471743 -0.058322 -0.00612036 0.0125193 0.00663086 0.0109957 0.00841344 -0.0417326 -0.0416117 0.00165737 MG746010.1 0.0471363 -0.0583 -0.00623652 0.0125998 0.00715682 0.0109305 0.00864048 -0.0419027 -0.0419498 0.00148289 MG746005.1 0.0473502 -0.0577398 -0.00630305 0.0123974 0.00685714 0.0109009 0.00909979 -0.0418438 -0.0434613 0.00155532 MG746006.1 0.0475522 -0.0575318 -0.00610206 0.0122633 0.00640293 0.0114584 0.00826131 -0.0416395 -0.0444442 0.000140273 MG746007.1 0.0475734 -0.0572951 -0.00571316 0.0122144 0.00648051 0.0105819 0.00822079 -0.041815 -0.0413793 0.00127995 MG746008.1 0.0477568 -0.0571485 -0.00535316 0.0124548 0.00599396 0.0105851 0.00793466 -0.0419781 -0.0430056 -0.0027565  ADD REPLY 0 Entering edit mode Hi, I got an error when running ‘gunzip human_g1k_v37.fasta.gz’ in step 3. The error message is 'unexpected end of file'. Is the file damaged? Thank you. ADD REPLY 0 Entering edit mode Can you confirm that the download completed? ADD REPLY 0 Entering edit mode It shows that it is done. Is the compressed file 115 KB? By the way, when I used a decompress software, although it showed that the file cannot be extracted properly, there was an output file. In the output file, there are many rows consisting of N after the first row, and then comes T, A, C, G. Is it normal? I am afraid that the file has some differences compared to the original one. ADD REPLY 0 Entering edit mode No, human_g1k_v37.fasta.gz is just a gunzipped version of the hg19 reference genome; so, ~900MB in size. Can you see it by going here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/ For the record, I have an encoding algorithm that can 'zip' this genome to below that of 2bit. ADD REPLY 0 Entering edit mode Yes, I can see the file. But when I downloaded it, the download just finished when the file size is 115 KB. ADD REPLY 0 Entering edit mode Please check available disk space, etc. Apologies if you have already done this. ADD REPLY 0 Entering edit mode Hi Charles, I am having the same issue. Do you manage to solve it? gunzip human_g1k_v37.fasta.gz  gzip: human_g1k_v37.fasta.gz: unexpected end of file ADD REPLY 0 Entering edit mode Can you please show the complete output after you have run: wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz ;  ADD REPLY 0 Entering edit mode wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz --2021-06-21 10:34:13-- http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz Resolving ftp.1000genomes.ebi.ac.uk (ftp.1000genomes.ebi.ac.uk)... 193.62.197.77 Connecting to ftp.1000genomes.ebi.ac.uk (ftp.1000genomes.ebi.ac.uk)|193.62.197.77|:80... connected. HTTP request sent, awaiting response... 200 OK Length: unspecified [application/octet-stream] Saving to: ‘human_g1k_v37.fasta.gz’ [ <=> ] 3,424,591 257KB/s in 13s  2021-06-21 10:34:26 (263 KB/s) - ‘human_g1k_v37.fasta.gz’ saved [3424591] This seems to be a Ubuntu issue. I used Mac to do wget and gunzip. It output gunzip: human_g1k_v37.fasta.gz: trailing garbage ignored but also the human_g1k_v37.fasta. I can proceed with the fasta without problem. ADD REPLY 0 Entering edit mode Okay, that is strange. I just tried wget (Ubuntu 16.04) and my connection was refused. I am now trying with curl and it seems okay. curl http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz -O human_g1k_v37.fasta.gz  ADD REPLY 0 Entering edit mode Hi Kevin, thank you for this tutorial! I am following along to combine my own microarray data with 1000G for downstream analyses. Would it be possible for you to expand on the reason for Step 6 (excluding variants not on the coding strand). I'm wondering why this step is necessary if I have aligned both the microarray data and 1000G data to the same reference genome. Thank you in advance for explaining further. ADD REPLY 0 Entering edit mode Hi Olivia, we need Step 6 when using array data due to the fact that array probes can target the sense (coding) or non-sense (non-coding) strands, whereas NGS (used by 1000 Genomes) only sequences the coding strand. Integrating array data with the 1000 Genomes dataset can be tricky - I hope that you manage to do it. ADD REPLY 0 Entering edit mode Hi Kevin, Thank you for such a comprehensive tutorial. I am using 1K data in build 38. While using R to draw the plots, I do not have any individuals left in the PED file. It comes as NA instead of TRUE. I have tried to search a different pedigree file but have not found one. Any suggestions are appreciated. ADD REPLY 0 Entering edit mode Hi, there's not much that I can do to help, right now. You have not shown any code or examples of input data. ADD REPLY 0 Entering edit mode Hi Kevin, I am using the R script you have shown above to draw the plots. #Read the eigenvec file eigenvec <- read.table('plink.eigenvec', header = FALSE, skip=0, sep = ' ') rownames(eigenvec) <- eigenvec[,2] eigenvec <- eigenvec[,3:ncol(eigenvec)] #Read the PED file colnames(eigenvec) <- paste('Principal Component ', c(1:20), sep = '') PED <- read.table('20130606_g1k.ped', header = TRUE, skip = 0, sep = '\t') PED <- PED[which(PEDIndividual.ID %in% rownames(eigenvec)), ]
PED <- PED[match(rownames(eigenvec), PED$Individual.ID),] all(PED$Individual.ID == rownames(eigenvec)) == TRUE
[1] NA

Below is the eigenvec file:

Principal Component 1 Principal Component 2 Principal Component 3 Principal Component 4
82008            0.00664539            0.00406668            0.00301638          -0.003486120
82009            0.00713539            0.00526112            0.00220775          -0.000691655
82010            0.00689200            0.00476822            0.00343645          -0.001918870
82013            0.00674223            0.00397111            0.00457880          -0.002049680

0
Entering edit mode

Well, the problem is in the code chunk, #Read the PED file. Can you re-check that you have copied my code directly? I can already see that one of your lines is misplaced (out of place).

You will know that it is functioning correctly when this line returns TRUE:

all(PED\$Individual.ID == rownames(eigenvec)) == TRUE

0
Entering edit mode

Thank you. That is true. However, the code is the same as yours and I still get NA. Manually comparing, I see that there are no common individuals in these two files (the ids are different in the fam file of 1K in build 38 and the ped file). That is why, I was wondering whether there is another PED file for build 38 that I am not able find.

0
Entering edit mode

Please use ADD REPLY when responding to existing posts. Do not add comments as new answers to this thread.

0
Entering edit mode

I have not looked in depth at the hg38 data, unfortunately. The participant IDs should be the same [I would hope...]. Can you show an example of how the IDs appear in plink.eigenvec?

Traffic: 2189 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.