Tutorial: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2
46
gravatar for Kevin Blighe
23 months ago by
Kevin Blighe63k
Kevin Blighe63k wrote:

Note1 - Previous version: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format (old)

Note2 - this data is for hg19 / GRCh37

Note3 - GRCh38 data is available HERE


The tutorial has been updated based on the 1000 Genomes Phase III imputed genotypes. The original tutorial was performed on non-imputed data held at the University of Washington, which is no longer accessible.

Other changes:

  • tutorial now entirely streamlined - all commands, including in R, are now included
  • duplicate variants are now removed with BCFtools, not PLINK (previous Step 6 removed)
  • now only performs PCA (originally, MDS was also performed but never used)
  • no longer using chrX variants (only autosomal variants)
  • new Step 3, indicating how to download the 1000 Genomes GRCh37 reference build

Program requirements:

  • plink > v1.9
  • BCFtools (tested on v 1.3)

Disk space requirements:

  • downloaded data (VCF.gz and tab-indices), ~ 15.5 GB
  • converted BCF files and their indices, ~14 GB
  • binary PLINK files, ~53 GB
  • pruned PLINK binary files, ~ <1 Gb

1, Download the files as VCF.gz (and tab-indices)

prefix="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr" ;

suffix=".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz" ;

for chr in {1..22}; do
    wget "${prefix}""${chr}""${suffix}" "${prefix}""${chr}""${suffix}".tbi ;
done

2, Download 1000 Genomes PED file

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped ;

3, Download the GRCh37 / hg19 reference genome

wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz ;

wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai ;

gunzip human_g1k_v37.fasta.gz ;

4, Convert the 1000 Genomes files to BCF

  • Ensure that multi-allelic calls are split and that indels are left-aligned compared to reference genome (1st pipe)
  • Sets the ID field to a unique value: CHROM:POS:REF:ALT (2nd pipe)
  • Removes duplicates (3rd pipe)

-I +'%CHROM:%POS:%REF:%ALT' means that unset IDs will be set to CHROM:POS:REF:ALT

-x ID -I +'%CHROM:%POS:%REF:%ALT' first erases the current ID and then sets it to CHROM:POS:REF:ALT

for chr in {1..22}; do
    bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \
      ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \
      bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \
        bcftools norm -Ob --rm-dup both \
          > ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;

    bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;
done

5, Convert the BCF files to PLINK format

for chr in {1..22}; do
    plink --noweb \
      --bcf ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf \
      --keep-allele-order \
      --vcf-idspace-to _ \
      --const-fid \
      --allow-extra-chr 0 \
      --split-x b37 no-fail \
      --make-bed \
      --out ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes ;
done

6, Exclude variants not on the coding strand

NB - This step is only for microarray studies where the probes may only target one strand or the other (sense or non-sense)

7, Prune variants from each chromosome

--maf 0.10, only retain SNPs with MAF greater than 10%
--indep [window size] [step size/variant count)] [Variance inflation factor (VIF) threshold]

e.g. indep 50 5 1.5, Generates a list of markers in approx. linkage equilibrium - takes 50 SNPs at a time and then shifts by 5 for the window. VIF (1/(1-r^2)) is the cut-off for linkage disequilibrium

mkdir Pruned ;

for chr in {1..22}; do
    plink --noweb \
      --bfile ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes \
      --maf 0.10 --indep 50 5 1.5 \
      --out Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes ;

    plink --noweb \
      --bfile ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes \
      --extract Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.prune.in \
      --make-bed \
      --out Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes ;
done

8, Get a list of all PLINK files

find . -name "*.bim" | grep -e "Pruned" > ForMerge.list ;

sed -i 's/.bim//g' ForMerge.list ;

9, Merge all projects into a single PLINK file

plink --merge-list ForMerge.list --out Merge ;

NB - if you have your own data that you want to merge with 1000 Genomes

Process the 1000 Genomes data as per this tutorial from Steps 1-9. In this way, you will have already identified the population-specific variants / markers. Then, after Step 9, do

  • find common variants between your dataset and the merged 1000 Genomes dataset (and filter both for these common variants)
  • merge the 1000 Genomes data with your own data
  • proceed to Step 10

Depending on its size, your own dataset may be divided by chromosome; so, you may have to do some pre-processing before aligning to 1000 Genomes. Either way, the population specific markers will be defined by just the 1000 Genomes dataset (Step 7). If your dataset is microarray, you'll have to pre-filter it for coding (plus / +) strand variants (Step 6).

10, Perform PCA

plink --bfile Merge --pca

11, Generate plots in R

R

options(scipen=100, digits=3)

# read in the eigenvectors, produced in PLINK
eigenvec <- read.table('plink.eigenvec', header = FALSE, skip=0, sep = ' ')
rownames(eigenvec) <- eigenvec[,2]
eigenvec <- eigenvec[,3:ncol(eigenvec)]
colnames(eigenvec) <- paste('Principal Component ', c(1:20), sep = '')

# read in the PED data
PED <- read.table('20130606_g1k.ped', header = TRUE, skip = 0, sep = '\t')
PED <- PED[which(PED$Individual.ID %in% rownames(eigenvec)), ]
PED <- PED[match(rownames(eigenvec), PED$Individual.ID),]
all(PED$Individual.ID == rownames(eigenvec)) == TRUE
[1] TRUE

# set colours
require('RColorBrewer')

# from: http://www.internationalgenome.org/category/population/
PED$Population <- factor(PED$Population, levels=c(
  "ACB","ASW","ESN","GWD","LWK","MSL","YRI",
  "CLM","MXL","PEL","PUR",
  "CDX","CHB","CHS","JPT","KHV",
  "CEU","FIN","GBR","IBS","TSI",
  "BEB","GIH","ITU","PJL","STU"))

col <- colorRampPalette(c(
  "yellow","yellow","yellow","yellow","yellow","yellow","yellow",
  "forestgreen","forestgreen","forestgreen","forestgreen",
  "grey","grey","grey","grey","grey",
  "royalblue","royalblue","royalblue","royalblue","royalblue",
  "black","black","black","black","black"))(length(unique(PED$Population)))[factor(PED$Population)]

# generate PCA bi-plots
project.pca <- eigenvec
summary(project.pca)


par(mar = c(5,5,5,5), cex = 2.0,
  cex.main = 7, cex.axis = 2.75, cex.lab = 2.75, mfrow = c(1,2))

plot(project.pca[,1], project.pca[,2],
  type = 'n',
  main = 'A',
  adj = 0.5,
  xlab = 'First component',
  ylab = 'Second component',
  font = 2,
  font.lab = 2)
points(project.pca[,1], project.pca[,2], col = col, pch = 20, cex = 2.25)
legend('bottomright',
  bty = 'n',
  cex = 3.0,
  title = '',
  c('Population 1', 'Population 2', 'Population 3',
    'Population 4', 'Population 5'),
  fill = c('yellow', 'forestgreen', 'grey', 'royalblue', 'black'))

plot(project.pca[,1], project.pca[,3],
  type="n",
  main="B",
  adj=0.5,
  xlab="First component",
  ylab="Third component",
  font=2,
  font.lab=2)
points(project.pca[,1], project.pca[,3], col=col, pch=20, cex=2.25)

biplot-new

Kevin

plink pca tutorial 1000 genomes • 10k views
ADD COMMENTlink modified 16 days ago • written 23 months ago by Kevin Blighe63k

Hi Kelvin, thank you very much for your tutorial. It helps me a lot to plot the PCA-bi plot of 1kg Phrase III samples.

My question is, if I would like to plot my own sample's PCA with 1kg sample's PCA (My samples were genotyped by SNP array. There were around 700k SNPs. I want to check whether my samples came from certain ethical group), shall I directly combine the eigenvec calculated from my sample and eigenvec got from your step 10 and then plot? Can the two eigen vectors comparable?

I ask this question because in my understanding, PC is the linear combination of independent variables. In my sample, PCs are the linear combination of the pruned genotyped SNPs whereas in 1kg samples, their PCs are the linear combination of other pruned (and much more) SNPs. I doubt whether I can directly compare the PCs. If I can't compare my sample's PCs with 1kg's in this way, how can I compare the ethical group of my sample to the reference 1kg samples?

Thank you very much for your possible reply in advance.

ADD REPLYlink written 22 months ago by Sunshine n Rain20

Hey, my apologies for I seem to have missed this. It is better to combine them within PLINK and then perform PCA on the combined dataset. I have done this many times and it usually works, in the sense that sample ethnicities line up with the expected group from 1000 Genomes.

ADD REPLYlink written 21 months ago by Kevin Blighe63k

Hi Kevin,

Does the following command do a liftover of 1000 genomes to Hg19. I need to do QC on 1000 genomes data before using for LD clumping and as I haven't done this before, I'm a bit unsure. I followed your steps until this point and then proceeded to convert to plink, removed SNPs based on call rate and HWE and will then proceed to use them for LD clumping

for chr in {1..22}; do bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \ ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \

bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' |

bcftools norm -Ob --rm-dup both \
> ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;

bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;

done

Thanks

ADD REPLYlink modified 21 months ago • written 21 months ago by kl0

Hey, I am not sure what you mean by 'liftover' in this context (?). If you want to do LD clumping, then you just need to complete steps 1-5. This will take ~1 day, though. The imputed 1000 Genomes data is 'quite' dense.

ADD REPLYlink written 21 months ago by Kevin Blighe63k

The commands that you have pasted into your comment do the following:

  • bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta - checks that the REF variants in the 1000 Genomes VCF files corresponds to the base in the reference genome, human_g1k_v37.fasta.
  • bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' - annotates the ID field in the 1000 Genomes VCF files to be, for example, chr1:12345645:A:T
  • bcftools norm -Ob --rm-dup both - the 1000 Genomes data contains duplicate variants, which have to be removed for PLINK
ADD REPLYlink written 21 months ago by Kevin Blighe63k

Hi Kelvin, thank you very much for your tutorial. It really helped me a lot.

I am trying to check that the 89 individuals I am working with are of European origin by comparing them with 1000 Genomes dataset. However, the data I have for my samples was obtained from targeted sequencing of 1,293 regions. My main concern is whether I have to extract only SNPs falling inside my targeted regions from the 1000 Genomes dataset before merging with my 89 samples or, whether I should use all the information from 1000 Genomes.

Thank you for your possible reply in advance.

ADD REPLYlink written 20 months ago by mpinsach0

Hey, I previously achieved this by building a predictive model from the 1000 Genomes Data and then applying it to my own data, which was also targeted sequencing. See here: A: How to predict individual ethnicity information by using hapmap data

The other way to do it is filter the datasets so that they have the same starting list of variants, and then proceed with the tutorial from part 7. For this to work, it would help if your variants are all labeled by rs ID or as chr:bp:ref:var

ADD REPLYlink written 20 months ago by Kevin Blighe63k

Thanks so much for this tutorial, Kevin !!

ADD REPLYlink written 18 months ago by Mr Locuace100

¡De nada!, Mr Locuace

ADD REPLYlink written 18 months ago by Kevin Blighe63k

Hey Kelvin,

How to plot sub-population ("ACB","ASW","ESN","GWD","LWK","MSL","YRI", "CLM","MXL","PEL","PUR", "CDX","CHB","CHS","JPT","KHV", "CEU","FIN","GBR","IBS","TSI", "BEB","GIH","ITU","PJL","STU") in pca plot instead of "African","Hispanic","East-Asian","Caucasian","South Asian"?

ADD REPLYlink written 15 months ago by Abdul Rafay Khan1.1k
1

Hey, you will simply have to modify the input colour vector. You will have to take great care to ensure that the colours line up to the data that is being plotted.

ADD REPLYlink written 15 months ago by Kevin Blighe63k

Hi Kevin, Just a small typo, it's actually:

gunzip human_g1k_v37.fasta.gz ;

ADD REPLYlink written 12 months ago by Raony Guimarães1.1k

Thanks! / Obrigado!

ADD REPLYlink written 12 months ago by Kevin Blighe63k

thanks alot for this tutorial i have a vcf data file (my data)and i want to do pca how do i do that? i apologise for a naive question but i am a beginner and badly struck

ADD REPLYlink written 4 months ago by krishnakatyal51210
1

With your own data, your first step should be to import it to PLINK format, and, from there, you can perform PCA immediately. In order to obtain 'population informative' variants, though, you will have to do some of the filtering steps as I do in this tutorial. Basically, you can follow the same tutorial as here.

If you want to integrate your data with the 1000 Genomes data, then it can be a bit more difficult.

ADD REPLYlink written 4 months ago by Kevin Blighe63k

Thanks alot kevin I dont have any fasta file just a vcf (freebayes) and i want to perform pca to check out similarity of my sample with other ethnic groups do i need to have a fasta file of my data?

ADD REPLYlink written 3 months ago by krishnakatyal51210
1

Hello. No, you can start with the VCF. Your first task should be to convert this VCF to PLINK format and ensure that you have retained sample information in the PLINK files. Do you know how to do this? Are all of your samples from the same ethnic group?

ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe63k

Hello kevin I am using bcftools to normalize and annotate variants. link to my data a vcf of 10mb When i try to convert the data in bcf like the third step of this wonderful tutorial. I get an error the sequence chrM not found. The error message indicates that there are some variants on the mitochondrial chromosome but that this chromosome is either missing from your reference or annotated differently.(I am using the GRCh37 / hg19 fasta file).I have only one sample till now and the coming samples will have the same ethnicity. I want to perform PCA and see the similarity of my sample with the known ("African","Hispanic","East-Asian","Caucasian","South Asian"). can you please suggest me what to I apolosize for asking a naive question(I have just started out in this field) and my english is a so not very good so thanks alot for cooperating and helping me out.

ADD REPLYlink modified 3 months ago • written 3 months ago by krishnakatyal51210

Hello, you just need to remove the chrM variants from your input file. You can do this indirectly or directly in many different ways, such as: A: How to extract specific chromosome from vcf file

ADD REPLYlink written 3 months ago by Kevin Blighe63k

Hello kevin THANKS A LOT !!!

I remove the chroM varians from my data using the grep -w '^#|^#CHROM|^chr[1-22]' my.vcf > my_new.vcf (I hope it keeps the header and the 22 chr I need) as other methods were either giving index error or header error.

but now I get an errormy error

Now I am getting an error that sequence "chr1" not found

link to my data https://drive.google.com/file/d/1jKaJGI0UW3lYGxmhNUdBchZzMrb8ckZ-/view?usp=sharing

I want to get the pca plot of the variants but i am getting badly struck and thanks a lot for helping me

ADD REPLYlink modified 4 weeks ago by RamRS28k • written 3 months ago by krishnakatyal51210

Perhaps the header has been lost. Can you provide the output of

bcftools view -h my_new.vcf

?

For parsing VCFs, I usually use AWK. Using grep, I would have used something like this, I think:

grep -e "^chrM" -v my.vcf
ADD REPLYlink written 3 months ago by Kevin Blighe63k

Hello kevin i used an editor named as em editor to change the vcf file since the annotation in the fasta file and my vcf file was different i changed my vcf file accordingly chr1 became 1 chr 2 became 2 and so on and i removed the chrM from my data, by following this method might have preserved the header the result of bcftools view -h my_new.vcf result1 . result2 result3 resluty

ADD REPLYlink written 3 months ago by krishnakatyal51210

It seems that chrM is no longer present in the file. However, your contig names are just 1, 2, 3, 4, but the BCFtools norm command is searching for 'chr1', 'chr2', etc. It is looking for these due to the fact that these are the contig names used in human_g1k_v37.fasta.

To be honest, the use of this part is not necessary: --check-ref w -f human_g1k_v37.fasta You can just exclude it and use:

bcftools norm -m-any data.vcf \
...

What are the contig names in the 1000 Genomes data that downloaded? - you can use the BCFtools view -h command again.

ADD REPLYlink written 3 months ago by Kevin Blighe63k

Hello

`   bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \
 data.vcf | \

  bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \

    bcftools norm -Ob --rm-dup both \
      > data.bcf ;

` result : chrM not found error After removing the chrM the same command gave :chr1 not found error Then i edited the vcf file changed the chr1 into 1 and so on I now get the error after removing chrM and changing the vcf

I have tried everything i altered my vcf according to the fast but still i am not able to get the bcf file. Is there another way to do ethnicity regagrding pca?

THANKS ALOT ! I wish for your best health keep safe

ADD REPLYlink modified 3 months ago • written 3 months ago by krishnakatyal51210

So I have some (probably hispanic descent) tumor WGS samples in red dots on my PCA plot projected onto 1kg data like in this tutorial (btw thank you Kevin Blighe for pointing me to this tutorial), and I am wondering how closely can we expect unknown samples to follow any given 1kg clusters. My red dots, although we believe they are hispanic, seem to form a group all of their own, which could be because they all have the same cancer, but I would have thought they should cluster more closely with the AMR super-population, no?

Is there a way to verify how accurate my PC projection is?

my example PCA plot

ADD REPLYlink modified 3 months ago • written 3 months ago by omg what am I doing70

Hey, looks like you at least got through the tutorial successfully - it can be difficult, particularly when overlaying your own samples. From my experience, and I have done this a good few times, one's own samples can align quite well with the 1KG samples, once you have done the preparatory work correctly.

The AMR (Ad Mixed American) group is clearly the most diverse in the cohort, but it is sampling from quite a diverse pool:

  • MXL Mexican Ancestry from Los Angeles USA
  • PUR Puerto Ricans from Puerto Rico
  • CLM Colombians from Medellin, Colombia
  • PEL Peruvians from Lima, Peru

[ https://www.internationalgenome.org/category/population/ ]

There's a potential that those (your) samples could be verging into the African group, too. I have also seen Brazilian samples group in that region, previously.

If I were you, I would check PC1 vs PC3, too, as the AMR group should separate out on its own plane along PC3.

It does genuinely work very well, though. I built a predictive model from the eigenvectors previously, and made predictions on samples of known and unknown ethnicity. The model had >99% sensitivity and specificity - A: How to predict individual ethnicity information by using hapmap data

ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe63k

hello i want to plot an pca just like u but with my data can you give me some advice on what to do?

ADD REPLYlink written 3 months ago by 17bcs0300

Hi, I need to see examples of your data and understand what you currently have. Otherwise, I cannot do anything.

ADD REPLYlink written 3 months ago by Kevin Blighe63k

hello kevin i also want to merge my data(data of 10 persons) with the 1000 genome data and want to plot a pca like the one above.the categories will be the existing ethnicity 'Population 1', 'Population 2', 'Population 3', 'Population 4', 'Population 5' and a label "my data" the plot will contatin pca of 1000 genome data along with my points (my samples). similar in the manner the user omg what i am doing has done. what i have to do can you please advise me.

ADD REPLYlink modified 16 days ago by Kevin Blighe63k • written 3 months ago by 17bcs0300

I think that you basically repeat the process (above) independently for each dataset. When you come to step 9, however, there will be a few different steps:

  • Step 9a: merge the 1000 Genomes data as per the tutorial
  • Step 9b: find common variants between your dataset and the merged 1000 Genomes dataset (and filter both for these common variants)
  • Step 9c: merge the 1000 Genomes data with your own data
  • Step 10: perform PCA
ADD REPLYlink written 3 months ago by Kevin Blighe63k
1

I have now added these extra steps to the main tutorial.

ADD REPLYlink written 3 months ago by Kevin Blighe63k

hello i get this error at step 4 enter image description here

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by 17bcs0300

Hello. Can you confirm that every step prior to Step 4 has completed successfully?

ADD REPLYlink written 7 weeks ago by Kevin Blighe63k

yes every thing else was successfully completed enter image description here

ADD REPLYlink written 7 weeks ago by 17bcs0300

Are you sure? - I see no proof of that in your screenshot. What is the output of

ls -l
ADD REPLYlink written 7 weeks ago by Kevin Blighe63k

hello kevin ! enter image description here enter image description here

is there is some thing wrong with my computer ? i can also screen share with you or send u my data if i will help you to help me i have 10 individual vcf samples and want the pca plot with my data labeled as "our data" thanks alot kevin fr your support and help..i relly appreciate your efforts!! stay safe and stay healthy

ADD REPLYlink written 6 weeks ago by 17bcs0300

It looks like your BCF files have not yet been created. You need to re-run Step 4 (above)

ADD REPLYlink written 6 weeks ago by Kevin Blighe63k

Hello kevin

every step prior to Step 4 has completed successfully as you can see when i run ls -l command all the files are successfully downloaded enter image description here

When i run step 4 i get this error

ADD REPLYlink written 6 weeks ago by 17bcs0300

If you run the commands exactly as they appear in the tutorial, there should not be a problem. Can you confirm the shell that you are using.

Also, next time, can you show the value of:

ls -l | head

...and also paste the full command that you are using. When I say 'paste', can you paste the text here.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Kevin Blighe63k

thanks a lot Kevin for helping me

ls -l | head gives this output

kk@DESKTOP-0SEI0MU:/mnt/d/bio$ ls -l |head
total 17711460

-r-xr-xr-x 1 kk kk     155869 Jun 21 20:53 20130606_g1k.ped
-r-xr-xr-x 1 kk kk 1216886729 Jun 21 03:36 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
-r-xr-xr-x 1 kk kk     224660 Jun 21 03:36 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi
-r-xr-xr-x 1 kk kk  773788987 Jun 21 11:54 ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
-r-xr-xr-x 1 kk kk     133086 Jun 21 11:54 ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi
-r-xr-xr-x 1 kk kk  767084423 Jun 21 13:52 ALL.chr11.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
-r-xr-xr-x 1 kk kk     133331 Jun 21 13:52 ALL.chr11.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi
-r-xr-xr-x 1 kk kk  740805962 Jun 21 15:14 ALL.chr12.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
-r-xr-xr-x 1 kk kk     133171 Jun 21 15:29 ALL.chr12.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi
ADD REPLYlink modified 4 weeks ago by RamRS28k • written 4 weeks ago by 17bcs0300

Hi! So, you still need to re-run Step 4. It has evidently not completed successfully due to the fact that there are no BCF files.

ADD REPLYlink written 4 weeks ago by Kevin Blighe63k

hello kevin i apologies for taking your time but i followed every thing you can see that all files were downloaded every thing worked till step 3 but command at step 4 was not able to execute you can check as i have attached screen shot please let me know what i can do.

Thank alot kevin

stay safe

ADD REPLYlink modified 29 days ago by RamRS28k • written 29 days ago by 17bcs0300

I am in good health. I trust that you are, too. Espero que estés bien.

If you run this, is there a different error message?

bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \
  ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \
  bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \
    bcftools norm -Ob --rm-dup both \
      > ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;
ADD REPLYlink modified 29 days ago • written 29 days ago by Kevin Blighe63k

Was this resolved?

ADD REPLYlink written 23 days ago by Kevin Blighe63k

Hello 17bcs030,

Please refrain from using instant messaging/SMS jargon such as "fr", "u", etc. on professional and scientific forums. Write in complete sentences and use professional language.

ADD REPLYlink written 6 weeks ago by RamRS28k

I will try not to use these jargons i will use complete sentences and use professional language only, My sincere apologies

ADD REPLYlink written 6 weeks ago by 17bcs0300

Hi Kevin, thank you for your tutorial. I want to merge my own data with 1kg and plot PCA-bi plot, I have some questions for these: Q1:In "step7, Prune variants from each chromosome", why don't you exclude long high-LD regions? Isn't that necessary?

Q2: 1)a. Prune 1kg and my own dataset respectively ; b. Merge the two datasets 2) a. Merge the two datasets; b. Prune them Is there a difference between the two strategies?

Q3: Does 1kg data need to be quality control? e.g.: --geno --mind --maf

Q4: I‘m not very clear about this command: --indep 50 5 1.5 , if I removed same individuals or variants from my data, Do I need to regenerate a *prune.in file ? Thank you for your time in helping me with these questions.

ADD REPLYlink written 6 weeks ago by yuan0

Hey,

Q1:In "step7, Prune variants from each chromosome", why don't you exclude long high-LD regions? Isn't that necessary?

You can do that. Actually, pruning for LD is performed via the --indep flag (see your question 4)

Q2: 1)a. Prune 1kg and my own dataset respectively ; b. Merge the two datasets 2) a. Merge the two datasets; b. Prune them Is there a difference between the two strategies?

Yes, there is a difference, and one will obtain different results depending on the order of these operations. I have not tested which order is preferential over the other.

Q3: Does 1kg data need to be quality control? e.g.: --geno --mind --maf

It is already of good quality from source. However, in this tutorial, I provide a normalisation step to split multi-allelic calls in step 4.

Q4: I‘m not very clear about this command: --indep 50 5 1.5 , if I removed same individuals or variants from my data, Do I need to regenerate a *prune.in file ? Thank you for your time in helping me with these questions.

This is pruning based on LD. More information here: https://www.cog-genomics.org/plink/1.9/ld

It would be altered if samples were removed - yes.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Kevin Blighe63k

Hi Kevin Blighe, thanks for this incredible tutorial! I was wondering: having obtained the plink.eigenval and plink.eigenvec files using plink's --pca function, is there an objective way of identifying how many population covariates should be included in downstream genetic analyses (e.g. generation of polygenic risk scores or eQTL work)? I am following this other post, but I am not sure I understand what's this objective way. Should I analyse PC 1 by 2, 2 by 3, 3 by 4, 4 by 5 and so on, until I find a normal distribution? Is there a "numeric" score per PC that could be useful to make this more objective (e.g., include only principal components as covariates if they are associated with eigenvalues > 1.5?). Any clarification on this would be much appreciated!

Also, plink automatically calculates 20 PCs. If I ask plink to calculate 10 PCs, will the first few PCs be different between the two different runs (when I asked plink to calculate 10 vs when it calculated 20)? I'll be testing this later on so I can answer this one in a while - I am assuming that values will change.

Finally, if I have a variable like "genotype chip name" (samples from different institutions were genotyped using a different SNP chips), is this something I can include as a factor in the calculation of my PCs? (Or is this something that will drive the PC values I obtain anyway, so no need to do this?)

Muito obrigado, caríssimo!! And I am sorry if these questions are too basic.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by rodd100
1

Boa tarde! Some people just inspect the bi-plots and make a decision about PCs to include (as covariates) in that way. In my opinion, however, it is better to regress the PCs against population, and then retain the statistically significant PCs as covariates. For example:

glm(population ~ PC1, ...)
glm(population ~ PC2, ...)
...
glm(population ~ PC[x], ...)

, or:

lm(PC1 ~ population, ...)
lm(PC2 ~ population, ...)
...
lm(PC[x] ~ population, ...)

The values to use for the PCs should be in the eigenvec file (?). The PCs to retain are those that pass p<0.05.

Also, plink automatically calculates 20 PCs. If I ask plink to calculate 10 PCs, will the first few PCs be different between the two different runs (when I asked plink to calculate 10 vs when it calculated 20)? I'll be testing this later on so I can answer this one in a while - I am assuming that values will change.

I am not sure, but I I think that the first 10 will always be the same.

Finally, if I have a variable like "genotype chip name" (samples from different institutions were genotyped using a different SNP chips), is this something I can include as a factor in the calculation of my PCs? (Or is this something that will drive the PC values I obtain anyway, so no need to do this?)

This is likely a batch covariate, yes, and it may be altering the PCA calculations. The biggest worry is that the different chips / arrays may be genotyping different SNPs, or different strands (coding / non-coding; plus / minus) of the same SNP. Some upstream pre-processing (before PCA) should be performed to ensure alignment across the different chips, I think.

ADD REPLYlink written 5 weeks ago by Kevin Blighe63k

Hello Kevin, I´m sorry for bothering you. I´m trying to convert 1000 genomes vcf file (ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes) to plink format and merge it with my in-house dataset (already in plink format) so I can perform admixture analysis and maybe PCA also. I would need to follow all the steps you detailed here? What would be the problem if I just do the conversion of the vcf with plink --vcf ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes --out cromX1KG ? I would be very gratefull with your clarification!! Iriel

ADD REPLYlink written 23 days ago by irieljoerin10

Hey, no problem. In this tutorial, I do not focus on any allosome (chrX or Y); however, the data is available at the FTP sites:

Yes, you can follow some of the steps here in order to merge your data with the 1000 Genomes chrX data, primarily Steps 4 + 5, and then follow advice Here

ADD REPLYlink modified 23 days ago • written 23 days ago by Kevin Blighe63k
1

Thanks for the answer! I will try to follow your instructions. Iriel

ADD REPLYlink written 23 days ago by irieljoerin10

De nada! There is technically no right or wrong way in which to merge the datasets. The 'merge' can occur at different stages of processing. Also, you may not want to do, for example, Step 7, from my tutorial above.

ADD REPLYlink modified 23 days ago • written 23 days ago by Kevin Blighe63k

Hi Kevin, I have tried steps 4 and 5 of your tutorial on chromosome X vcf from 1000 genomes, then I filtrate my in house dataset of chromosome X SNPs to only keep SNPs on the + strand (according to Illumina microarray files). I also found the common SNPs using the intersect function from R and extracted those common SNPs from each dataset.

The problem comes out when I try to merge both datasets with plink

plink --bfile chrXKG2_int --bmerge cromXplus_int.bed cromXplus_int.bim cromXplus_int.fam --make-bed --out cromXmerged

Plink throws an error saying that there are several variants with 3+ alleles present, that this could be due to strand inconsistency.

I think I should do the same selection of + strand genotyped SNPs in 1000 genomes dataset, do you have any command to do that?

Thanks in advance for your answer!

ADD REPLYlink modified 16 days ago by RamRS28k • written 17 days ago by irieljoerin10
1

Yes, you would need to filter the 1000 Genomes dataset for these coding (+) strand variants too. I believe the PLINK flag to do this is --exclude

ADD REPLYlink modified 16 days ago • written 16 days ago by Kevin Blighe63k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 772 users visited in the last hour