Question

What is the best way to perform a PCA on .vcf of closely related bacteria?

0

Entering edit mode

3.0 years ago

d.s.account • 0

Since bacterial variant calling files (.vcf) do not have GT entries, I got stuck while using PLINK.

What options are available to still perform a PCA on this prokaryotic data? (pipeline? tutorial?) By the way, the .fastq sequencing file is obtained from a mixed culture.

My data specification: I have mapped raw reads obtained from evolved E. coli genome (.fastq file) to my reference genome. I did this with a few different samples. I want to see the how different or similar the samples are, respective their variants (.vcf). Thus, I exported the .vcf, which looks something like the header example shown below (total >50 lines):

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Variants:_LGC19-XL01_S63_L001_R_001_(trimmed)   Variants:_Trimmed_LGC19-VP03_S3_L001_R_001
CFC381_K12_Bw25113  172181  .   T   G   70.56   .   NS=1;RF=0.829;VF=0.171;SB=1;SB50=0.031;SB65=0.15;TYPE=SNP(transversion);CDSCN=196;CDSP=588;CDSPWC=3;CDS=clcACDS;CDNCHG=GGT->GGG;AACHG;PE=None;AVQUAL=22 DP:AO   .:. 35:6
CFC381_K12_Bw25113  269662  .   T   C   100.41  .   NS=1;RF=0.774;VF=0.226;SB=0.81;SB50=0.0072;SB65=0.18;TYPE=SNP(transition);CDSCN=369;CDSP=1105;CDSPWC=1;CDS=ykfCCDS;CDNCHG=TTA->CTA;AACHG;PE=None;AVQUAL=14  DP:AO   .:. 93:21

Now I have merged many of these .vcf (by bcftools).

variant-calling • 1.6k views

ADD COMMENT • link updated 28 days ago by Ram 43k • written 3.0 years ago by d.s.account • 0

score 0 · Answer 1 · 2021-04-28

After I did some research, I could run the protocol below to finally get MyVCF.bed, MyVCF.bim, MyVCF.eigenval, MyVCF.eigenvec, MyVCF.fam, MyVCF.log, MyVCF.nosex. These files can be plotted by [R] to get the PCA.

I used two plink2 commands on a merged .vcf to receive these files:

plink2 --vcf MyVCF.vcf --make-bed --allow-extra-chr --double-id --out MyVCF
plink2 --bfile MyVCF --pca --allow-extra-chr --double-id --out MyVCF

[Timing: Seconds (for a merged file of 4 individua)]

However, you need to have this MyVCF.vcf file first, which is essentially several variant calling files from that were merged. But to make the microbial .vcf compatible with PLINK, the .vcf must first be extended with the GT type, subsequently gzipped so that it can be merged by bcftools (you need to have bcftools installed), and merged. So the protocol starts here!

awk -F'\t' -v OFS="\t" '{ if(/^#/){print}else{$9="GT:"$9;$10="1:"$10;print}}' Freddy.vcf > FreddyG.vcf

This converts each line in the file such that a GT-field is added. Then you also need to manually change the header: Its okay to just insert the next line (e.g. above: ##FORMAT=<ID=DP,Number=1,Type )

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

For the curious, the correct format can be found here: https://www.internationalgenome.org/wiki/Analysis/vcf4.0/

bgzip -c FreddyG.vcf > FreddyG.vcf.gz
bcftools index FreddyG.vcf.gz

if one of your .vcf file causes an error like this:

failed to create index for "FreddyG.vcf.gz" Unsorted positions on sequence #1: 3372253 followed by 3372252

it can be solved by the console and using a so called "subshell". First print the header and then the variant lines, sorting by chromosome and position, execute code below and use "bgzip" and "bcftools index" afterwards:

(grep ^"#" INPUT.vcf; grep -v ^"#" INPUT.vcf | sort -k1,1 -k2,2n) > sorted.vcf

Then you want to merge .vcf of different individua, do the following :

bcftools merge FreddyG.vcf.gz sorted.vcf.gz FreddyG_three.vcf.gz FreddyG_four.vcf.gz --output MyVCF

Now you can execute the two plink2 jobs that were listed first.