Question: VCF file analysis to find genome wide association
gravatar for arash.iranzadeh1980
2.7 years ago by
arash.iranzadeh198030 wrote:


I have a huge vcf file (2.6 GB) including 1500 individuals from three phenotypes (500 individual from each phenotype). It contains 203846 SNPs and 7174 INDELs. The organism is pneumococcus which is a haploid bacterium. I want to perform a genome wide association study in order to find significant variants associated with any phenotype. Any suggested pipeline for this analysis? I would be happy if someone could address some tutorials here that I can study.

Cheers Arash

ADD COMMENTlink modified 2.7 years ago by Kevin Blighe60k • written 2.7 years ago by arash.iranzadeh198030
gravatar for Kevin Blighe
2.7 years ago by
Kevin Blighe60k
Kevin Blighe60k wrote:

I would normalise the VCF file and convert it to PLINK format as per my thread HERE, and then run standard association tests in PLINK. You will have to create the FAM file yourself in order to define your phenotypes. Pay close attention to my comment after the tutorial regarding the --indiv-sort command-line parameter - if you don't control the ordering when you read in your BCF file to PLINK, then PLINK Will distort your sample order without giving any warning, and your custom FAM file may then be out of sync.

I built an entire pipeline for this process but I'll let you do the piecing together. If you get your VCF file to the normalised BCF stage from my tutorial, which is not difficult, then you can do something like:

plink --noweb --bcf pseudomonas.bcf --keep-allele-order --indiv-sort file MySampleOrder.list --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out pseudomonas ;

#Clean the dataset by filtering out variants genotyped in just 12.5% individuals
#"To include only SNPs with a 95% genotyping rate (5% missing) use --geno 0.05
plink --noweb --bfile pseudomonas --geno 0.05 --make-bed --out pseudomonas.Clean ;

#Perform different statistical comparisons
plink --noweb --assoc --ci 0.95 --counts --bfile pseudomonas.Clean --fam MyCustom.fam --out pseudomonas.Clean ;

#Filter to select only significant variants (column #9), and/or those not found in controls (column #6)
awk '{if ($9<=0.05) print $0}' pseudomonas.Clean.assoc > pseudomonas.Clean.assoc.filt ;
awk '{if ($6==0 && $9<=0.05) print $0}' pseudomonas.Clean.assoc > pseudomonas.Clean.assoc.filt.NoControls ;

In PLINK, you can also do other tests, including logistic regression. Your sample numbers look pretty good.

ADD COMMENTlink written 2.7 years ago by Kevin Blighe60k

Thank you very much, seems useful.

ADD REPLYlink written 2.7 years ago by arash.iranzadeh198030

Kevin- if you sequenced a mixed population of bacteria that was selected for an improved phenotype (each sequencing read representing a separate individual)- is it possible to identify the significant variant associated with the phenotype? Is that what was done here?

ADD REPLYlink modified 10 months ago • written 10 months ago by kristina.mahan50

Hey, I am not sure, exactly. The person who posted this original question appeared to want to identify statistically significantly different variants / mutations between their 3 Streptococcus pneumoniae ('pneumococcus') groups. Which type of sequencing have you performed? - cDNA or DNA?

ADD REPLYlink written 10 months ago by Kevin Blighe60k

Illumina DNA sequencing - providing about 200x coverage. Hoping that since this is a mixed population that was selected for a specific improvement in phenotype- that the causative variant would be more dominant in the population. I have VCF files that have thousands of variants that I would like to narrow down further.

ADD REPLYlink written 10 months ago by kristina.mahan50

Hey, I am not quite sure. One would indeed expect the causative variant to have high penetrance in the population, but other 'passenger' variants likely also exist that have equal or greater penetrance. I am not sure of the best way to go further other than setting a cutoff for variant allele frequency and then attempting to provide functional interpretation on the variants. Prokaryotic bioinformatics is not exactly my area though - have you considered asking the question on Biology StackExchange, or even opening up a new question here on Biostars for other ideas?

ADD REPLYlink written 10 months ago by Kevin Blighe60k

okay I will do that!

ADD REPLYlink written 10 months ago by kristina.mahan50
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1267 users visited in the last hour