Question

Extracting Cumulative (Rare) Allele Frequencies for a Subset of Variants From a VCF

1

Entering edit mode

5.8 years ago

maegsul ▴ 170

Hi,

I am conducting rare variant tests for predefined genomic units (such as genes) genome-wide, each containing a subset of variants from a multi-sample VCF. I used SKAT-O for that purpose in R, that functions reasonably well to output the rare variant test association p-values per genomic unit.

However it does not report the cumulative allele frequencies of tested group of variants within a unit (let's say) in patient and controls; so I need to go back to the VCF and do it manually myself to learn about the cumulative frequencies and the enrichment of rare variants (in what direction, with what effect size?)

I am trying to come up with a script to solve this, but I wasn't successful so far. Is there any script/tool that can you suggest for that purpose?

For example, what I want is something like below:

1) Look at my table of 1st column of Gene names and 2nd column of variant IDs, find the variants for "BRCA1"
2) Look for BRCA1 variants in the VCF in a subset of samples (e.g. cancer patients + healthy controls, all in the same VCF [for this phenotypic information, use my table that indicates a binary phenotype for each sample, such as PLINK .fam/.ped file])
3) Calculate the cumulative frequencies of these mutations for each group
4) Output these frequencies and an enrichment ratio between patients vs controls

It'd be great if I can do this for a group of genes in a loop -> outputting a table containing info for each.

Any suggestions?

vcf filter rare variant tests enrichment • 1.9k views

ADD COMMENT • link updated 5.8 years ago by Pierre Lindenbaum 161k • written 5.8 years ago by maegsul ▴ 170

score 2 · Answer 1 · 2018-07-06

2

Entering edit mode

5.8 years ago

Pierre Lindenbaum 161k

1) Look at my table of 1st column of Gene names and 2nd column of variant IDs, find the variants for "BRCA1" 2) Look for BRCA1 variants in the VCF in a subset of samples (e.g. cancer patients + healthy)

I wrote a tool for to filter per gene: http://lindenb.github.io/jvarkit/VcfBurdenFilterGenes.html , Use 'bcftools view" before to quickly get the variants in that region

3) Calculate the cumulative frequencies of these mutations for each group

I wrote something like this: http://lindenb.github.io/jvarkit/VcfBurdenMAF.html (sorry, it's quite localized to my lab, there's not much documentation)

4) Output these frequencies

use bcftools query

ADD COMMENT • link 5.8 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thanks a lot Pierre! Your code helped me, but what I was looking for was something that is a bit different.

I want to query the (rare) variants falling into a group (usually a gene name) from a .tsv file, not from SnpEff annotation directly. For these variants, I want to obtain cumulative MAF (so what's the frequency of the total alternative alleles of these variants in the total possible successful calls within a group [total n of variants x n of Patients (-minus "./." missing genotypes) x 2]). Then I will get the ratio of MAFcumulative_patients vs MAFcumulative_controls, for instance.

Can it be also done with your scripts + bcftools query?

ADD REPLY • link 5.8 years ago by maegsul ▴ 170