Question: Extracting Cumulative (Rare) Allele Frequencies for a Subset of Variants From a VCF
1
gravatar for maegsul
8 months ago by
maegsul130
Belgium
maegsul130 wrote:

Hi,

I am conducting rare variant tests for predefined genomic units (such as genes) genome-wide, each containing a subset of variants from a multi-sample VCF. I used SKAT-O for that purpose in R, that functions reasonably well to output the rare variant test association p-values per genomic unit.

However it does not report the cumulative allele frequencies of tested group of variants within a unit (let's say) in patient and controls; so I need to go back to the VCF and do it manually myself to learn about the cumulative frequencies and the enrichment of rare variants (in what direction, with what effect size?)

I am trying to come up with a script to solve this, but I wasn't successful so far. Is there any script/tool that can you suggest for that purpose?

For example, what I want is something like below:

1) Look at my table of 1st column of Gene names and 2nd column of variant IDs, find the variants for "BRCA1"
2) Look for BRCA1 variants in the VCF in a subset of samples (e.g. cancer patients + healthy controls, all in the same VCF [for this phenotypic information, use my table that indicates a binary phenotype for each sample, such as PLINK .fam/.ped file])
3) Calculate the cumulative frequencies of these mutations for each group
4) Output these frequencies and an enrichment ratio between patients vs controls

It'd be great if I can do this for a group of genes in a loop -> outputting a table containing info for each.

Any suggestions?

ADD COMMENTlink modified 8 months ago by Pierre Lindenbaum118k • written 8 months ago by maegsul130
2
gravatar for Pierre Lindenbaum
8 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

1) Look at my table of 1st column of Gene names and 2nd column of variant IDs, find the variants for "BRCA1" 2) Look for BRCA1 variants in the VCF in a subset of samples (e.g. cancer patients + healthy)

I wrote a tool for to filter per gene: http://lindenb.github.io/jvarkit/VcfBurdenFilterGenes.html , Use 'bcftools view" before to quickly get the variants in that region

3) Calculate the cumulative frequencies of these mutations for each group

I wrote something like this: http://lindenb.github.io/jvarkit/VcfBurdenMAF.html (sorry, it's quite localized to my lab, there's not much documentation)

4) Output these frequencies

use bcftools query

ADD COMMENTlink written 8 months ago by Pierre Lindenbaum118k

Thanks a lot Pierre! Your code helped me, but what I was looking for was something that is a bit different.

I want to query the (rare) variants falling into a group (usually a gene name) from a .tsv file, not from SnpEff annotation directly. For these variants, I want to obtain cumulative MAF (so what's the frequency of the total alternative alleles of these variants in the total possible successful calls within a group [total n of variants x n of Patients (-minus "./." missing genotypes) x 2]). Then I will get the ratio of MAFcumulative_patients vs MAFcumulative_controls, for instance.

Can it be also done with your scripts + bcftools query?

ADD REPLYlink modified 8 months ago • written 8 months ago by maegsul130
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1914 users visited in the last hour