Extracting Cumulative (Rare) Allele Frequencies for a Subset of Variants From a VCF
1
1
Entering edit mode
5.8 years ago
maegsul ▴ 170

Hi,

I am conducting rare variant tests for predefined genomic units (such as genes) genome-wide, each containing a subset of variants from a multi-sample VCF. I used SKAT-O for that purpose in R, that functions reasonably well to output the rare variant test association p-values per genomic unit.

However it does not report the cumulative allele frequencies of tested group of variants within a unit (let's say) in patient and controls; so I need to go back to the VCF and do it manually myself to learn about the cumulative frequencies and the enrichment of rare variants (in what direction, with what effect size?)

I am trying to come up with a script to solve this, but I wasn't successful so far. Is there any script/tool that can you suggest for that purpose?

For example, what I want is something like below:

1) Look at my table of 1st column of Gene names and 2nd column of variant IDs, find the variants for "BRCA1"
2) Look for BRCA1 variants in the VCF in a subset of samples (e.g. cancer patients + healthy controls, all in the same VCF [for this phenotypic information, use my table that indicates a binary phenotype for each sample, such as PLINK .fam/.ped file])
3) Calculate the cumulative frequencies of these mutations for each group
4) Output these frequencies and an enrichment ratio between patients vs controls

It'd be great if I can do this for a group of genes in a loop -> outputting a table containing info for each.

Any suggestions?

vcf filter rare variant tests enrichment • 1.9k views
ADD COMMENT
2
Entering edit mode
5.8 years ago

1) Look at my table of 1st column of Gene names and 2nd column of variant IDs, find the variants for "BRCA1" 2) Look for BRCA1 variants in the VCF in a subset of samples (e.g. cancer patients + healthy)

I wrote a tool for to filter per gene: http://lindenb.github.io/jvarkit/VcfBurdenFilterGenes.html , Use 'bcftools view" before to quickly get the variants in that region

3) Calculate the cumulative frequencies of these mutations for each group

I wrote something like this: http://lindenb.github.io/jvarkit/VcfBurdenMAF.html (sorry, it's quite localized to my lab, there's not much documentation)

4) Output these frequencies

use bcftools query

ADD COMMENT
0
Entering edit mode

Thanks a lot Pierre! Your code helped me, but what I was looking for was something that is a bit different.

I want to query the (rare) variants falling into a group (usually a gene name) from a .tsv file, not from SnpEff annotation directly. For these variants, I want to obtain cumulative MAF (so what's the frequency of the total alternative alleles of these variants in the total possible successful calls within a group [total n of variants x n of Patients (-minus "./." missing genotypes) x 2]). Then I will get the ratio of MAFcumulative_patients vs MAFcumulative_controls, for instance.

Can it be also done with your scripts + bcftools query?

ADD REPLY

Login before adding your answer.

Traffic: 2571 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6