Question: Filter out genetic marker with Minor Allele Frequency
7 months ago, mab658 (20) wrote:

Genotype data for a GWAS are provided in 0, 1, 2 format, where individuals are rows and each column is a SNP. The minor allele frequency (MAF) is calculated as

MAF <- apply(geno, 2, function(x) sum(x) / (length(x) * 2))

where geno is a matrix object holding the genotype data. The calculated MAF ranged from 0.05087209 to 0.94912791. My threshold is to remove SNPs with a MAF below 5%. Am I right with this R script:

geno_filtered <- geno[, which(MAF > 0.05)]

2) Is it possible to know which of the SNP codes 0, 1, and 2 corresponds to the minor allele, based on the above calculation?

3) How do I test whether each SNP is in Hardy-Weinberg Equilibrium (HWE)? If a SNP is not in HWE, is it advisable to filter it out as well?


sequencing snp R gene • 701 views
  1. Why is this a Forum discussion and not a Question?
  2. Is this an assignment question?
written 7 months ago by RamRS (19k)

Thanks for calling my attention to this. You mean it is better to post my question as a Question type post, right? Maybe that is why I don't get many responses.

written 7 months ago by mab658 (20)

Your question was posted as a Forum type post, not a Question type post. See Biostar Forum Posting Guidelines for a primer on the types of posts.

You did get a response - how actively you follow up also determines the quality and frequency of help you'll get. Please read to better understand how open science forums work.

written 7 months ago by RamRS (19k)

7 months ago, zx8754 (6.1k) wrote:

A simpler version of the MAF function would be:

getMAF <- function(m) colMeans(m) / 2

Yes, it is possible to find out which code is the minor one by matching the frequency against the genotype counts, but this doesn't always work:

# dummy genotype with 3 SNPs
geno1 <- matrix(c(rep(0, 100), rep(1, 100), rep(2, 100),
                  rep(0, 10), rep(1, 140), rep(2, 150),
                  rep(0, 150), rep(1, 140), rep(2, 10)
                  ), ncol = 3)

# get maf
getMAF(geno1)
# [1] 0.5000000 0.7333333 0.2666667

# get counts
lapply(data.frame(geno1), table)
# $X1
# 0   1   2 
# 100 100 100 
# $X2
# 0   1   2 
# 10 140 150 
# $X3
# 0   1   2 
# 150 140  10

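In this example, the minor allele of SNP1 is impossible to determine because the frequency is exactly 50%. For SNP2 the computed frequency is 0.73, which is not really "minor": the counted allele is the major one, so we need to flip it (MAF = 1 - 0.73 = 0.27), and the counts show that the minor homozygote is 0. For SNP3 the frequency is 0.27, so the counted allele is the minor one, and from the counts we can see that the minor homozygote is 2.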
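The flip described above can be sketched in a few lines. This is only an illustration on the dummy geno1 matrix, assuming no other coding than 0/1/2; the names p, maf, minorIsCounted and keep are made up here, and the 5% cut-off is taken from the question.

```r
# dummy genotype with 3 SNPs (same as above)
geno1 <- matrix(c(rep(0, 100), rep(1, 100), rep(2, 100),
                  rep(0, 10), rep(1, 140), rep(2, 150),
                  rep(0, 150), rep(1, 140), rep(2, 10)
                  ), ncol = 3)

# frequency of the allele that the 0/1/2 code counts (missing values ignored)
p <- colMeans(geno1, na.rm = TRUE) / 2

# fold the frequency so that it never exceeds 0.5
maf <- pmin(p, 1 - p)
# [1] 0.5000000 0.2666667 0.2666667

# TRUE when the counted allele is the minor one (i.e. genotype 2 is the minor homozygote)
minorIsCounted <- p < 0.5
# [1] FALSE FALSE  TRUE

# keep SNPs with MAF of at least 5% (threshold assumed from the question)
keep <- maf >= 0.05
geno_filtered <- geno1[, keep, drop = FALSE]
```

Filtering on the folded value removes rare SNPs at both ends of the frequency range, whereas a filter on the unfolded frequency alone would keep a SNP whose counted allele has a frequency of, say, 0.97.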

Also, instead of re-inventing the wheel, try converting your data into the input format of existing tools, for example the R HardyWeinberg package.

Or convert to plink format, etc.
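For a quick look before converting anything, the HWE check asked about in the question could be sketched in base R as a plain chi-square goodness-of-fit test (1 degree of freedom, no continuity correction). This is not the HardyWeinberg package's or plink's implementation, and hweChisq is a hypothetical helper name used only for this example; an exact test from a dedicated tool is usually preferable when genotype counts are small.

```r
# Chi-square goodness-of-fit test for HWE on one SNP coded 0/1/2.
# hweChisq is a made-up helper, not a function from any package.
hweChisq <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  obs <- c(sum(x == 0), sum(x == 1), sum(x == 2))   # observed genotype counts
  p <- (obs[2] + 2 * obs[3]) / (2 * n)              # frequency of the counted allele
  # no test is possible for empty or monomorphic SNPs
  if (n == 0 || p == 0 || p == 1) return(c(chisq = NA, p.value = NA))
  expd <- n * c((1 - p)^2, 2 * p * (1 - p), p^2)    # expected counts under HWE
  chisq <- sum((obs - expd)^2 / expd)
  c(chisq = chisq, p.value = pchisq(chisq, df = 1, lower.tail = FALSE))
}

# dummy genotype with 3 SNPs (same as above)
geno1 <- matrix(c(rep(0, 100), rep(1, 100), rep(2, 100),
                  rep(0, 10), rep(1, 140), rep(2, 150),
                  rep(0, 150), rep(1, 140), rep(2, 10)
                  ), ncol = 3)

# one row per SNP, columns "chisq" and "p.value"
hwe <- t(apply(geno1, 2, hweChisq))
hwe
```

All three dummy SNPs deviate strongly from HWE here, which is expected because the genotype counts were made up rather than sampled from a population.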
