Hi everyone,
I'm looking for a GWA algorithm for copy number variation (CNV) data.
1) Data
A reference-based collection of several DNA segments (e.g. genes) that occur a different number of times in each sample of my dataset (A. thaliana). I'm relatively open to data formats since I have all the information needed to convert it. In the end it's something like:
         seg1  seg2  seg3
sample1  0     2     1
sample2  5     1     3
2) Knowledge
I've done GWAS in the past and normally used GEMMA or EMMA. These algorithms are fast enough for my small sample size (~100-1000) and gave good results. GEMMA and other GWAS methods use the PLINK BED format, which encodes biallelic genotypes.
3) Workaround
I'm aware that I could trick "normal" GWAS methods by recoding each segment as a binary variable and comparing, for example:
#seg1==0 vs #seg1>0
#seg1<2 vs #seg1>=2
OR
#seg1==2 vs #seg1!=2
This would mean testing every possible split for every segment, and I'm not sure it would give me the right answer.
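The recoding workaround above can be sketched in a few lines of Python. This is only an illustration using the toy matrix from section 1; the function name `binarize_copy_numbers` and the choice of thresholds are assumptions, not part of any GWAS tool.

```python
def binarize_copy_numbers(counts, thresholds):
    """Return one 0/1 matrix per threshold t, where 1 means 'copy number >= t'.

    counts: list of rows (one per sample), each a list of copy numbers per segment.
    """
    return {
        t: [[1 if c >= t else 0 for c in row] for row in counts]
        for t in thresholds
    }

# Toy data from the post: rows are sample1, sample2; columns are seg1, seg2, seg3.
counts = [
    [0, 2, 1],
    [5, 1, 3],
]

# Threshold 1 is "absent vs present"; threshold 2 is "<2 vs >=2".
encodings = binarize_copy_numbers(counts, thresholds=[1, 2])
for t, matrix in encodings.items():
    print(f"copy number >= {t}: {matrix}")
```

Each threshold turns every segment into a separate binary marker, so the number of tests (and the multiple-testing burden) grows with the number of thresholds tried.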
4) What I'm looking for:
A GWAS algorithm that handles more than binary presence/absence. For example, I want to know whether having 2 copies of a segment has a significant effect on the phenotype. Does anyone know a suitable method for this problem?
Hand in hand with this question: Why do we "ignore" alleles with low frequencies? Are these not important?
Thanks, Sebastian
Are you using an LMM? If you're not taking the kinship matrix into account (or are willing to leave it out), you can just treat it as a linear model and solve it in R.
GEMMA can accept the BIMBAM file format, which is basically a flat file in which you can specify 0/1/2 as the genotype. I'm not sure whether values above 2 are an option, but maybe it's enough for your needs.
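The plain linear model suggested in this reply (no kinship matrix) can be sketched as a per-segment regression of the phenotype on the raw copy number. The reply proposes R; the sketch below uses Python only for illustration. The function name `copy_number_association` and the toy phenotype values are made up, and because this ignores population structure it is not a substitute for an LMM.

```python
import math

def copy_number_association(copies, phenotype):
    """Ordinary least squares of phenotype on copy number for one segment.

    Returns (slope, t_statistic). The t statistic has n - 2 degrees of freedom.
    """
    n = len(copies)
    mean_x = sum(copies) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in copies)
    if n < 3 or sxx == 0:
        # Too few samples, or the segment has the same copy number everywhere.
        return float("nan"), float("nan")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(copies, phenotype))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(copies, phenotype))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    t_stat = slope / std_err if std_err > 0 else math.copysign(float("inf"), slope)
    return slope, t_stat

# Hypothetical toy data: copy number of one segment in five samples, and a phenotype.
copies = [0, 1, 2, 3, 5]
phenotype = [1.0, 2.1, 2.9, 4.2, 6.0]

slope, t_stat = copy_number_association(copies, phenotype)
print(f"slope per extra copy: {slope:.3f}, t statistic: {t_stat:.2f}")
```

Treating copy number as a numeric predictor assumes each additional copy shifts the phenotype by the same amount; to test a specific state such as "exactly 2 copies", the copy number would have to be coded as a categorical variable instead.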
Hi, thank you for your reply! I normally use an LMM, since population structure has an effect in A. thaliana, but I will try this to see if it solves the problem for now. Unfortunately, most segments in my CNV dataset have more than 3 possible copy numbers...