I am trying to perform a regression analysis of CNV genotype vs. clinical phenotype for a data set of cancer patients, and need to find out the best way to deal with the issue of CNV overlap.
The CNV data files typically provide chr, start position, stop position, and score for each "locus." The problem is that the start/stop of one "locus" may contain or overlap the start/stop of other loci to varying degrees. For instance, one locus may be Chr 1 start=1000, stop=10000, and another may be Chr 1 start=5000, stop=15000.
Presumably each "locus" listed corresponds to a probe, and treating each probe as an independent predictor variable doesn't make sense, because that would involve counting the same duplicated region multiple times with multiple overlapping probes.
Is there a canonical way around this problem, e.g. using "average" CNV scores weighted by proportion of overlap?
Thanks in advance for any advice and references on this matter.
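To make the "average weighted by proportion of overlap" idea concrete, here is a minimal Python sketch. It is not a canonical method, just one way to read the question: pick a fixed region and average the scores of all loci that overlap it, weighting each by the number of overlapping bases. The function names, the tuple layout (chr, start, stop, score), the half-open coordinate convention, and the example scores are all assumptions made for illustration.

```python
def overlap_len(a_start, a_stop, b_start, b_stop):
    """Number of bases shared by two intervals (half-open coordinates assumed)."""
    return max(0, min(a_stop, b_stop) - max(a_start, b_start))


def weighted_region_score(region, loci):
    """Overlap-weighted mean score of the loci covering `region`.

    region: (chrom, start, stop)
    loci:   iterable of (chrom, start, stop, score)
    Returns None when no locus overlaps the region.
    """
    chrom, r_start, r_stop = region
    total_weight = 0
    weighted_sum = 0.0
    for l_chrom, l_start, l_stop, score in loci:
        if l_chrom != chrom:
            continue
        w = overlap_len(r_start, r_stop, l_start, l_stop)
        total_weight += w
        weighted_sum += w * score
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


# The two overlapping loci from the question, with made-up scores.
loci = [
    ("chr1", 1000, 10000, 0.5),
    ("chr1", 5000, 15000, -0.2),
]
whole = weighted_region_score(("chr1", 0, 20000), loci)
left_only = weighted_region_score(("chr1", 1000, 5000), loci)
missing = weighted_region_score(("chr2", 0, 20000), loci)
```

With one score per region (for example per gene or per fixed-size bin) each patient contributes a single predictor per region, so the shared 5000-10000 stretch is not counted twice in the regression.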
I am trying to predict CNVs for three rice genomes using three different tools: Pindel, CNVnator, and BreakDancer. If the CNVs reported by two of the three tools overlap, should we take only the overlapping region for the wet-lab study, or the full span from the smallest start coordinate to the largest stop coordinate?
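The two options in this question (overlapping region only vs. smallest start to largest stop) are the intersection and the union of the two calls. The Python sketch below computes both, plus the reciprocal overlap fraction, so they can be compared before choosing. The function name `compare_calls`, the (start, stop) tuple layout, and the half-open coordinates are assumptions for illustration; the sketch does not say which option is right for validation.

```python
def compare_calls(a, b):
    """Compare two CNV calls on the same chromosome.

    a, b: (start, stop) tuples, half-open coordinates assumed.
    Returns None if the calls do not overlap, otherwise a dict with the
    intersection, the union span, and the reciprocal overlap fraction
    (shared bases divided by the length of the longer call).
    """
    start_i, stop_i = max(a[0], b[0]), min(a[1], b[1])
    if start_i >= stop_i:
        return None
    shared = stop_i - start_i
    longest = max(a[1] - a[0], b[1] - b[0])
    return {
        "intersection": (start_i, stop_i),
        "union": (min(a[0], b[0]), max(a[1], b[1])),
        "reciprocal_overlap": shared / longest,
    }


# Hypothetical calls from two tools for the same event.
result = compare_calls((1000, 10000), (5000, 15000))
no_overlap = compare_calls((1000, 2000), (3000, 4000))
```

A low reciprocal overlap suggests the two tools may not be describing the same event, in which case neither the intersection nor the union is an obviously safe target.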