Question: Running 1.5M potentially different generalized linear models depending on distribution of read depth information to study CNV
4.4 years ago, Vincent Laufer (United States) wrote:

Background (don't need help on these sections, yet): I have read depth information on ~300 whole genomes. I am aware of many pitfalls of analyzing read depth as a proxy for CNV and have taken many steps to obtain quality-controlled read depth information that I am ready to analyze.

I want to test for associations between this standardized, QCed read depth and my phenotype of interest in a covariate-controlled analysis.

However, I have been looking at the distribution of read depth by window. Pooled across windows, read depth follows one overall distribution, but within individual windows the distributions differ from one window to the next, sometimes greatly.

If every window followed the same distribution, I could, for instance, run a Poisson regression 1.5M times and be done. They do not, so the generalized linear model I select should possibly change from window to window to maximize power for each one.
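To make the idea concrete, here is a minimal sketch of per-window model selection. It assumes raw integer read counts per window (not standardized values), ignores covariates, and uses only the variance-to-mean ratio to decide between a Poisson and a negative binomial family. The function name `choose_family`, the cutoff of 1.5, and the toy windows are all hypothetical and chosen only for illustration.

```python
from statistics import mean, variance


def choose_family(counts, overdispersion_cutoff=1.5):
    """Suggest a GLM family for one window from its variance-to-mean ratio.

    The cutoff is an arbitrary illustrative value, not a recommended threshold.
    """
    m = mean(counts)
    if m == 0:
        return "skip"  # no reads in any sample, nothing to model
    ratio = variance(counts) / m  # roughly 1 if the counts are Poisson-like
    if ratio > overdispersion_cutoff:
        return "negative_binomial"  # variance clearly exceeds the mean
    return "poisson"


# Hypothetical toy data: window id -> read counts across samples
windows = {
    "w1": [6, 14, 10, 7, 13, 10, 12, 8],
    "w2": [2, 40, 5, 60, 1, 33, 7, 50],
    "w3": [0, 0, 0, 0, 0, 0, 0, 0],
}

families = {name: choose_family(counts) for name, counts in windows.items()}
print(families)
```

The chosen label could then decide which family is passed to the GLM routine for that window. A fuller version would probably check dispersion on the residuals of a covariate-adjusted fit rather than on the raw counts.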

Does anyone have experience automating the process of model fitting? Or is this inappropriate? Another method would of course be to use nonparametric analysis, but then I lose potentially very interesting information on the distribution of a given window.

4.4 years ago, Zev.Kronenberg (United States) wrote:

If this is a learning experience, ignore the following advice.

There are many CNV callers that will model the read depth:

- Genome STRiP
- CNVkit
- WaveCNV
- cn.mops
- ...

I would suggest trying a published tool.

Otherwise, the associated publications should help you figure out which model you want to use.

Thanks, Zev! It's both. I will read and mess around with these.

