I have a question involving assemblies and checking the quality of the contigs in those assemblies.
The data I have consist of paired-end reads, which I merged using PEAR. The reads that PEAR left unassembled because they did not overlap enough were fed to SOAPdenovo for assembly, and the two assemblies were then merged. This gave me a large set of contigs, and to check their quality I mapped my reads back to them. The data then look something like this:
The distribution looks something like this
On the left is the distribution including the contig that has 3000x coverage; on the right I used a boxplot to find outliers and removed them. Contig 7 is clearly an outlier, but the way I removed it is not really appropriate: the boxplot rule works best for roughly symmetric, approximately normal data, and this looks more like a log-normal distribution.
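A minimal sketch of the boxplot step described above, for illustration only. The coverage values are simulated stand-ins (not the real data), `cov_demo` is a hypothetical name, and applying the same rule to log-transformed coverage is an assumption about what might suit right-skewed data rather than an established recipe:

```r
# Simulated stand-in for per-contig coverage (NOT real data):
# 50 roughly log-normal values plus one extreme contig at 3000x.
set.seed(1)
cov_demo <- c(rlnorm(50, meanlog = 3, sdlog = 0.5), 3000)

# Tukey's boxplot rule on the raw scale:
# points more than 1.5 * IQR beyond the hinges are flagged.
out_raw <- boxplot.stats(cov_demo)$out

# The same rule after a log transform, which makes a
# right-skewed sample more symmetric before flagging.
log_cov <- log(cov_demo)
out_log <- cov_demo[log_cov %in% boxplot.stats(log_cov)$out]

length(out_raw)  # number of contigs flagged on the raw scale
length(out_log)  # number of contigs flagged on the log scale
```

Comparing the two counts gives a feel for how many contigs are flagged only because of the skew of the raw scale.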
I have tried using fitdistr() (from the MASS package) to find the exact type of distribution, but it does not produce a definitive answer.
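A rough sketch of the kind of fitdistr() comparison meant here, assuming three candidate distributions and AIC as the yardstick. The data are simulated stand-ins (not the real coverage values) and `cov_demo` is a hypothetical name:

```r
library(MASS)  # provides fitdistr()

# Simulated stand-in for per-contig coverage (NOT real data).
set.seed(1)
cov_demo <- rlnorm(200, meanlog = 3, sdlog = 0.5)

# Maximum-likelihood fits of three candidate distributions.
fits <- list(
  normal    = fitdistr(cov_demo, "normal"),
  lognormal = fitdistr(cov_demo, "lognormal"),
  gamma     = fitdistr(cov_demo, "gamma")
)

# AIC for each fit; lower is better, but a small gap
# between candidates is not a definitive answer.
aic <- sapply(fits, AIC)
sort(aic)
```

When the gamma and log-normal fits score close together, as they often do for right-skewed data, this comparison alone does not settle which distribution to assume.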
Our statistician also suggested fitting generalized linear models with different families and then dividing the deviance by the degrees of freedom for each one. The family whose ratio is closest to 1 would be the distribution to use, which in turn determines the type of outlier detection I need. However, I can't get this to work properly (see the example below):
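# Intercept-only GLMs of coverage under three candidate families, all with a log link
glm1 <- glm(contig_coverage ~ 1, data = data_set, family = gaussian(link = "log"))
glm2 <- glm(contig_coverage ~ 1, data = data_set, family = poisson(link = "log"))
glm3 <- glm(contig_coverage ~ 1, data = data_set, family = Gamma(link = "log"))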
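A minimal sketch of how the deviance / degrees-of-freedom ratio could be computed for the three models, assuming `data_set` is a data frame with a numeric `contig_coverage` column. The data frame below is simulated (not the real data) and is rounded to whole numbers only so that the Poisson fit sees counts:

```r
# Simulated stand-in for the real table (NOT real data); the column
# name contig_coverage matches the one used in the models.
set.seed(1)
data_set <- data.frame(
  contig_coverage = round(rlnorm(200, meanlog = 3, sdlog = 0.5))
)

# Intercept-only fits under the three candidate families, log link.
fits <- list(
  gaussian = glm(contig_coverage ~ 1, data = data_set,
                 family = gaussian(link = "log")),
  poisson  = glm(contig_coverage ~ 1, data = data_set,
                 family = poisson(link = "log")),
  gamma    = glm(contig_coverage ~ 1, data = data_set,
                 family = Gamma(link = "log"))
)

# Residual deviance divided by residual degrees of freedom per fit.
ratio <- sapply(fits, function(m) deviance(m) / df.residual(m))
ratio
```

One caveat when reading the output: the Gaussian deviance is the residual sum of squares, so its ratio depends on the units of coverage and is not directly comparable to the Poisson and Gamma ratios.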
Can anybody help me with a rule/method/function in R with which I can find out which contigs have improbable coverage, so that I can remove them from the final assembly?