After noticing some spurious results in my data set, I would like to ask this expert community for help with choosing the right normalization approach for my data.
I have two groups, patients and healthy controls, in which microbiota OTUs were measured from biopsies. Quality filtering was performed with the SDM software in the LotuS pipeline, using the default criteria adapted to the 454 sequencing platform. High-quality and mid-quality sequences were mapped to count the occurrence of each OTU in each sample, and clustering was done with UPARSE. The OTU sequences were then taxonomically assigned using the Greengenes database (3.8, August 2013) and the RDP II database (release version 11).
Now I want to correlate these data with host mRNA expression, preferably using Spearman's rank correlation.
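To make the planned analysis concrete, here is a minimal sketch of the correlation I have in mind for one OTU against one gene across samples. It is plain Python with no libraries; the function names and the toy numbers are made up for illustration and are not my real data.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho = Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy values: one OTU and one gene measured in five samples.
otu = [0, 0, 5, 12, 3]
gene = [1.2, 0.8, 2.5, 4.1, 2.0]
rho = spearman(otu, gene)
```

Because the ranks are taken across samples, any per-sample normalization can change them, which is why the choice of normalization matters here. An OTU that is constant across samples (for example all zeros) has no defined rank correlation, so this sketch returns NaN for it.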
The default procedure in my lab is to normalize for sequencing depth by calculating ratios, but I don't think ratios are the ideal way to test my hypothesis, so I'm looking into more useful alternatives. I also have quite a number of columns that either sum to zero or have very low variance, so just calculating ratios might amplify noise disproportionately.
Of all the options out there, I think DESeq2, TMM, cumulative sum scaling, or simply scaling down to the read count of the smallest sample (multiplying all entries of a sample by (#reads in smallest sample)/(#reads in this sample)) would be best.
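To be clear about the last option, this is a minimal sketch of what I mean, in plain Python. The data layout (a dict from sample name to a list of OTU counts), the function name, and the toy counts are assumptions for illustration. Note that this is a deterministic rescaling, not random rarefaction, so the results are generally non-integer.

```python
def scale_to_smallest(counts):
    """Scale each sample by (#reads in smallest sample) / (#reads in this sample).

    counts: dict mapping sample name -> list of OTU counts (hypothetical layout).
    Samples with zero reads are ignored when finding the smallest depth
    and are returned as all zeros.
    """
    depths = {sample: sum(values) for sample, values in counts.items()}
    min_depth = min(d for d in depths.values() if d > 0)
    scaled = {}
    for sample, values in counts.items():
        if depths[sample] == 0:
            scaled[sample] = [0.0] * len(values)
        else:
            factor = min_depth / depths[sample]
            scaled[sample] = [c * factor for c in values]
    return scaled


# Hypothetical toy counts: two samples, three OTUs.
counts = {"patient_1": [10, 0, 30], "control_1": [40, 20, 20]}
scaled = scale_to_smallest(counts)
```

After this step every non-empty sample has the same total, equal to the depth of the smallest sample.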
The thing is that we have a very low number of observations (around 30 per group), given the difficulty of obtaining these samples, so I'm a bit hesitant about DESeq2.
Any input regarding this question would be highly appreciated, thanks in advance!