I'm currently analyzing a set of 30 samples (6 conditions, 5 replicates each) in bioconductor bayseq. I mapped the 30 sets of +- 20M 50bp SE illumina reads against 23321 reference sequences that were pre-selected from a de-novo assembled transcriptome. This selection of contigs was based on a number of criteria, including whether or not my RNAseq reads actually mapped onto them. Here comes my first question: I applied different selection criteria depending on whether I had annotation info or not: I kept any transcript that got reliable annotation info and some mappings, while for the transcripts that I did not manage to annotate, I required a lot more mappings to "confirm" that they were interesting. This yielded a bimodal distribution in my counts / TPM (that I got with the combo bowtie2/RSEM). See this picture
First question A: would this bimodal distribution be a problem for BaySeq?
First question B: I'm stuck with two deviant distributions (the red lines), and even when I disregard those, I still have quite some "spread" on the count distributions. I would like to try keeping the red ones on board if possible, but if they would hamper my inferences I will kick them out. Now, does anybody here have experience with some tougher between-sample normalizations before moving on with normal BaySeq? For example, I could do something like:
library(limma) mydat <-log2(mydat+1) foo <-normalizeBetweenArrays(mydat,method="quantile") foo <-round((2**foo)-1)
This operation really sticks all densities on top of each other, but is probably a bit over the top. However, something like the cyclicloess transformation seems like a reasonable candidate. Any suggestions about this strategy? Am I being stupid or should I give it a try? I guess the rounding is a bit dangerous for low counts?
My second question goes somewhat more into the technicalities of BaySeq. I made a small toy example where I only compare two conditions. After estimating the priors, I made the following plot:
CD <- getPriors.NB(CD, samplesize = 10000, estimation = "QL", cl = cl) plot(log(CD@priors$priors$DE[][,1]),log(CD@priors$priors$DE[][,2]))
I get the following picture
Can anybody tell me exactly what I'm seeing here? I guess these are parameter estimates for two parameters of the neg. binom. prior distribution, per contig. Right? Why does it show such a strong grouping? Any suggestions? By the way, the priors for the other condition (NDE) look virtually the same.
Well, hope to see some interesting suggestions/ discussion :) Thanks in advance! Rik