Question

Some questions about bayseq DE

0

Entering edit mode

7.6 years ago

Rik Verdonck ▴ 50

Hello everybody,

I'm currently analyzing a set of 30 samples (6 conditions, 5 replicates each) in bioconductor bayseq. I mapped the 30 sets of +- 20M 50bp SE illumina reads against 23321 reference sequences that were pre-selected from a de-novo assembled transcriptome. This selection of contigs was based on a number of criteria, including whether or not my RNAseq reads actually mapped onto them. Here comes my first question: I applied different selection criteria depending on whether I had annotation info or not: I kept any transcript that got reliable annotation info and some mappings, while for the transcripts that I did not manage to annotate, I required a lot more mappings to "confirm" that they were interesting. This yielded a bimodal distribution in my counts / TPM (that I got with the combo bowtie2/RSEM). See this picture

First question A: would this bimodal distribution be a problem for BaySeq?

First question B: I'm stuck with two deviant distributions (the red lines), and even when I disregard those, I still have quite some "spread" on the count distributions. I would like to try keeping the red ones on board if possible, but if they would hamper my inferences I will kick them out. Now, does anybody here have experience with some tougher between-sample normalizations before moving on with normal BaySeq? For example, I could do something like:

library(limma)
mydat <-log2(mydat+1)
foo   <-normalizeBetweenArrays(mydat,method="quantile")
foo   <-round((2**foo)-1)

This operation really sticks all densities on top of each other, but is probably a bit over the top. However, something like the cyclicloess transformation seems like a reasonable candidate. Any suggestions about this strategy? Am I being stupid or should I give it a try? I guess the rounding is a bit dangerous for low counts?

My second question goes somewhat more into the technicalities of BaySeq. I made a small toy example where I only compare two conditions. After estimating the priors, I made the following plot:

CD <- getPriors.NB(CD, samplesize = 10000, estimation = "QL", cl = cl)
plot(log(CD@priors$priors$DE[[1]][,1]),log(CD@priors$priors$DE[[1]][,2]))

I get the following picture

Can anybody tell me exactly what I'm seeing here? I guess these are parameter estimates for two parameters of the neg. binom. prior distribution, per contig. Right? Why does it show such a strong grouping? Any suggestions? By the way, the priors for the other condition (NDE) look virtually the same.

Well, hope to see some interesting suggestions/ discussion :) Thanks in advance! Rik

RNA-Seq R next-gen bayseq bioconductor • 1.9k views

ADD COMMENT • link updated 7.6 years ago by tjh48 ▴ 20 • written 7.6 years ago by Rik Verdonck ▴ 50

score 2 · Answer 1 · 2016-10-04

Hi Rik, thanks for your interest in baySeq.

To answer your questions: A: No, that shouldn't present a problem for baySeq. The way posterior likelihoods are estimated based on the empirically determined priors, it is the local behaviour of gene expression that has the most effect on the estimated likelihood, i.e., evaluation of likelihood of differential expression in genes in the left peak will mostly be based on observations of expression in the left peak. You might observe a slight tendency to assign higher weight to differential expression that crosses the peaks, all other things being equal, but I don't think this is any cause for concern.

B: To me, the red lines look shifted by a more or less constant amount - have you tried just applying a library scaling factor correction to the plot? There's no technical reason why you can't apply a strong normalisation and run baySeq on the rounded values you get, but to me those densities actually look pretty similar - I'd be happy to run baySeq with only the built-in library scaling correction on those data based on that plot.

For your second question, yes, those are the rescaled mean expression and dispersion estimates of a negative binomial, for a single replicate condition. The grouping you observe is primarily a split on the dispersion parameter (y-axis). This occurs because for some data, there is effectively no (observable) dispersion, that is, no additional variation above Poisson. This occurs more commonly for lowly expressed genes, since for these genes there aren't enough observations to quantify the dispersion - e.g., if the observed gene counts are 1,1,1,1 then there is no (estimable) dispersion, and this is more likely to occur than an observation of 1000, 1000, 1000, 1000, so the cluster forms in the bottom left quadrant.

Hope that's all clear - please let me know if you have any further questions.

Tom