I'm wondering if anyone has come across anything like the following. I'm looking at gene-length bias in the DepMap RNA-seq counts matrix processed by the GTEx pipeline, and noticed a negative relationship between gene length and counts at high gene lengths. At low gene lengths, the expected gene-length bias is seen, with counts increasing with gene length. At high gene lengths (>50k), this relationship inverts and counts start decreasing with increasing gene length. Has anyone seen this before, or does anyone have ideas why this may be?
Thanks :)
Left is pre-normalisation w/ EDASeq, right is post-normalisation.
What do we see in the plots? What are "gene counts" (and on which scale are they, log2?), and what are the lines? Are they samples? Code for this would help.
Sorry, these were generated with EDASeq's biasPlot() function.
They are loess lines of log-counts against gene length. Yes, each line is an individual sample. Low-count genes are filtered with edgeR's filterByExpr() function.
If these are genomic lengths, you wouldn't expect longer genes to have more reads per se.
If these are transcript lengths, how many genes longer than 50k are you actually measuring? See a couple of reference examples here. For me, there are only a few genes that long, so naturally there's more variability, which could explain the differences in bias observed. I usually adjust for this with log transformation.
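A quick way to check how sparse the long end is would be to count genes per length bin before reading too much into the loess tail. The following is only a rough plain-Python sketch, not EDASeq or edgeR code: the function name `genes_per_length_bin`, the example lengths, and the bin edges are all made up for illustration, and you would substitute the lengths of the genes that survive your filtering.

```python
from bisect import bisect_right
from collections import Counter


def genes_per_length_bin(gene_lengths, bin_edges):
    """Count genes per length bin.

    bin_edges must be sorted ascending. Bin i covers
    [bin_edges[i-1], bin_edges[i]); the first bin starts at 0 and
    the last bin is open-ended.
    """
    # bisect_right returns the index of the bin each length falls into
    counts = Counter(bisect_right(bin_edges, length) for length in gene_lengths)
    result = []
    for i in range(len(bin_edges) + 1):
        lo = bin_edges[i - 1] if i > 0 else 0
        if i < len(bin_edges):
            label = f"{lo}-{bin_edges[i]}"
        else:
            label = f">={lo}"
        result.append((label, counts.get(i, 0)))
    return result


# Hypothetical gene lengths, for illustration only
example_lengths = [800, 1500, 2300, 4100, 7200, 12000, 48000, 51000, 90000]

for label, n in genes_per_length_bin(example_lengths, [1000, 10000, 50000]):
    print(label, n)
```

If the last bin holds only a handful of genes, the fitted curve in that region is driven by very few points, so an apparent inversion there may not mean much.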