As part of my RNA-Seq project analysis I wrote a program to organize a gene expression table for subsequent analyses. I wrote it because the cufflinks/cuffnorm output (1) amalgamated some separate genes lying in close proximity to each other into one XLOC id, and (2) listed some genes more than once, with a different XLOC id at each occurrence despite the same gene name. I suspect the reason is that the same gene may have different TSSs and hence gets different XLOC ids. The end result is that the same gene has a different FPKM expression value under each XLOC id.
The program I wrote (1) separates genes sharing the same XLOC id and assigns to each of them the original FPKM reported by cuffnorm for that XLOC, and (2) identifies genes that occur multiple times and averages their expression values across the different XLOC ids. And here is my crucial question:
I noticed that these FPKM expressions can vary significantly between XLOCs. So for example imagine this situation:
gene_id   gene_name   sample 1   sample 2
XLOC_1    funnyGen    20         1
XLOC_3    funnyGen    2          1
Now, if we average the data the way my program does, we end up with:
gene_name   sample 1   sample 2
funnyGen    11         1
It seems to me the data could be significantly skewed. Is your advice then to average the FPKM values, or perhaps simply to add them together? The latter may be better in the kind of scenario described above, but averaging may be better in others. Hence I'm undecided which option to go for; I need one that I can apply to the whole dataset.
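To make the two options concrete, here is a minimal Python sketch of the collapsing step. It is not my actual program: the function name `collapse_by_gene`, the in-memory row format, and the assumption that an XLOC merging several genes lists their names comma-separated in gene_name are all just for illustration.

```python
from collections import defaultdict
from statistics import mean


def collapse_by_gene(rows, combine=mean):
    """Collapse per-XLOC FPKM rows into one row per gene name.

    rows: iterable of (xloc_id, gene_name, [FPKM per sample]).
    combine: function applied per sample to the values of all XLOCs
             carrying the same gene name (e.g. mean or sum).
    """
    per_gene = defaultdict(list)
    for xloc_id, gene_name, fpkms in rows:
        # Assumption: a merged XLOC lists its genes comma-separated.
        # Each gene inherits the FPKM values reported for that XLOC.
        for name in gene_name.split(","):
            per_gene[name.strip()].append(fpkms)
    # zip(*vectors) regroups the values sample by sample.
    return {
        name: [combine(column) for column in zip(*vectors)]
        for name, vectors in per_gene.items()
    }


# The example table from above.
rows = [
    ("XLOC_1", "funnyGen", [20, 1]),
    ("XLOC_3", "funnyGen", [2, 1]),
]

averaged = collapse_by_gene(rows, combine=mean)
summed = collapse_by_gene(rows, combine=sum)
print("mean:", averaged)
print("sum: ", summed)
```

For funnyGen, averaging gives 11 and 1 for the two samples, while adding gives 22 and 2. Passing the combining function as a parameter means the same rule is applied to the whole dataset, whichever option turns out to be the right one.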