Question

RSEM/edgeR/Trinity: Unsorted raw matrix produce nonsense rpkm() results?

0

Entering edit mode

20 months ago

bioinfo2345 ▴ 40

When I run RSEM and then join together the different .isoforms.results files with per sample expression counts into a raw counts matrix (using Trinity script abundance_estimates_to_matrix.pl), this matrix is unsorted and does not carry over the same sorting as the isoforms.results files.

For instance, a raw counts matrix I have starts with the following:

   control1   control2   control3   control4   exp1   exp2   exp3   exp4
kf_t00010653-RA 28.00   31.00   34.00   33.00   92.00   100.00  75.00   97.00
kf_t00014892-RA 0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
kf_t00015024-RA 14.45   6.00    6.00    0.00    4.00    2.00    0.00    1.00

The RSEM *isoforms.results file has this sorting:

transcript_id   gene_id length  effective_length        expected_count  TPM     FPKM    IsoPct
kf_t00000001-RA kf_g00000001    5182    4933.94 191.91  5.86    5.34    100.00
kf_t00000002-RA kf_g00000002    2390    2141.94 125.88  8.85    8.07    100.00
kf_t00000003-RA kf_g00000003    654     405.97  0.00    0.00    0.00    0.00

This means that if you casually import the gene lengths into the DGEList object in edgeR like this (length is a table generated from the above Trinity script containing lengths in the column length):

y <- DGEList(counts=data.set, group=treatment)
y$genes = data.frame(length$length)

Does this mean that the lengths for kf_t00000001-RA will be used for kf_t00010653-RA, leading to nonsense results later on when rpkm() from EdgeR is used with transcript lengths?

After all, y$genes is just a list of lengths with no labels attached to them with the first length in the vector being "5182" but the first transcript in the y$counts is "kf_t00010653-RA".

I have tried sorting the raw counts matrix before putting it into a DGEList object like this:

data.set <- data.set[order(row.names(data.set)), ]

so that the y$counts matrix has the same order as the file used to extract gene lists. Then the lengths should match the correct transcript.

Does this look like a viable solution?

RSEM edgeR Trinity • 354 views

ADD COMMENT • link 20 months ago by bioinfo2345 ▴ 40