RSEM/edgeR/Trinity: Unsorted raw matrix produce nonsense rpkm() results?
0
0
Entering edit mode
20 months ago
bioinfo2345 ▴ 40

When I run RSEM and then join together the different .isoforms.results files with per sample expression counts into a raw counts matrix (using Trinity script abundance_estimates_to_matrix.pl), this matrix is unsorted and does not carry over the same sorting as the isoforms.results files.

For instance, a raw counts matrix I have starts with the following:

   control1   control2   control3   control4   exp1   exp2   exp3   exp4
kf_t00010653-RA 28.00   31.00   34.00   33.00   92.00   100.00  75.00   97.00
kf_t00014892-RA 0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
kf_t00015024-RA 14.45   6.00    6.00    0.00    4.00    2.00    0.00    1.00

The RSEM *isoforms.results file has this sorting:

transcript_id   gene_id length  effective_length        expected_count  TPM     FPKM    IsoPct
kf_t00000001-RA kf_g00000001    5182    4933.94 191.91  5.86    5.34    100.00
kf_t00000002-RA kf_g00000002    2390    2141.94 125.88  8.85    8.07    100.00
kf_t00000003-RA kf_g00000003    654     405.97  0.00    0.00    0.00    0.00

This means that if you casually import the gene lengths into the DGEList object in edgeR like this (length is a table generated from the above Trinity script containing lengths in the column length):

y <- DGEList(counts=data.set, group=treatment)
y$genes = data.frame(length$length)

Does this mean that the lengths for kf_t00000001-RA will be used for kf_t00010653-RA, leading to nonsense results later on when rpkm() from EdgeR is used with transcript lengths?

After all, y$genes is just a list of lengths with no labels attached to them with the first length in the vector being "5182" but the first transcript in the y$counts is "kf_t00010653-RA".

I have tried sorting the raw counts matrix before putting it into a DGEList object like this:

data.set <- data.set[order(row.names(data.set)), ]

so that the y$counts matrix has the same order as the file used to extract gene lists. Then the lengths should match the correct transcript.

Does this look like a viable solution?

RSEM edgeR Trinity • 354 views
ADD COMMENT

Login before adding your answer.

Traffic: 2698 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6