When I run RSEM and then join together the different .isoforms.results files with per sample expression counts into a raw counts matrix (using Trinity script abundance_estimates_to_matrix.pl), this matrix is unsorted and does not carry over the same sorting as the isoforms.results files.
For instance, a raw counts matrix I have starts with the following:
control1 control2 control3 control4 exp1 exp2 exp3 exp4
kf_t00010653-RA 28.00 31.00 34.00 33.00 92.00 100.00 75.00 97.00
kf_t00014892-RA 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
kf_t00015024-RA 14.45 6.00 6.00 0.00 4.00 2.00 0.00 1.00
The RSEM *isoforms.results file has this sorting:
transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct
kf_t00000001-RA kf_g00000001 5182 4933.94 191.91 5.86 5.34 100.00
kf_t00000002-RA kf_g00000002 2390 2141.94 125.88 8.85 8.07 100.00
kf_t00000003-RA kf_g00000003 654 405.97 0.00 0.00 0.00 0.00
This means that if you casually import the gene lengths into the DGEList object in edgeR like this (length is a table generated from the above Trinity script containing lengths in the column length):
y <- DGEList(counts=data.set, group=treatment)
y$genes = data.frame(length$length)
Does this mean that the lengths for kf_t00000001-RA will be used for kf_t00010653-RA, leading to nonsense results later on when rpkm() from EdgeR is used with transcript lengths?
After all, y$genes is just a list of lengths with no labels attached to them with the first length in the vector being "5182" but the first transcript in the y$counts is "kf_t00010653-RA".
I have tried sorting the raw counts matrix before putting it into a DGEList object like this:
data.set <- data.set[order(row.names(data.set)), ]
so that the y$counts matrix has the same order as the file used to extract gene lists. Then the lengths should match the correct transcript.
Does this look like a viable solution?