Hi all, I am doing an RNA-Seq analysis and my main goal is to get a count table with the Ensembl gene ID, gene name, counts, gene length, and TPM and RPKM normalized values.
As I understand it, Salmon's .sf files give me the information I need, except for RPKM (I built the Salmon index from the GENCODE FASTA file containing the transcript sequences).
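For reference, this is the RPKM calculation I have in mind, which is why I need one gene length per gene. It is only a minimal Python sketch of the standard formula (reads per kilobase per million mapped reads); the function name and the numbers are made up for illustration and do not come from my data.

```python
def rpkm(counts, lengths_bp):
    """Return RPKM for each gene in one sample.

    counts     -- read counts per gene (hypothetical values)
    lengths_bp -- gene lengths in base pairs, same order as counts
    """
    total_reads = sum(counts)  # library size for this sample
    # RPKM = count * 1e9 / (gene length in bp * total reads in the sample)
    return [c * 1e9 / (length * total_reads) for c, length in zip(counts, lengths_bp)]


# Toy example: three genes in one sample
example_counts = [100, 300, 600]
example_lengths = [1000, 2000, 3000]
example_rpkm = rpkm(example_counts, example_lengths)
print(example_rpkm)
```

Whichever length goes into `lengths_bp` directly changes the result, so the choice of length column matters for this step.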
Because the Salmon files report transcript names, I need to summarize them to gene IDs, and I used tximport for that. However, I think tximport takes its values from the EffectiveLength column of the Salmon files rather than the Length column, because the length values for some genes differ between samples.
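My understanding of why this happens is that the gene-level length is an average of the transcript effective lengths, weighted by each transcript's abundance in that sample, so a gene whose isoform usage differs between samples ends up with a different length. Here is a minimal Python sketch of that idea. It is not tximport's actual code; the function name, the toy numbers, and the fallback for unexpressed genes are my own assumptions.

```python
def weighted_gene_length(transcripts):
    """Abundance-weighted mean effective length for one gene in one sample.

    transcripts -- list of (effective_length, tpm) tuples, one per isoform
    """
    total_tpm = sum(tpm for _, tpm in transcripts)
    if total_tpm == 0:
        # Assumption for this sketch only: use the plain mean when nothing is expressed
        return sum(length for length, _ in transcripts) / len(transcripts)
    return sum(length * tpm for length, tpm in transcripts) / total_tpm


# Same gene with two isoforms (1000 bp and 3000 bp), different isoform usage per sample
sample_a = weighted_gene_length([(1000.0, 9.0), (3000.0, 1.0)])  # short isoform dominates
sample_b = weighted_gene_length([(1000.0, 1.0), (3000.0, 9.0)])  # long isoform dominates
print(sample_a, sample_b)
```

If this is right, it would explain why the same gene gets a different length in each sample even though the annotation is the same.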
This is what my length data looks like after using tximport (first five genes, five samples):
| ENSG_ID | SAMPLE1 | SAMPLE2 | SAMPLE3 | SAMPLE4 | SAMPLE5 |
|---|---|---|---|---|---|
| ENSG00000000003 | 2220.02587 | 2033.72423 | 2264.36606 | 2327.30698 | 2412.35268 |
| ENSG00000000005 | 701.65211 | 701.65211 | 701.65211 | 701.65211 | 701.65211 |
| ENSG00000000419 | 905.66333 | 910.26504 | 909.86483 | 905.28731 | 914.59889 |
| ENSG00000000457 | 2769.80521 | 2816.65820 | 2616.51354 | 2852.26506 | 2726.30781 |
| ENSG00000000460 | 2271.36502 | 2149.57222 | 2456.34057 | 2434.60374 | 2290.22742 |
My question is: how can I make the length values the same for each gene across all samples? Or are the values generated by tximport correct as they are?
I ask because several raw count files in GEO repositories have a length column with a single value per gene that is the same in every sample.
Thanks!