Question

Question about length values given by salmon and tximport

16 months ago

Josh ▴ 20

Hi all, I am doing an RNA-Seq analysis and my main goal is to get a table of counts with the ENSEMBL Gene ID, Gene Name, counts, gene length and TPM and RPKM normalization values.

I understand that the .sf files produced by salmon contain all the information I need except RPKM (to build the salmon index I used the GENCODE FASTA file containing the transcript sequences).

Since the salmon files report transcript names, I need to map them to gene IDs. For that I have used tximport. However, I think tximport takes its values from the EffectiveLength column of the salmon files rather than the Length column, since the length values for some genes vary between samples.

This is what my length data looks like for five samples after using tximport:

| ENSG_ID | SAMPLE1 | SAMPLE2 | SAMPLE3 | SAMPLE4 | SAMPLE5 |
|---|---|---|---|---|---|
| ENSG00000000003 | 2220.02587 | 2033.72423 | 2264.36606 | 2327.30698 | 2412.35268 |
| ENSG00000000005 | 701.65211 | 701.65211 | 701.65211 | 701.65211 | 701.65211 |
| ENSG00000000419 | 905.66333 | 910.26504 | 909.86483 | 905.28731 | 914.59889 |
| ENSG00000000457 | 2769.80521 | 2816.65820 | 2616.51354 | 2852.26506 | 2726.30781 |
| ENSG00000000460 | 2271.36502 | 2149.57222 | 2456.34057 | 2434.60374 | 2290.22742 |

My question is: how can I make the length values homogeneous for each gene across all samples, or are the values generated by tximport correct as they are?

I ask because I have seen that several raw count files from GEO repositories have a length column with the same value for each gene in every sample.

Thanks!

salmon tximport RNA-seq • 1.0k views

updated 16 months ago by Ram 43k • written 16 months ago by Josh ▴ 20

score 2 · Accepted Answer · 2022-12-22

What salmon does is quantify reads against transcripts. What tximport does is aggregate that to the gene level and also calculate an average transcript length per gene. Say you have a gene with three isoforms, each of a different length. Say celltypeA expresses isoformA and isoformB, which are both short, and celltypeB expresses isoformC, which is very long. RNA-seq libraries are made from fragmented cDNA, so longer transcripts produce more fragments and inherently accumulate more reads. That means isoformC might accumulate a lot more reads than the other two isoforms, which, without correction, could lead to the inference that the gene was overexpressed in celltypeB, even though the difference might be entirely technical, due to transcript length rather than expression level.

tximport accounts for this length bias, so the average transcript length it returns can and should be used for something like RPKM, because it respects that the effective (that is, average) gene length can differ between samples. That in fact is a key point of the method.
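To make the three-isoform example concrete, here is a minimal Python sketch of an abundance-weighted average length for one gene. It only illustrates the arithmetic and is not tximport's actual code. The isoform lengths and TPM values are made up, `weighted_gene_length` is a hypothetical helper, and the fallback for an unexpressed gene is an assumption of this sketch.

```python
def weighted_gene_length(tpm, eff_len):
    """Abundance-weighted mean of transcript effective lengths
    for one gene in one sample."""
    total = sum(tpm.values())
    if total == 0:
        # Assumption of this sketch: with no expression at all,
        # fall back to the plain mean of the transcript lengths.
        return sum(eff_len.values()) / len(eff_len)
    return sum(tpm[t] * eff_len[t] for t in tpm) / total


# Hypothetical effective lengths (bp) for the three isoforms.
eff_len = {"isoformA": 500.0, "isoformB": 700.0, "isoformC": 4000.0}

# Hypothetical TPMs: celltypeA expresses the two short isoforms,
# celltypeB expresses only the long one.
celltype_a = {"isoformA": 30.0, "isoformB": 10.0, "isoformC": 0.0}
celltype_b = {"isoformA": 0.0, "isoformB": 0.0, "isoformC": 40.0}

len_a = weighted_gene_length(celltype_a, eff_len)
len_b = weighted_gene_length(celltype_b, eff_len)
print(len_a, len_b)  # 550.0 4000.0
```

The same gene ends up with a different length in each sample because the isoform mix differs, which is why the length matrix in the question is not constant across columns.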

tl;dr: tximport is correct; use the counts and lengths it returns to get RPKM/FPKM. DESeq2 has a function for that downstream of tximport, see its manual.
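For reference, the FPKM arithmetic with a per-sample gene length can be sketched as below. This is a plain-Python illustration under simple assumptions: the counts and lengths are hypothetical, `fpkm_per_sample` is a made-up name, and the library size is just the sum of counts in the sample. DESeq2 may normalize library size differently, so in practice use its function rather than this sketch.

```python
def fpkm_per_sample(counts, lengths):
    """FPKM for one sample: count * 1e9 / (gene length * library size).

    counts  -- gene -> estimated count in this sample
    lengths -- gene -> average transcript length in this sample (bp)
    """
    library_size = sum(counts.values())
    return {
        gene: counts[gene] * 1e9 / (lengths[gene] * library_size)
        for gene in counts
    }


# Hypothetical values for a single sample.
counts = {"geneX": 500.0, "geneY": 1500.0}
lengths = {"geneX": 1000.0, "geneY": 2000.0}

values = fpkm_per_sample(counts, lengths)
print(values)
```

Because the lengths are taken per sample, the same count for a gene in two samples can give different FPKM values when the average transcript length differs between them.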