Entering edit mode

6 weeks ago

JACKY
▴
10

I have a counts dataframe of RNA-seq dataset, and got the gene lengths using this code:

```
exons = exonsBy(EnsDb.Hsapiens.v86, by="gene")
exons = reduce(exons)
len = sum(width(exons))
INDEX = intersect(rownames(counts),names(len))
geneLengths = len[INDEX ]
counts = counts[INDEX ,]
```

And used the lengths to normalize the data from counts to TPM, using this code:

```
rpkm <- apply(X = subset(counts),
MARGIN = 2,
FUN = function(x) {
10^9 * x / geneLengths / sum(as.numeric(x))
})
TPM <- apply(rpkm, 2, function(x) x / sum(as.numeric(x)) * 10^6) %>% as.data.frame()
```

I don't get errors, I do get the normalized data.. but I also get a list of warnings like this:

```
Warning messages:
1: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
2: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
3: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
4: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
5: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
6: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
7: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
8: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
9: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
```

And it goes on for about 50 warnings...

Should I pay attention to these warnings? do they affect the normalization process?

I think you should be very concerned. It is hard to see from the code fragments what goes wrong, but I would double check that all objects involved have the desired length or dimensions which is likely not the case. Simple example:

So, x and genelength are not of the same size. Further, why do you use the subset function without further arguments? See ?subset

Also, I suggest using a proven Bioconductor package to do the job.

A proven Bioconductor package you mean using a package for normalization?

If so, is there any package you recommend ?

for example edgeR see https://support.bioconductor.org/p/64585/

Refer to the problem with rpkm (and tpm) , https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7373998/ and other sources for a discussion of the validity of rpkm/tpm for cross sample comparisons.

Thanks!

Do you also think my genelength calculation method is slumpy?

What do you think about this approach?

Reposted (with additions) to Bioconductor https://support.bioconductor.org/p/9146002

Hi Gordon, do you really mean to link to this post? Looks very different

Sorry, wrong link, now corrected.