Warning - longer object length is not a multiple of shorter object length
0
0
Entering edit mode
20 months ago
JACKY ▴ 140

I have a counts dataframe of RNA-seq dataset, and got the gene lengths using this code:

exons = exonsBy(EnsDb.Hsapiens.v86, by="gene")
exons = reduce(exons)
len = sum(width(exons))
INDEX = intersect(rownames(counts),names(len))
geneLengths = len[INDEX ]
counts = counts[INDEX ,]

And used the lengths to normalize the data from counts to TPM, using this code:

rpkm <- apply(X = subset(counts),
               MARGIN = 2,
               FUN = function(x) {
                 10^9 * x / geneLengths / sum(as.numeric(x))
               })

TPM <- apply(rpkm, 2, function(x) x / sum(as.numeric(x)) * 10^6) %>% as.data.frame()

I don't get errors, I do get the normalized data.. but I also get a list of warnings like this:

Warning messages:
1: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length
2: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length
3: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length
4: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length
5: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length
6: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length
7: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length
8: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length
9: In 10^9 * x/genelength :
  longer object length is not a multiple of shorter object length

And it goes on for about 50 warnings...

Should I pay attention to these warnings? do they affect the normalization process?

genelength r TPM RNA-seq • 2.0k views
ADD COMMENT
0
Entering edit mode

I think you should be very concerned. It is hard to see from the code fragments what goes wrong, but I would double check that all objects involved have the desired length or dimensions which is likely not the case. Simple example:

> 10^9 * 1:3 / 1:15
 [1] 1000000000 1000000000 1000000000  250000000  400000000  500000000
 [7]  142857143  250000000  333333333  100000000  181818182  250000000
[13]   76923077  142857143  200000000
> 10^9 * 1:3 / 1:10
 [1] 1000000000 1000000000 1000000000  250000000  400000000  500000000
 [7]  142857143  250000000  333333333  100000000
Warning message:
In 10^9 * 1:3/1:10 :
  longer object length is not a multiple of shorter object length

So, x and genelength are not of the same size. Further, why do you use the subset function without further arguments? See ?subset

Also, I suggest using a proven Bioconductor package to do the job.

ADD REPLY
0
Entering edit mode

A proven Bioconductor package you mean using a package for normalization?

If so, is there any package you recommend ?

ADD REPLY
0
Entering edit mode

for example edgeR see https://support.bioconductor.org/p/64585/

Refer to the problem with rpkm (and tpm) , https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7373998/ and other sources for a discussion of the validity of rpkm/tpm for cross sample comparisons.

ADD REPLY
0
Entering edit mode

Thanks!

Do you also think my genelength calculation method is slumpy?

What do you think about this approach?

library (EDASeq)
ensembl_list <-  rownames(counts)
getGeneLengthAndGCContent(ensembl_list, "hsa")
ADD REPLY
0
Entering edit mode

Reposted (with additions) to Bioconductor https://support.bioconductor.org/p/9146002

ADD REPLY
0
Entering edit mode

Hi Gordon, do you really mean to link to this post? Looks very different

ADD REPLY
0
Entering edit mode

Sorry, wrong link, now corrected.

ADD REPLY

Login before adding your answer.

Traffic: 1876 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6