Question

Questions about gene length and GC content in CQN normalization

5.4 years ago

boymin2020 ▴ 80

Hi,

I have several questions after reading the manual for the CQN normalization method. I have also checked several related Biostars posts, but I am still quite confused.

1. How do I get gene lengths? A simple approach is to take the gene start and end coordinates from the Ensembl website (end bp - start bp + 1?), but it seems more appropriate to sum the exonic regions of each gene.
2. How do I get GC content (%)? Unlike gene length, the Ensembl website reports GC content directly. But if gene length should not be calculated the way I described, those values cannot be used either.
3. If the data have no GC-content bias or gene-length bias, what effect does applying CQN normalization have?
4. Are the resulting values after CQN log2-scaled RPM by default?

In short, I want to know how to obtain the most accurate gene length and GC content.

Thanks,

CQN RNA-Seq gene length GC content • 2.3k views

updated 5.2 years ago by rrbutleriii ▴ 260 • written 5.4 years ago by boymin2020 ▴ 80

score 2 · Answer 1 · 2019-02-09

See this post for questions 1 and 2. Specifically, if all you need is gene length and GC content, and you don't want to learn to access biomaRt directly, this will work (though it takes a little time, depending on the size of your matrix).

library(EDASeq)
# Ensembl gene IDs to look up
ensembl_list <- c("ENSG00000000003","ENSG00000000419","ENSG00000000457","ENSG00000000460")
# Returns length and GC content for each ID; "hsa" selects human
getGeneLengthAndGCContent(ensembl_list, "hsa")
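To make the difference between "end - start + 1" and "sum of exonic bases" concrete, here is a minimal base-R sketch. The exon coordinates and the sequence are made up for illustration, and the helper names `exonic_length` and `gc_fraction` are hypothetical; this is not how EDASeq computes its values internally, just one way to see what the two definitions mean.

```r
# Hypothetical exon coordinates for one gene (1-based, inclusive),
# pooled across transcripts, so some intervals overlap.
exons <- data.frame(
  start = c(100, 150, 400, 900),
  end   = c(200, 250, 500, 950)
)

# Merge overlapping exons, then sum the widths of the merged blocks.
exonic_length <- function(start, end) {
  ord <- order(start)
  start <- start[ord]
  end <- end[ord]
  # Largest end position seen so far, in start order
  run_end <- cummax(end)
  # A new block begins where a start lies beyond every earlier end
  new_block <- c(TRUE, start[-1] > run_end[-length(run_end)])
  block <- cumsum(new_block)
  merged_start <- tapply(start, block, min)
  merged_end <- tapply(end, block, max)
  sum(merged_end - merged_start + 1)
}

# GC fraction of a sequence string, counting only A/C/G/T bases.
gc_fraction <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]
  sum(bases %in% c("G", "C")) / sum(bases %in% c("A", "C", "G", "T"))
}

genomic_span <- max(exons$end) - min(exons$start) + 1   # 851, includes introns
exonic_bases <- exonic_length(exons$start, exons$end)   # 303, exons only
gc_example   <- gc_fraction("ATGCGCTA")                 # 0.5
```

Whichever length definition you choose, the GC content should be computed over the same bases, which is why a gene-wide GC value does not pair well with an exonic length.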

Question 3: It will still quantile-normalize the data.

Question 4: Yes, see the example on page 4 of the vignette.
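As a rough sketch of how the length and GC values feed into CQN, the following uses the example data sets that ship with the cqn package (`montgomery.subset`, `sizeFactors.subset`, `uCovar`), following the pattern shown in its vignette. With your own data you would substitute your count matrix and the length and GC columns obtained above; the variable names `fit` and `log2_expr` are my own.

```r
library(cqn)

# Example data bundled with the cqn package
data(montgomery.subset)    # gene-by-sample count matrix
data(sizeFactors.subset)   # sequencing depth per sample
data(uCovar)               # per-gene covariates: length and gccontent

# The covariates must be in the same gene order as the count matrix
stopifnot(all(rownames(montgomery.subset) == rownames(uCovar)))

fit <- cqn(montgomery.subset,
           x = uCovar$gccontent,       # GC content per gene
           lengths = uCovar$length,    # gene length per gene
           sizeFactors = sizeFactors.subset)

# Normalized expression values on the log2 scale
log2_expr <- fit$y + fit$offset
```

Note that the normalized values are the sum of `y` and `offset`, not either component alone.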