Question: Normalization of RNASeq read counts without conditions
gravatar for komal.rathi
3.7 years ago by
Children's Hospital of Philadelphia, Philadelphia, PA
komal.rathi3.1k wrote:


I have a data in the form of a dataframe that I downloaded from GTEx Portal. It contains RNASeq gene read counts used in their study.

> dim(expr.df)
[1] 55993  2921
> expr.df[1:10,1:2]
                GTEX-N7MS-0007-SM-2D7W1 GTEX-N7MS-0008-SM-4E3JI
ENSG00000223972                       0                       0
ENSG00000227232                     158                     166
ENSG00000243485                       0                       0
ENSG00000237613                       0                       0
ENSG00000268020                       0                       0
ENSG00000240361                       0                       0
ENSG00000186092                       0                       0
ENSG00000238009                      17                       2
ENSG00000233750                      35                       0
ENSG00000237683                    8489                      34

I checked the sample information file and there is no information about the conditions. I want to normalize the raw counts. For that, I want to use DESeq's getVarianceStabilizedData() function. However this function takes as an input a CountDataSet object. So when I try to make a CountDataSet object using this:

> cds <- newCountDataSet(countData = as.matrix(expr.df))
Error in is(conditions, "matrix") : 
  argument "conditions" is missing, with no default

It spits out an error asking me to specify the conditions. However, there are no conditions in this dataset. How can I normalize these values?

readcounts rna-seq deseq vst limma • 1.5k views
ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by komal.rathi3.1k

I think you're getting into variance there.  You just want to normalize for the number of reads sequenced, right?

I believe DESeq still uses median normalization.

I don't know the commands in DESeq but if you want to do it by hand here is the basic process:

Do a scatter plot of your two condition, i.e. GTEX-N7MS-0007-SM-2D7W1  on the X axis and GTEX-N7MS-0008-SM-4E3JI on the Y axis.

a=get the median count value in  GTEX-N7MS-0007-SM-2D7W1 

b=get the median count value in GTEX-N7MS-0008-SM-4E3JI 

your slope, and median normalization factor is b/a

plot the line through your data (y intercept =0) and see if it fits.

Sometimes it works, sometimes you need to use something different.

ADD REPLYlink written 3.7 years ago by Michele Busby1.7k
gravatar for komal.rathi
3.7 years ago by
Children's Hospital of Philadelphia, Philadelphia, PA
komal.rathi3.1k wrote:

To save everyone's time, I tried this:

# specify just one condition
cds <- newCountDataSet(countData=as.matrix(expr.df),condition=c(rep(1,2921))) 
cds <- estimateSizeFactors(cds)
# method ="blind" assumes replicates of a single condition
cds <- estimateDispersions(cds, method="blind", fitType="local") 
vst <- getVarianceStabilizedData(cds) # this worked fine

Let me know if my approach seems reasonable or incorrect. I did not want to delete the question just in case other people face the same problem.

ADD COMMENTlink modified 10 months ago • written 3.7 years ago by komal.rathi3.1k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1576 users visited in the last hour