Question

Normalization of RNASeq read counts without conditions

2

Entering edit mode

9.8 years ago

komal.rathi ★ 4.1k

Hello,

I have a data in the form of a dataframe that I downloaded from GTEx Portal. It contains RNASeq gene read counts used in their study.

> dim(expr.df)
[1] 55993  2921
> expr.df[1:10,1:2]
                GTEX-N7MS-0007-SM-2D7W1 GTEX-N7MS-0008-SM-4E3JI
ENSG00000223972                       0                       0
ENSG00000227232                     158                     166
ENSG00000243485                       0                       0
ENSG00000237613                       0                       0
ENSG00000268020                       0                       0
ENSG00000240361                       0                       0
ENSG00000186092                       0                       0
ENSG00000238009                      17                       2
ENSG00000233750                      35                       0
ENSG00000237683                    8489                      34

I checked the sample information file and there is no information about the conditions. I want to normalize the raw counts. For that, I want to use DESeq's getVarianceStabilizedData() function. However this function takes as an input a CountDataSet object. So when I try to make a CountDataSet object using this:

> cds <- newCountDataSet(countData = as.matrix(expr.df))
Error in is(conditions, "matrix") : 
  argument "conditions" is missing, with no default

It spits out an error asking me to specify the conditions. However, there are no conditions in this dataset. How can I normalize these values?

readcounts RNA-Seq limma vst deseq • 4.2k views

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

I think you're getting into variance there. You just want to normalize for the number of reads sequenced, right?

I believe DESeq still uses median normalization.

I don't know the commands in DESeq but if you want to do it by hand here is the basic process:

Do a scatter plot of your two condition, i.e. GTEX-N7MS-0007-SM-2D7W1 on the X axis and GTEX-N7MS-0008-SM-4E3JI on the Y axis.

a=get the median count value in GTEX-N7MS-0007-SM-2D7W1

b=get the median count value in GTEX-N7MS-0008-SM-4E3JI

your slope, and median normalization factor is b/a

plot the line through your data (y intercept =0) and see if it fits.

Sometimes it works, sometimes you need to use something different.

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by Michele Busby ★ 2.2k

score 6 · Accepted Answer · 2014-07-08

6

Entering edit mode

9.8 years ago

komal.rathi ★ 4.1k

To save everyone's time, I tried this:

# specify just one condition
cds <- newCountDataSet(countData=as.matrix(expr.df),condition=c(rep(1,2921))) 
cds <- estimateSizeFactors(cds)
# method ="blind" assumes replicates of a single condition
cds <- estimateDispersions(cds, method="blind", fitType="local") 
vst <- getVarianceStabilizedData(cds) # this worked fine

Let me know if my approach seems reasonable or incorrect. I did not want to delete the question just in case other people face the same problem.

ADD COMMENT • link 6.9 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

c(rep(1,2921) is this number of samples?

ADD REPLY • link 3.7 years ago by svp ▴ 680

0

Entering edit mode

yes. They represent the number of samples

ADD REPLY • link 3.6 years ago by Karma ▴ 310