Normalization of RNASeq read counts without conditions
Entering edit mode
9.8 years ago
komal.rathi ★ 4.1k


I have a data in the form of a dataframe that I downloaded from GTEx Portal. It contains RNASeq gene read counts used in their study.

> dim(expr.df)
[1] 55993  2921
> expr.df[1:10,1:2]
                GTEX-N7MS-0007-SM-2D7W1 GTEX-N7MS-0008-SM-4E3JI
ENSG00000223972                       0                       0
ENSG00000227232                     158                     166
ENSG00000243485                       0                       0
ENSG00000237613                       0                       0
ENSG00000268020                       0                       0
ENSG00000240361                       0                       0
ENSG00000186092                       0                       0
ENSG00000238009                      17                       2
ENSG00000233750                      35                       0
ENSG00000237683                    8489                      34

I checked the sample information file and there is no information about the conditions. I want to normalize the raw counts. For that, I want to use DESeq's getVarianceStabilizedData() function. However this function takes as an input a CountDataSet object. So when I try to make a CountDataSet object using this:

> cds <- newCountDataSet(countData = as.matrix(expr.df))
Error in is(conditions, "matrix") : 
  argument "conditions" is missing, with no default

It spits out an error asking me to specify the conditions. However, there are no conditions in this dataset. How can I normalize these values?

readcounts RNA-Seq limma vst deseq • 4.2k views
Entering edit mode

I think you're getting into variance there. You just want to normalize for the number of reads sequenced, right?

I believe DESeq still uses median normalization.

I don't know the commands in DESeq but if you want to do it by hand here is the basic process:

Do a scatter plot of your two condition, i.e. GTEX-N7MS-0007-SM-2D7W1 on the X axis and GTEX-N7MS-0008-SM-4E3JI on the Y axis.

a=get the median count value in GTEX-N7MS-0007-SM-2D7W1

b=get the median count value in GTEX-N7MS-0008-SM-4E3JI

your slope, and median normalization factor is b/a

plot the line through your data (y intercept =0) and see if it fits.

Sometimes it works, sometimes you need to use something different.

Entering edit mode
9.8 years ago
komal.rathi ★ 4.1k

To save everyone's time, I tried this:

# specify just one condition
cds <- newCountDataSet(countData=as.matrix(expr.df),condition=c(rep(1,2921))) 
cds <- estimateSizeFactors(cds)
# method ="blind" assumes replicates of a single condition
cds <- estimateDispersions(cds, method="blind", fitType="local") 
vst <- getVarianceStabilizedData(cds) # this worked fine

Let me know if my approach seems reasonable or incorrect. I did not want to delete the question just in case other people face the same problem.

Entering edit mode

c(rep(1,2921) is this number of samples?

Entering edit mode

yes. They represent the number of samples


Login before adding your answer.

Traffic: 3349 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6