Different number of reads in RNA-Seq samples
Asked 4.5 years ago

Hi,

I am working on a human RNA-Seq project that includes two conditions (tumor vs. normal) with 21 and 5 replicates, respectively. Four samples in the tumor condition contain 250-300 million reads, while the other samples average 60-100 million reads. Since the results show significant differences between these samples, it seems DESeq2 normalization does not help.

Is there any suggestion for fixing this problem?

Thanks.

Reza

Comment:

> it seems DESeq2 normalization does not help.

What makes you conclude this?

Reply:

The heatmap and PCA plot of my data show large differences between these 4 samples and the others, which I think is due to the difference in reads per sample.

Reply:

Can you post the code you used? I believe DESeq2 does account for library-size differences. Check out the estimateSizeFactors function.

Mike Love answers a similar question here: https://support.bioconductor.org/p/107292/

He says that a difference of ~2x in read depth can be handled by the size-factor correction in DESeq2 (the function above). By the way, the edgeR package uses a different normalisation method (TMM): https://www.rdocumentation.org/packages/edgeR/versions/3.14.0/topics/calcNormFactors

In edgeR you can also specify the same normalisation method DESeq2 uses (RLE) if you like.
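For intuition, here is a minimal base-R sketch of the median-of-ratios calculation that estimateSizeFactors performs. Toy data, simplified: real DESeq2 also excludes genes with a zero count in any sample from the geometric mean.

```r
# Toy counts: 4 genes x 3 samples; B and C are library A sequenced 2x and 3x deeper
counts <- matrix(c(10, 20, 30,
                   20, 40, 60,
                    5, 10, 15,
                   40, 80, 120),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), c("A", "B", "C")))

# 1. Geometric mean of each gene across samples (the pseudo-reference)
geo_means <- exp(rowMeans(log(counts)))

# 2. Size factor = per-sample median of the count / pseudo-reference ratios
size_factors <- apply(counts / geo_means, 2, median)
size_factors / size_factors["A"]   # 1, 2, 3: the depth differences are recovered

# Dividing by the size factors makes the samples directly comparable
norm_counts <- sweep(counts, 2, size_factors, "/")
```

DESeq2 applies this kind of correction internally, which is why raw library-size differences of a few fold usually don't need any extra handling.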

Reply:

Thanks for your reply. You can see the code I used below. I also read Mike Love's answer and his suggestions for handling different sequencing depths across samples, but I have no idea how to apply them to my own data.

    library("DESeq2")

    sampleTable <- data.frame(sampleName = sampleNames,
                              fileName = sampleFiles,
                              condition = sampleCondition)

    ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                           design = ~ condition)

    # Set the factor levels so "Normal" is the reference level
    treatments <- c("Normal", "Leukemia")
    colData(ddsHTSeq)$condition <- factor(colData(ddsHTSeq)$condition,
                                          levels = treatments)
    ddsHTSeq$condition

    # Differential expression analysis
    dds <- DESeq(ddsHTSeq)

I used the code below to check the number of reads, and it seems the problem starts here: samples "S1", "S2", "S3" and "S4" have far fewer reads than the others.

    round(colSums(counts(dds))/1e6)

       N1        N2        N3        N5        N7 S1.counts S2.counts 
      116        27        81       100        39         7         8 
S3.counts S4.counts        L2        L3        L7        L8        L9 
        6         6        47        29        12        41        35 
      L10       L17       L19       L20       L21       L22       L23 
       17        38        32        49        45        35        39 
      L24       L25       L26       L27       L28 
       34        53        22        16        15

This is a heatmap of the top 500 expressed genes, which clearly shows the issue.

    rld <- rlogTransformation(dds, blind = TRUE)
    vsd <- varianceStabilizingTransformation(dds, blind = TRUE)

    library("RColorBrewer")
    library("gplots")

    # Heatmap of the 500 most highly expressed genes
    select <- order(rowMeans(counts(dds, normalized = TRUE)), decreasing = TRUE)[1:500]
    my_palette <- colorRampPalette(c("blue", "white", "red"))(n = 500)
    heatmap.2(assay(vsd)[select, ], col = my_palette,
              scale = "row", key = TRUE, keysize = 1, symkey = TRUE,
              density.info = "none", trace = "none",
              cexCol = 0.6, labRow = FALSE,
              main = "Heatmap of 500 DE Genes in Gastric Cancer")

[Heatmap image]

Reply:

counts() returns the original, non-normalised counts by default.

Mike Love's solution is to filter out genes with low counts:

    dds <- estimateSizeFactors(dds)
    keep <- rowSums(counts(dds, normalized=TRUE) >= 10) >= 6
    dds <- dds[keep,]

Note the normalized=TRUE flag in counts, which you should use in your colSums check above to look at the normalised counts.

At this stage you can either drop those samples completely, filter as suggested above, or continue with your current setup on the assumption that the default normalisation is enough to handle the library differences. From my reading, the number of replicates has more of an impact on DE gene discovery than library-size differences do.
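To make the filter concrete, here is a small self-contained sketch, with a toy matrix standing in for counts(dds, normalized=TRUE):

```r
# Toy normalised counts: 3 genes x 8 samples
norm_counts <- matrix(c(rep(0, 8),          # gene1: not expressed
                        rep(c(3, 12), 4),   # gene2: >= 10 in only 4 samples
                        rep(50, 8)),        # gene3: solidly expressed
                      nrow = 3, byrow = TRUE,
                      dimnames = list(paste0("gene", 1:3), paste0("S", 1:8)))

# Keep genes with a normalised count of at least 10 in at least 6 samples
keep <- rowSums(norm_counts >= 10) >= 6
keep
# gene1 gene2 gene3
# FALSE FALSE  TRUE
```

The "6" is usually set to the smallest group size; with 5 normal samples in your design, >= 5 might be the analogous choice (an assumption on my part, not from the thread).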

Reply:

Thanks for your recommendations. One of our project members suggests dividing each gene's count by the total counts of all genes in that sample, converting absolute frequencies to relative frequencies, and then performing the normalization. Do you think this is the right thing to do?

Reply:

That sounds similar to within-sample scaling methods such as CPM/RPKM. The DESeq2 manual specifically says to input un-normalised counts: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

IMO that's a bad idea.

I'm still not clear on why you're unsatisfied with the normalisation method used by DESeq2, especially if the library sizes are within normal ranges.
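To illustrate why: the proposed division does remove the depth difference, but it also destroys the integer counts that DESeq2's negative-binomial model is built on. A tiny sketch with made-up numbers:

```r
# Toy counts: S2 is the same library as S1, sequenced 10x deeper
counts <- matrix(c(10, 100,
                   90, 900),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB"), c("S1", "S2")))

# The colleague's proposal: divide by the per-sample totals
rel <- sweep(counts, 2, colSums(counts), "/")
rel  # the two columns are now identical (0.1 and 0.9)

# The depth difference is gone, but so is the information DESeq2 uses to
# trust deeply sequenced samples more; it expects the raw integer counts
# and handles depth itself via size factors.
```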
