Question: Number of reads in one gene affect the differential expression analysis?
0
gravatar for camelbbs
2.6 years ago by
camelbbs650
China
camelbbs650 wrote:

Hi all,

I recently calculate the differential expressed genes from RNASeq data by DESeq. One question is bothering me.

When we checked the number of reads in every gene we annotated, we found some genes have huge number of reads, compare to total reads, such as:

sample  s-664561    s-665905    s-655605    ZC1 ZC2 ZC3
total_reads 1190411 2702168 4201061 7254155 1546197 8178163
TCONS_00022986+TCONS_00022987   831654  2391414 3859575 541428  268746  758906
%   69.86276168 88.49982681 91.87143438 7.463694944 17.38109698 9.279663416

One gene TCONS_00022986+TCONS_00022987 takes about 70-90% of total reads in S group. I doubt this will affect the differential expression analysis in DESeq. So how to handle this issue in analysis. Or just simply remove this gene?

Thanks a lot, Cam

rna-seq deg deseq counts rnaseq • 1.2k views
ADD COMMENTlink modified 2.6 years ago by i.sudbery4.3k • written 2.6 years ago by camelbbs650

Could you elaborate on your experimental design and analysis pipeline? How was the library prep and sequencing performed, which steps did you take to generate these counts?

ADD REPLYlink written 2.6 years ago by WouterDeCoster38k

library is total RNA with rRNA depleted, and strand-specific. HTSeq-count for generating counts.

ADD REPLYlink written 2.6 years ago by camelbbs650
4
gravatar for i.sudbery
2.6 years ago by
i.sudbery4.3k
Sheffield, UK
i.sudbery4.3k wrote:

This sort of thing is expected. Gene expression values generally follow a log-normal distribution, and you expect a small number of genes to have a very high expression. This can cause a real problem in differential testing, because a small fractional change in the level of expression from one of these high expressers can suck sequencing capacity from all other genes, effectively reducing the counts in other genes.

To counter this, both edgeR and DESeq use normalisation systems that take this into account. (edgeR uses the trimmed median of M values, DESeq uses upper quartile normalization). More recent versions of cufflinks also account for this effect too. What you absolutely don't want to do is normalise by total library size, as a naïve implementation of FPKM would do.

You should check that the highly expressed genes are not something that shouldn't be there - ribosomal genes do take up around that sort of fraction of RNA in the cell, but should be removed by good RNA-seq library preparation, high levels of ribosomal contamination may indicate samples. Alternatively, if you express transgenes from viral promoters, their RNA can take up a large fraction of all RNA in the cell. Otherwise I probably wouldn't worry that much about it.

ADD COMMENTlink written 2.6 years ago by i.sudbery4.3k

Hi i.sudbery, thanks very much for your reply. However, I am not very understand you say "What you absolutely don't want to do is normalise by total library size, as a naïve implementation of FPKM would do". Can I just compare the FPKM to get the differential expressed genes? I thought that is a better solution in this case.

ADD REPLYlink written 2.6 years ago by camelbbs650

Depends how you get the FPKM values. If you were just to divide the number of read pairs for each gene by the gene length and the total number of read pairs in the library (literally Fragments Per Kilobase per Million), the expression of your very highly expressed genes will affect the expression estimates for other genes.

Consider: Your cell contains 10 copies of transcript A and 90 copies of transcript B, for simplicity lets say they are both 1kb in length. If you sequence a million reads, you will get an FPKM of 1000 for gene A and 9000 for gene B.

Now consider a cell with the same 10 copies of transcript A, but 990 copies of transcript B. You will now get an FPKM of 100 for gene A and 9900 for gene B. I.e. a logFC of -1 for gene A and a logFC of 0.14 for gene B, even though gene A hasn't actually changed, but gene B has.

EdgeR, DESeq and recent versions of cufflinks have normalization methods that accounts for this effect. However, see this article and this blog post about the accuracy of cufflinks quantification.

ADD REPLYlink written 2.6 years ago by i.sudbery4.3k
1
gravatar for agata88
2.6 years ago by
agata88770
Poland
agata88770 wrote:

I think you should normalize number of reads by calculating FPKM (if you have PE reads) and then perform differential expression analysis. You can use cufflinks workflow which include normalization and differential expression and a lot more.

http://cole-trapnell-lab.github.io/cufflinks/

Best,

Agata

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by agata88770
2

No, tools such as DESeq2 and edgeR require you to use raw read counts without normalization.

ADD REPLYlink written 2.6 years ago by WouterDeCoster38k

Ah, Ok so alternatively camelbbs can use Cufflinks - with normalization.

ADD REPLYlink written 2.6 years ago by agata88770

Thanks agata, I agree using FPKM is a safe method to calculate the differential expressed genes, especially in this case.

ADD REPLYlink written 2.6 years ago by camelbbs650

Hi agata, I am not sure if cufflinks based on the FPKM or the raw counts. I see the latest version is cuffquant, does it calculate the DE genes based on FPKM?

ADD REPLYlink written 2.6 years ago by camelbbs650

Read this paper http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

As far as I know the DE analysis is performed by cuffdiff implemented in cufflinks software. I don't know cuffquant. The workflow is as follow:

  • bowtie read aligner

  • TopHat, aligning RNAseq

  • Cufflinks package: Cufflinks(assembles transcripts) --> Cuffcompare (comparestranscript assemblies to annotation) --> cuffmerge (merges two or more transcripts assemblies) --> cuffdiff (finds differencially expressed genes and transcripts. Detects differencial splicing and promoter use) ---> CummeRbund library implemented in R (plots abundance and differencial expression results from cuffdiff)

FPKM is calculated at cuffdiff workflow level.

Hope it helps, Agata

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by agata88770
1

Cuffquant was introduced in a recent version of the cufflinks suite- it basically seperates out the quantitation step (calculating FPKMs) from the asesmbly step (cufflinks) and the differential expression testing (cuffdiff).

ADD REPLYlink written 2.6 years ago by i.sudbery4.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 757 users visited in the last hour