Number of reads in one gene affect the differential expression analysis?
2
0
Entering edit mode
6.6 years ago
camelbbs ▴ 690

Hi all,

I recently calculate the differential expressed genes from RNASeq data by DESeq. One question is bothering me.

When we checked the number of reads in every gene we annotated, we found some genes have huge number of reads, compare to total reads, such as:

sample  s-664561    s-665905    s-655605    ZC1 ZC2 ZC3
total_reads 1190411 2702168 4201061 7254155 1546197 8178163
TCONS_00022986+TCONS_00022987   831654  2391414 3859575 541428  268746  758906
%   69.86276168 88.49982681 91.87143438 7.463694944 17.38109698 9.279663416


One gene TCONS_00022986+TCONS_00022987 takes about 70-90% of total reads in S group. I doubt this will affect the differential expression analysis in DESeq. So how to handle this issue in analysis. Or just simply remove this gene?

Thanks a lot, Cam

RNA-Seq DESeq counts DEG RNASeq • 2.2k views
0
Entering edit mode

Could you elaborate on your experimental design and analysis pipeline? How was the library prep and sequencing performed, which steps did you take to generate these counts?

0
Entering edit mode

library is total RNA with rRNA depleted, and strand-specific. HTSeq-count for generating counts.

4
Entering edit mode
6.6 years ago

This sort of thing is expected. Gene expression values generally follow a log-normal distribution, and you expect a small number of genes to have a very high expression. This can cause a real problem in differential testing, because a small fractional change in the level of expression from one of these high expressers can suck sequencing capacity from all other genes, effectively reducing the counts in other genes.

To counter this, both edgeR and DESeq use normalisation systems that take this into account. (edgeR uses the trimmed median of M values, DESeq uses upper quartile normalization). More recent versions of cufflinks also account for this effect too. What you absolutely don't want to do is normalise by total library size, as a naïve implementation of FPKM would do.

You should check that the highly expressed genes are not something that shouldn't be there - ribosomal genes do take up around that sort of fraction of RNA in the cell, but should be removed by good RNA-seq library preparation, high levels of ribosomal contamination may indicate samples. Alternatively, if you express transgenes from viral promoters, their RNA can take up a large fraction of all RNA in the cell. Otherwise I probably wouldn't worry that much about it.

0
Entering edit mode

Hi i.sudbery, thanks very much for your reply. However, I am not very understand you say "What you absolutely don't want to do is normalise by total library size, as a naïve implementation of FPKM would do". Can I just compare the FPKM to get the differential expressed genes? I thought that is a better solution in this case.

0
Entering edit mode

Depends how you get the FPKM values. If you were just to divide the number of read pairs for each gene by the gene length and the total number of read pairs in the library (literally Fragments Per Kilobase per Million), the expression of your very highly expressed genes will affect the expression estimates for other genes.

Consider: Your cell contains 10 copies of transcript A and 90 copies of transcript B, for simplicity lets say they are both 1kb in length. If you sequence a million reads, you will get an FPKM of 1000 for gene A and 9000 for gene B.

Now consider a cell with the same 10 copies of transcript A, but 990 copies of transcript B. You will now get an FPKM of 100 for gene A and 9900 for gene B. I.e. a logFC of -1 for gene A and a logFC of 0.14 for gene B, even though gene A hasn't actually changed, but gene B has.

0
Entering edit mode

Greeting I have single-cell sequencing data from three different Patients(hepatocellular cancer) and every sample has 2 replicates(hepatocytes and NPC) , I see a high fraction of ribosomal genes in my samples. I couldn't find any reason for this . could you please tell why we have a high fraction of ribosomal genes?(method :droplet-based sequencing 10xGenomics)

Thank you

1
Entering edit mode
6.6 years ago
agata88 ▴ 850

I think you should normalize number of reads by calculating FPKM (if you have PE reads) and then perform differential expression analysis. You can use cufflinks workflow which include normalization and differential expression and a lot more.

Best,

Agata

2
Entering edit mode

No, tools such as DESeq2 and edgeR require you to use raw read counts without normalization.

0
Entering edit mode

Ah, Ok so alternatively camelbbs can use Cufflinks - with normalization.

0
Entering edit mode

Thanks agata, I agree using FPKM is a safe method to calculate the differential expressed genes, especially in this case.

0
Entering edit mode

Hi agata, I am not sure if cufflinks based on the FPKM or the raw counts. I see the latest version is cuffquant, does it calculate the DE genes based on FPKM?

0
Entering edit mode

As far as I know the DE analysis is performed by cuffdiff implemented in cufflinks software. I don't know cuffquant. The workflow is as follow:

• TopHat, aligning RNAseq

• Cufflinks package: Cufflinks(assembles transcripts) --> Cuffcompare (comparestranscript assemblies to annotation) --> cuffmerge (merges two or more transcripts assemblies) --> cuffdiff (finds differencially expressed genes and transcripts. Detects differencial splicing and promoter use) ---> CummeRbund library implemented in R (plots abundance and differencial expression results from cuffdiff)

FPKM is calculated at cuffdiff workflow level.

Hope it helps, Agata

1
Entering edit mode

Cuffquant was introduced in a recent version of the cufflinks suite- it basically seperates out the quantitation step (calculating FPKMs) from the asesmbly step (cufflinks) and the differential expression testing (cuffdiff).