Question

Cuffdiff in highly different samples

5.2 years ago

salvocamiolo ▴ 20

Hello everyone, I have a question about how Cuffdiff determines which genes are differentially expressed. Here is a quick example of what I got. In sample (a) I have three datasets of expressed genes, which account for the following percentages of the total reads:

Dataset_A          50%
Dataset_B          27%
Dataset_C          23%

After a knockdown experiment, all the genes in Dataset_C have been removed. This provided a second RNA-seq sample (b). If I now look at the proportion of reads considering only Dataset_A and Dataset_B, I would expect about 65% Dataset_A and 35% Dataset_B. Let's say that in sample (b) I get the same total number of reads (1000, to make things easier). Every gene in Dataset_A and Dataset_B will then have a higher number of reads in (b) than in (a), and a higher RPKM. In theory, Cuffdiff should call all of these genes upregulated in sample (b). This of course did not happen: only a subset was declared differentially expressed (with both downregulated and upregulated genes). But, to be sure that my final results are reliable, I would like to understand how Cuffdiff deals with this kind of scenario, and whether I need to take precautions when running it, since a component accounting for 23% of the total reads in the initial sample is absent in the second. Thanks for your help
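The arithmetic behind the 65% / 35% figures can be sketched as follows. This is only a worked example on the numbers given in the question, assuming equal depth (1000 reads) in both samples and no real change in expression. It says nothing about how Cuffdiff itself normalises, and the variable names are illustrative.

```python
# Read shares in sample (a), taken from the question.
shares_a = {"Dataset_A": 0.50, "Dataset_B": 0.27, "Dataset_C": 0.23}
total_reads = 1000  # same depth assumed for both samples, as in the question

# Sample (b): Dataset_C is absent, so the remaining shares are rescaled to sum to 1.
remaining = {name: share for name, share in shares_a.items() if name != "Dataset_C"}
scale = 1.0 / sum(remaining.values())  # 1 / 0.77
shares_b = {name: share * scale for name, share in remaining.items()}

for name in remaining:
    reads_a = shares_a[name] * total_reads
    reads_b = shares_b[name] * total_reads
    print(f"{name}: {reads_a:.0f} -> {reads_b:.0f} reads, fold change {reads_b / reads_a:.2f}")

# Dataset_A: 500 -> 649 reads, fold change 1.30
# Dataset_B: 270 -> 351 reads, fold change 1.30
```

Under these assumptions every remaining gene shifts by the same factor, 1/0.77 (about 1.30), purely because Dataset_C no longer takes up reads.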

RNA-Seq Cufflinks rna-seq • 1.1k views

Updated 5.2 years ago by GenoMax 141k • written 5.2 years ago by salvocamiolo ▴ 20

score 3 · Answer 1 · 2019-02-18

5.2 years ago

Friederike 8.9k

In theory Cuffdiff should call as upregulated all the genes in the sample (b). This of course did not happen

How do you know that's true?

However, I'm not sure I totally follow -- are you saying that you're detecting different numbers of genes in the different data sets? How can "all genes in dataset_C" have been removed? What does that mean? How did you define these gene sets?

Generally, there's somewhat of a consensus not to use Cuffdiff, but rather to rely on gene-focused comparisons with additional assessments of differential transcript usage (see, for example, Love et al., 2018 and the Bioconductor workflows).

5.2 years ago by Friederike 8.9k

Thanks a lot for your answer, and sorry for not being totally clear. The sample is actually made of two species: an organism and a virus infecting it. The three datasets are the organism genes, the viral genes, and the viral non-coding RNA. With the knockdown experiment I removed the non-coding RNA genes, leaving Dataset_A and Dataset_B alone. Your answer actually pointed in the right direction, I believe. Indeed, it does not matter what the proportions of Dataset_A and Dataset_B are in sample A, since when I calculated the differentially expressed genes I compared the organism genes and the viral genes directly, in two separate analyses. So my whole initial assumption was wrong. Thanks a lot for your message!