Question

DESeq2 with a small number of genes

0

7.5 years ago

glihm ▴ 660

Dear all,

I am writing a program in order to study the coverage of only one sequence. To sum up the pipeline:

1. Detect ORFs in the input sequence
2. Align all reads to the sequence (bowtie); the reads come from RNA-seq
3. Count the number of reads in each ORF (using the 5' end of each read)
4. Normalize these counts
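
The counting step of the pipeline above could be sketched as follows. This is only an illustration, not the poster's actual program: the function name, the ORF coordinates, and the read positions are all hypothetical, and it assumes the 5' position of each aligned read has already been extracted (for reverse-strand reads that is the rightmost aligned base, which is not handled here).

```python
def count_reads_per_orf(orfs, five_prime_positions):
    """Count reads whose 5' end falls inside each ORF.

    orfs: dict mapping ORF name -> (start, end), 0-based, end exclusive.
    five_prime_positions: iterable of 0-based 5' positions of aligned reads.
    A read that falls inside several overlapping ORFs is counted in each.
    """
    counts = {name: 0 for name in orfs}
    for pos in five_prime_positions:
        for name, (start, end) in orfs.items():
            if start <= pos < end:
                counts[name] += 1
    return counts


# Hypothetical toy data: two ORFs and seven read 5' positions.
orfs = {"orf1": (0, 300), "orf2": (350, 900)}
positions = [10, 120, 299, 300, 400, 899, 900]
counts = count_reads_per_orf(orfs, positions)
print(counts)
```

Positions 300 and 900 fall outside both half-open intervals, so they are not counted.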

Some input sequences have only 6 to 10 ORFs. I want to normalize these counts, and I tried DESeq2, which works fine (functionally speaking).

Now, statistically speaking, do you think that estimating dispersion and normalizing counts with DESeq2 for 6 to 10 genes is valid? And how will the adjusted p-values be affected when so few genes are provided for multiple testing?
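
To make the multiple-testing part of the question concrete, here is a minimal pure-Python sketch of the Benjamini-Hochberg adjustment, which is the default adjustment method in DESeq2's `results()`. The p-values are made up for illustration. With only 8 tests the correction factor `m / rank` is at most 8, so the adjustment itself is mild; the harder question is whether the raw p-values are trustworthy when dispersion is estimated from so few genes.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for 8 ORFs.
pvals = [0.001, 0.02, 0.03, 0.2, 0.5, 0.8, 0.04, 0.9]
padj = benjamini_hochberg(pvals)
for p, q in zip(pvals, padj):
    print(f"p = {p:<6} padj = {q:.4f}")
```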

I would appreciate any comments or suggestions from experienced people with statistics and RNA-seq data normalization.

Thank you !

----- EDIT ------

As the data does not satisfy the assumption mentioned in C. Yague's answer, what kind of count-based normalisation can be applied? I was thinking about RPKM, but RPKM is more a unit than a normalisation method. Or should I use something like TPM, and then compute fold changes from the TPM values?
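
For reference, the two units mentioned above can be computed as in the sketch below. The counts, ORF lengths, and total number of mapped reads are hypothetical. One caveat worth keeping in mind: TPM computed over only the ORFs of a single sequence always sums to one million within that set, so a fold change between TPM values reflects a change in an ORF's share of the set, not necessarily a change in absolute abundance.

```python
def rpkm(counts, lengths_bp, total_mapped_reads):
    """Reads per kilobase of ORF per million mapped reads."""
    return {
        g: counts[g] * 1e9 / (lengths_bp[g] * total_mapped_reads)
        for g in counts
    }


def tpm(counts, lengths_bp):
    """Transcripts per million, computed over the ORFs supplied."""
    rates = {g: counts[g] / lengths_bp[g] for g in counts}
    total = sum(rates.values())
    return {g: rates[g] * 1e6 / total for g in counts}


# Hypothetical toy data for one sample.
counts = {"orf1": 100, "orf2": 300}
lengths_bp = {"orf1": 1000, "orf2": 1500}

rpkm_values = rpkm(counts, lengths_bp, total_mapped_reads=1_000_000)
tpm_values = tpm(counts, lengths_bp)
print(rpkm_values)
print(tpm_values)
```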

Thank you again for your help !

DESeq2 SARTools RNA-Seq Count-normalisation • 3.7k views

ADD COMMENT • link updated 2.7 years ago by Morgan S. ▴ 80 • written 7.5 years ago by glihm ▴ 660

0

Dear glihm,

Can you please tell me what you ended up doing? Were you able to use the small list of genes for analysis, or did you decide to subset those genes from an analysis of all genes for your organism?

Thanks! Morgan

ADD REPLY • link 2.7 years ago by Morgan S. ▴ 80

score 4 · Accepted Answer · 2016-10-20

4

7.5 years ago

Carlo Yague 8.6k

In my opinion, there are at least two issues with using DESeq2 to normalize your dataset.

First, DESeq2 uses the whole distribution of read counts per gene to help estimate dispersion. With so few genes, I'm not sure this estimation would be very reliable.
Second, normalization with DESeq2 (and many other methods) assumes that there are no global differences across conditions (most genes are not differentially expressed). Can you assume that this is the case for your ORFs?
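
The second issue can be illustrated with a small pure-Python sketch of median-of-ratios size factors, written in the spirit of DESeq2's size factor estimation but not a reimplementation of it. The counts are invented: both samples are assumed to have the same sequencing depth, and three of the four ORFs are truly twice as abundant in the second sample.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene -> list of raw counts, one per sample.
    Genes with a zero count in any sample are skipped.
    """
    n_samples = len(next(iter(counts.values())))
    # Log of the geometric mean across samples, per usable gene.
    log_geo_means = {
        g: sum(math.log(c) for c in row) / n_samples
        for g, row in counts.items()
        if all(c > 0 for c in row)
    }
    factors = []
    for j in range(n_samples):
        log_ratios = [
            math.log(counts[g][j]) - lgm for g, lgm in log_geo_means.items()
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical counts: three of four ORFs double in sample 2.
toy_counts = {
    "orf1": [100, 200],
    "orf2": [50, 100],
    "orf3": [80, 160],
    "orf4": [30, 30],
}
sf = size_factors(toy_counts)
print(sf)
```

Here the size factors come out near 0.71 and 1.41, so dividing by them cancels the real two-fold change in the three ORFs and makes the unchanged one look down-regulated. With thousands of mostly unchanged genes the median is anchored by the unchanged majority, which is why the assumption matters.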

ADD COMMENT • link 7.5 years ago by Carlo Yague 8.6k

0

I would guess the second assumption is not valid for his dataset, and would suggest generating an MA plot to visualize this.
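
As a rough sketch of what such a plot shows, the M (log ratio) and A (mean log abundance) values can be computed as below; the function name, the pseudocount, and the counts are assumptions made for illustration. If most M values sit away from zero rather than scattering around it, the "most genes are unchanged" assumption is doubtful.

```python
import math


def ma_values(counts_a, counts_b, pseudocount=1.0):
    """Return (M, A) pairs for two equally long lists of counts.

    M is the log2 ratio b/a, A is the mean of the two log2 values.
    The pseudocount avoids taking the logarithm of zero.
    """
    pairs = []
    for a, b in zip(counts_a, counts_b):
        log_a = math.log2(a + pseudocount)
        log_b = math.log2(b + pseudocount)
        pairs.append((log_b - log_a, 0.5 * (log_a + log_b)))
    return pairs


# Hypothetical counts for four ORFs in two conditions.
condition_a = [100, 50, 80, 30]
condition_b = [200, 100, 160, 30]
for m_value, a_value in ma_values(condition_a, condition_b):
    print(f"M = {m_value:+.2f}  A = {a_value:.2f}")
```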

ADD REPLY • link 7.5 years ago by WouterDeCoster 47k

0

I will generate the MA plot to check that, but as you said, the second assumption is not valid.

ADD REPLY • link 7.5 years ago by glihm ▴ 660

0

Sounds like there is potentially other data in the set, so adding that back into the mix should address these objections (as long as the data satisfies assumption #2).

ADD REPLY • link 7.5 years ago by GenoMax 141k

0

As I said, I think the data cannot satisfy assumption #2. Is there a way to normalise the data with another count-based normalisation method that does not rely on this assumption?

ADD REPLY • link 7.5 years ago by glihm ▴ 660

0

What kind of data do you have? If you have RNA-seq data, you could try to normalize based on the count distribution of all the other genes, as GenoMax suggested.

ADD REPLY • link 7.5 years ago by Carlo Yague 8.6k

0

Yes, RNA-seq data (and Ribo-seq, which is very close to RNA-seq). I already have my "global" analysis with differential expression for thousands of genes.

But the point here is that I was trying to see whether some qualitative parameters are easier to study or to see when looking gene by gene.

I will try TPM and see if I get what I am looking for.

ADD REPLY • link 7.5 years ago by glihm ▴ 660

2

"I was trying to see if studying gene per gene some qualitative parameters are easier to study or to see."

I'm not sure this is a good idea; usually it's better to integrate all genes into the analysis and then look for those that you are interested in.

Anyway, if you still want to do that, you could use the sizeFactors() of the global DESeq2 analysis as normalizing factors for your smaller analysis. Those factors are derived from the whole distribution of read counts per gene (so issue #1 is addressed), and it's easier to assume no global expression changes across all genes than across a few genes.
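
The idea of reusing global size factors can be sketched as follows. The size factor values and ORF counts are hypothetical; in practice the factors would be exported from the global DESeq2 run (one value per sample, in the same sample order as the small count table). Within R, the same effect can be obtained by assigning them to the smaller dataset with `sizeFactors(dds) <- ...` before running the analysis.

```python
def normalize_with_size_factors(counts, size_factors):
    """Divide each sample's raw counts by that sample's size factor.

    counts: dict mapping ORF -> list of raw counts, one per sample.
    size_factors: list of per-sample factors, in the same sample order.
    """
    for row in counts.values():
        if len(row) != len(size_factors):
            raise ValueError("one size factor is needed per sample")
    return {
        g: [c / s for c, s in zip(row, size_factors)]
        for g, row in counts.items()
    }


# Hypothetical size factors taken from a global, genome-wide analysis.
global_size_factors = [0.8, 1.25]
orf_counts = {"orf1": [80, 250], "orf2": [40, 50]}
normalized = normalize_with_size_factors(orf_counts, global_size_factors)
print(normalized)
```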

ADD REPLY • link 7.5 years ago by Carlo Yague 8.6k

0

Thanks for your answer, C. Yague.

On assumption #1, I agree with you; unfortunately, this is exactly what I feared.

On assumption #2, I think it is difficult to assume, because if the whole observed region is changing, all the ORFs can potentially be differentially expressed. So you were right to mention this condition, because I think I am not meeting it...

Oh well, I think I have to look for another way to proceed. Thank you very much for this answer and the comments, which confirm my opinion!

ADD REPLY • link 7.5 years ago by glihm ▴ 660