Question: DESeq2 with a small number of genes
glihm (France) wrote 3.4 years ago:

Dear all,

I am writing a program to study the coverage of a single sequence. To sum up the pipeline:

  1. Detect ORFs in the input sequence
  2. Align all reads on the sequence (bowtie), reads come from RNA-seq
  3. Count the number of reads in each ORF (using the 5' end of each read)
  4. Normalize these counts
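
Step 3 of the pipeline above can be sketched as follows. This is only a minimal illustration, not the poster's actual program: the `(start, end, strand)` read tuples, the `orfs` dictionary and both function names are hypothetical, coordinates are assumed to be 0-based and half-open, and the strand of the ORF itself is ignored.

```python
from collections import Counter

def five_prime_position(start, end, strand):
    """Return the 0-based position of a read's 5' end.

    Assumes half-open coordinates [start, end) on the reference sequence.
    """
    return start if strand == "+" else end - 1

def count_reads_per_orf(reads, orfs):
    """Count reads whose 5' end falls inside each ORF.

    reads: iterable of (start, end, strand) tuples (hypothetical format).
    orfs:  dict mapping ORF name -> (start, end), half-open.
    A read is counted once for every ORF that contains its 5' end,
    so overlapping ORFs can both receive the same read.
    """
    counts = Counter({name: 0 for name in orfs})
    for start, end, strand in reads:
        pos = five_prime_position(start, end, strand)
        for name, (orf_start, orf_end) in orfs.items():
            if orf_start <= pos < orf_end:
                counts[name] += 1
    return dict(counts)

# Toy data (made up): two overlapping ORFs and four aligned reads.
orfs = {"orf1": (0, 300), "orf2": (250, 600)}
reads = [(10, 60, "+"), (240, 290, "-"), (280, 330, "+"), (590, 640, "-")]
counts = count_reads_per_orf(reads, orfs)
```

The last read is not counted anywhere because its 5' end (position 639 on the reverse strand) lies outside both ORFs.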

Some input sequences have only 6 to 10 ORFs. I want to normalize these counts and I tried DESeq2, which works fine (functionally speaking).

Now, statistically speaking, do you think that estimating dispersion and normalizing counts with DESeq2 for 6 to 10 genes is valid? How will the adjusted p-values be affected when so few genes are provided for multiple testing?

I would appreciate any comments or suggestions from experienced people with statistics and RNA-seq data normalization.

Thank you!

----- EDIT ------

As the data do not satisfy the assumption mentioned in C. Yague's answer, what kind of count-based normalisation can be applied? I was thinking about RPKM, but RPKM is more a unit than a normalisation method. Or should I use something like TPM, and then compute fold changes from the TPM values?

Thank you again for your help!

modified 3.4 years ago • written 3.4 years ago by glihm
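
For reference, the two units mentioned in the edit can be sketched in plain Python. The function names, the toy counts and the `total_mapped` argument are assumptions made for illustration. One caveat is worth seeing in numbers: TPM rescales each sample to a fixed total, so when it is computed over only the handful of ORFs on one sequence, a change shared by all of them cancels out.

```python
import math

def rpkm(counts, lengths, total_mapped):
    """Reads per kilobase of ORF per million mapped reads."""
    return [c * 1e9 / (length * total_mapped) for c, length in zip(counts, lengths)]

def tpm(counts, lengths):
    """Transcripts per million, rescaled over the ORFs given here
    (an assumption: not over a whole transcriptome)."""
    rates = [c / length for c, length in zip(counts, lengths)]
    total = sum(rates)
    return [r * 1e6 / total for r in rates]

def log2_fold_changes(values_a, values_b, pseudocount=1.0):
    """log2 of condition B over condition A, with a pseudocount to avoid log(0)."""
    return [math.log2((b + pseudocount) / (a + pseudocount))
            for a, b in zip(values_a, values_b)]

# Toy data (made up): three ORFs, every count doubled in condition B.
lengths = [900, 1500, 600]
condition_a = [90, 300, 60]
condition_b = [180, 600, 120]

tpm_a = tpm(condition_a, lengths)
tpm_b = tpm(condition_b, lengths)
fold_changes = log2_fold_changes(tpm_a, tpm_b)
```

Here `tpm_a` and `tpm_b` come out the same, so the fold changes are zero even though every ORF doubled. Fold changes computed from TPM over a few ORFs therefore only describe changes relative to the other ORFs on the same sequence.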
Carlo Yague (Canada) wrote 3.4 years ago:

In my opinion, there are at least two issues with using DESeq2 to normalize your dataset.

  1. First, DESeq2 uses the whole distribution of read counts per gene to help estimate dispersion. With so few genes, I'm not sure this estimate would be very reliable.

  2. Normalization with DESeq2 (and many other methods) assumes that there are no global differences across conditions (most genes are not differentially expressed). Can you assume that this is the case for your ORFs?

written 3.4 years ago by Carlo Yague
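
Issue 2 can be illustrated with a small sketch of median-of-ratios size factors, the general approach DESeq2 uses for normalization. This is a simplified pure-Python re-implementation for illustration only, not DESeq2's own code; the function name and the toy counts are made up.

```python
import math
from statistics import median

def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample.

    counts: list of rows, one row per gene, one column per sample.
    Genes with a zero count in any sample are skipped.
    """
    usable = [row for row in counts if all(c > 0 for c in row)]
    n_samples = len(counts[0])
    # Log of each gene's geometric mean across samples (the pseudo-reference).
    log_reference = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Per-gene log ratio of this sample to the pseudo-reference.
        log_ratios = [math.log(row[j]) - ref for row, ref in zip(usable, log_reference)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy data (made up): six ORFs, all four times higher in sample B than in sample A.
counts = [[10, 40], [25, 100], [50, 200], [80, 320], [120, 480], [200, 800]]
factors = median_of_ratios_size_factors(counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in counts]
```

With these counts the size factors come out close to 0.5 and 2, and the normalized values are the same in both samples: the real four-fold change has been absorbed into the size factors. This is what happens when most of the genes used for normalization change together.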

I would guess the second assumption is not valid for this dataset and would suggest generating an MA plot to visualize this.

written 3.4 years ago by WouterDeCoster
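
The values behind an MA plot can be computed as below. This is a minimal sketch, assuming two vectors of counts for the same ORFs and a pseudocount of 1 to avoid taking the log of zero; the function name and the toy counts are hypothetical, and the plotting itself is left to whichever library is at hand.

```python
import math

def ma_values(counts_a, counts_b, pseudocount=1.0):
    """Return (A, M) pairs for each gene.

    A is the mean log2 abundance, M is the log2 fold change (B over A).
    """
    points = []
    for a, b in zip(counts_a, counts_b):
        log_a = math.log2(a + pseudocount)
        log_b = math.log2(b + pseudocount)
        points.append(((log_a + log_b) / 2, log_b - log_a))
    return points

# Toy data (made up): three ORFs in two conditions.
points = ma_values([7, 15, 63], [31, 63, 255])
m_values = [m for _, m in points]
```

If most points sit away from M = 0 instead of scattering around it, as they do in this toy example, the assumption that most genes are unchanged is doubtful.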

I will generate the MA plot to check that, but as you said, the second assumption is not valid.

written 3.4 years ago by glihm

It sounds like there is potentially other data in the set, so adding that back into the mix should address these objections (as long as the data satisfy assumption #2).

written 3.4 years ago by genomax

As I said, I think the data can't satisfy assumption #2. Is there a way to normalise the data with another count-based method that does not rely on this assumption?

written 3.4 years ago by glihm

What kind of data do you have? If you have RNA-seq data, you could try to normalize based on the count distribution of all the other genes, as genomax suggested.

written 3.4 years ago by Carlo Yague

Yes, RNA-seq data (and Ribo-seq, which is really close to RNA-seq). I already have my "global" analysis with differential expression for thousands of genes.

But the point here is that I was trying to see whether some qualitative parameters are easier to study or to see when looking gene by gene.

I will try TPM and see whether I get what I am looking for.

written 3.4 years ago by glihm

"I was trying to see whether some qualitative parameters are easier to study or to see when looking gene by gene."

I'm not sure this is a good idea; usually it's better to integrate all genes into the analysis and then look for those that you are interested in.

Anyway, if you still want to do that, you could use the sizeFactors() of the global DESeq2 analysis as normalizing factors for your smaller analysis. Those factors are derived from the whole distribution of read counts per gene (so issue #1 is ok), and it's easier to assume no global expression change across all genes than across a few genes.

modified 3.4 years ago • written 3.4 years ago by Carlo Yague
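
The suggestion above amounts to dividing each sample's ORF counts by a size factor estimated elsewhere. A minimal sketch, assuming the size factors have already been exported from the global analysis and are keyed by sample name; the sample names, factor values, counts and function name are all hypothetical.

```python
def normalize_counts(counts_by_sample, size_factors):
    """Divide each sample's ORF counts by that sample's size factor.

    Raises KeyError if a sample has no size factor, so that a mismatch
    between the two analyses does not go unnoticed.
    """
    normalized = {}
    for sample, counts in counts_by_sample.items():
        factor = size_factors[sample]
        normalized[sample] = {orf: count / factor for orf, count in counts.items()}
    return normalized

# Hypothetical size factors taken from the genome-wide analysis.
global_size_factors = {"ctrl_1": 0.5, "treated_1": 2.0}

# Hypothetical raw counts for the ORFs of the sequence under study.
orf_counts = {
    "ctrl_1": {"orf1": 40, "orf2": 80},
    "treated_1": {"orf1": 250, "orf2": 120},
}

normalized = normalize_counts(orf_counts, global_size_factors)
```

Because the factors come from thousands of genes, a coordinated change in the few ORFs of interest is no longer normalized away. Within R, DESeq2 also allows size factors to be assigned with `sizeFactors(dds) <- ...`, so the same idea can be applied without leaving the package.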

Thanks for your answer, C. Yague.

On assumption #1, I agree with you, and unfortunately this is exactly what I feared.

On assumption #2, I think it is difficult to assume, because if the whole observed region is moving, all the ORFs can potentially be differentially expressed. So you were right to mention this condition, because I think I am not meeting it...

Oh well, I think I have to look for another way to proceed. Thank you very much for this answer and the comments, which confirm my opinion!

written 3.4 years ago by glihm