DESeq2 normalisation: is the size of the gene taken into account?
3
4
9.0 years ago
Aurelie MLB ▴ 360

Hello,

I can't quite work out whether the DESeq2 normalisation and regularized log transformation take the size of the gene into account. Do they?

It seems to me that they do not... but I am probably missing something. Do I have a bias toward long genes when I use DESeq2 to find differentially expressed genes, or when I look at expression profiles after a regularized log transformation?

Many thanks

RNA-Seq • 12k views
ADD COMMENT
8
9.0 years ago

No, the normalization steps don't take gene length into account, because it doesn't matter when you compare the same gene across samples. You do not have a bias toward longer genes; rather, you have increased power to find changes in them at a given expression level. This is a good thing, and you do not want to try to get rid of it.
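As a rough illustration of why gene length never enters the normalization, here is a minimal sketch of the median-of-ratios idea that DESeq2 documents for its size factors. The function name `size_factors` and the toy counts are made up for this example; it is not the DESeq2 API, only an outline of the calculation under those assumptions.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: dict mapping gene -> list of raw counts, one per sample.

    Returns one size factor per sample (hypothetical helper, not DESeq2 code).
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if min(gene_counts) <= 0:
            continue  # a zero count makes the geometric mean zero, so skip the gene
        # log of the gene's geometric mean across samples
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # per sample: median ratio to the per-gene geometric mean
    return [math.exp(median(r)) for r in log_ratios]

# Toy data (made up): sample 2 is sequenced twice as deeply as sample 1.
counts = {"geneA": [100, 200], "geneB": [10, 20], "geneC": [1000, 2000]}
factors = size_factors(counts)

# Pretend geneA were ten times longer, so it collects ten times the reads
# in every sample. The size factors do not change.
counts_long = dict(counts, geneA=[1000, 2000])
factors_long = size_factors(counts_long)

print(factors, factors_long)
```

Only ratios of a gene to itself across samples are used, so anything that scales a gene's counts equally in every sample, such as its length, drops out.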

If you're doing something like GO enrichment or other downstream analyses where gene length can play a role, then you should account for it there (see, for example, the goseq package).

ADD COMMENT
0

Thank you!

ADD REPLY
0

I don't understand why, in section 5.3 of goseq, they calculate the median rather than the sum over transcripts! Do you have any comments on that?

ADD REPLY
0

You should probably post this as a separate question.

ADD REPLY
0

Right! Do you think it is enough of an issue that I should make a separate post about it?

ADD REPLY
1

Well, it's a legitimate question and unrelated to the current thread, so yes.

ADD REPLY
0

@Devon Ryan I didn't understand why you have increased power to find changes in them given a constant expression level. Do you mean that you expect higher count values for longer genes across samples? Thanks!

ADD REPLY
0

Longer genes have higher counts, so their relative expression levels across conditions are easier to measure.
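To put a rough number on that, here is a small sketch that assumes pure Poisson counting noise. That is a simplification: DESeq2 fits a negative binomial model with extra dispersion. The expression level and gene lengths are made up for illustration.

```python
import math

def relative_error(expected_count):
    # For a Poisson count the standard deviation is sqrt(mean),
    # so the relative error (sd / mean) is 1 / sqrt(mean).
    return 1.0 / math.sqrt(expected_count)

reads_per_kb = 20               # hypothetical: same expression per kb for both genes
short_gene = reads_per_kb * 1   # 1 kb gene  -> 20 expected reads
long_gene = reads_per_kb * 10   # 10 kb gene -> 200 expected reads

cv_short = relative_error(short_gene)
cv_long = relative_error(long_gene)

print(f"1 kb gene:  relative error ~ {cv_short:.3f}")
print(f"10 kb gene: relative error ~ {cv_long:.3f}")
```

At the same per-kb expression, the longer gene's count is relatively less noisy, so a fold change of a given size is easier to detect.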

ADD REPLY
7
9.0 years ago

If you are comparing the same gene among different samples, then it doesn't really matter since you will be normalizing the gene in the different samples by the same length.
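A tiny worked example of that cancellation, with made-up numbers (the gene length and both counts are hypothetical):

```python
gene_length_kb = 2.5   # hypothetical gene length
count_control = 150    # made-up normalised count in the control sample
count_treated = 450    # made-up normalised count in the treated sample

# Fold change straight from the counts
fold_change_raw = count_treated / count_control

# Fold change after dividing each count by the gene length first
fold_change_per_kb = (count_treated / gene_length_kb) / (count_control / gene_length_kb)

print(fold_change_raw, fold_change_per_kb)
```

The length appears in both the numerator and the denominator, so the two fold changes agree.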

If you want to compare different genes within the same sample, then gene length would matter (DESeq2 wasn't really made to do this anyway). However, I don't think comparing different genes within a sample is necessarily valid; it depends on how you arrived at your tag counts.

For example, if you only considered uniquely mapped reads when generating your tag counts, then genes with repetitive or conserved regions will be artificially under-counted.

ADD COMMENT
0

Hello,

OK, thank you. I realise now why the size is not important in comparisons between samples. And I can see why it is a problem to compare gene expression within a sample...

May I ask you another question, please? What I would actually like to do is inspect the expression of all genes within a sample, for instance to see how strongly markers are expressed in a control sample. So far, I have been applying the regularised log transformation of DESeq2 to the counts and plotting the log value (y axis) against the genes (x axis). I gather from your answer that this might be misleading... But would there be a better way? Would a classical log2 transformation of FPKM be better, since it would at least account for the size? (And yes, I did consider the uniquely mapped reads only... :( )

ADD REPLY
0

Are you trying to assess how abundantly a gene is expressed for experimental purposes (in situ hybs, transgenic targets)? I get that question a lot from my lab mates.

It is not an exact science, since the signal you get from whatever marker you are using will depend on many different factors, among which the abundance of expression might not play that big a role.

What I usually end up doing for my lab mates is just rank their candidate genes by tag counts per kb and they can choose the top 10 genes or something. I don't have enough data to say whether there is some kind of correlation between tag counts per kb and marker signal.
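The ranking described above can be sketched as follows. The gene names, counts, lengths and library size are all made up, and the sketch assumes each count corresponds to one fragment. FPKM is included only to show that, within a single sample, it is counts per kb divided by a constant (the number of mapped fragments in millions), so the order of the genes does not change.

```python
# Hypothetical candidates: name -> (tag count, gene length in bp)
candidates = {
    "markerA": (900, 3000),
    "markerB": (500, 1000),
    "markerC": (1200, 6000),
}
total_mapped = 20_000_000  # hypothetical number of mapped fragments in the sample

def counts_per_kb(count, length_bp):
    # Tag count divided by gene length in kilobases
    return count / (length_bp / 1000)

def fpkm(count, length_bp, total):
    # Counts per kb, further divided by the library size in millions
    return counts_per_kb(count, length_bp) / (total / 1_000_000)

rank_by_cpk = sorted(
    candidates, key=lambda g: counts_per_kb(*candidates[g]), reverse=True
)
rank_by_fpkm = sorted(
    candidates, key=lambda g: fpkm(*candidates[g], total_mapped), reverse=True
)

print(rank_by_cpk)
print(rank_by_fpkm)
```

The two measures only diverge once you compare across samples, where the library sizes differ.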

ADD REPLY
0

Yes the purpose would be similar.

May I ask how tag counts per kb differ from FPKM? Apologies if this is a stupid question :)

ADD REPLY
2
9.0 years ago
Michael Love ★ 2.6k
We discourage cross posting the same question on multiple sites because it duplicates everyone's effort in answering your questions. At the least, please link to the Bioc support site posts.
ADD COMMENT
0

Apologies! I did not know. I posted here first and then saw that the Bioconductor support website was recommended in your documentation, so I thought it would be more appropriate to post there instead. All I can say now is that it will not happen again...

The Bioconductor post, with your answer, is here: https://support.bioconductor.org/p/67132/

ADD REPLY
0

no worries. thanks for adding the link. this helps people follow the trail.

ADD REPLY
