Question: DESeq2 bias when genes are missing from the annotation?
16 months ago by
corend70 wrote:

Since this concerns a Bioconductor package, I also posted here.

I am working on RNA-seq data. I built my count table with kallisto and then used tximport to bring it into DESeq2.

My reference is a set of cDNAs that are supposed to correspond to all the genes of my species, but the annotation is quite poor: when I align to these cDNAs, about 60% of reads map, compared with 95% against the whole genome.

I have 2 conditions: (A and B) and 3 replicates in each condition.

My fear is this: if a gene is over-expressed in A, not expressed in B, and missing from my cDNA list, I expect fewer mapped reads in A than in B. When DESeq2 normalizes the counts, could that create a bias?


A: 1 1 1 1 2 2 2 2 3 3

B: 1 1 1 1 2 3 3 3 3 3

Gene 3 is not annotated, so after normalization by DESeq2 I would expect:

A: 1 1 1 1 1 2 2 2 2 2

B: 1 1 1 1 1 1 1 1 2 2

Gene 1 then appears over-expressed in B, which is not true.
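
The arithmetic of this toy example can be reproduced with a short Python sketch. It assumes that "normalization" here means rescaling each sample's annotated reads back to a fixed total of 10, which is simple total-count scaling and not DESeq2's own procedure; the function name `rescale_without` is hypothetical.

```python
from collections import Counter

# Each number is one read, labelled by the gene it comes from (toy data from the post).
reads_a = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]
reads_b = [1, 1, 1, 1, 2, 3, 3, 3, 3, 3]

def rescale_without(reads, missing_gene, depth=10):
    """Drop reads from the unannotated gene, then rescale the rest to `depth` reads.

    Assumption: this mimics naive total-count scaling, not DESeq2.
    """
    counts = Counter(r for r in reads if r != missing_gene)
    total = sum(counts.values())
    return {gene: n * depth / total for gene, n in sorted(counts.items())}

scaled_a = rescale_without(reads_a, missing_gene=3)
scaled_b = rescale_without(reads_b, missing_gene=3)
print(scaled_a)  # {1: 5.0, 2: 5.0}
print(scaled_b)  # {1: 8.0, 2: 2.0}
```

Gene 1 has 4 reads in both samples before rescaling, but 5 versus 8 afterwards, which is the apparent over-expression in B described above.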

How can I deal with this kind of problem?

Should I add a line in my table with "unmapped reads" to have a better normalization?

16 months ago by
Asaf5.3k wrote:

I'll start from the end: adding unmapped reads will not help with normalization.

And for the main question: DESeq2 normalizes with the median of the per-gene count ratios between samples, which assumes that most genes have the same expression level in A and B. If that assumption holds for your data, you are safe using DESeq2. You can start checking it by plotting expression levels in A against B and confirming that the points correlate well. I think you'll be fine using DESeq2 normalization.
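
To illustrate why the median is robust here, below is a minimal Python sketch of median-of-ratios size factors on made-up counts. It is a simplified illustration, not DESeq2's actual code: the gene names, the counts, and the helper names `size_factors` and `depth_ratio` are all hypothetical, and it ignores any length correction that a tximport-based workflow may apply.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors; counts maps gene -> [count per sample]."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if min(gene_counts) == 0:
            continue  # a zero count gives no finite geometric mean, so skip the gene
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]

def depth_ratio(counts):
    """Total counts in sample B divided by total counts in sample A."""
    totals = [sum(col) for col in zip(*counts.values())]
    return totals[1] / totals[0]

# Hypothetical counts for [A, B]: g1-g5 are unchanged (B is sequenced twice as
# deep), while g6 is strongly expressed in A only.
counts = {
    "g1": [100, 200], "g2": [200, 400], "g3": [50, 100],
    "g4": [400, 800], "g5": [80, 160], "g6": [5000, 10],
}
# Same data with g6 missing from the annotation.
annotated = {g: c for g, c in counts.items() if g != "g6"}

sf_all = size_factors(counts)
sf_annot = size_factors(annotated)
print([round(s, 3) for s in sf_all])    # [0.707, 1.414]
print([round(s, 3) for s in sf_annot])  # [0.707, 1.414]
print(round(depth_ratio(counts), 3), depth_ratio(annotated))  # 0.286 2.0
```

In this toy case the size factors are the same whether or not g6 is annotated, because the median ignores a single extreme gene, whereas the ratio of total counts changes a lot. If many genes differed between A and B, the median would shift too, which is why the assumption above is worth checking.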

To get better results you would of course want a better annotation of your genome; you can build one fairly easily from the transcriptome data that you already have.


My expression levels should be similar, since I am working on different tissues of the same organism.

But building a new GFF with Cufflinks and using it to improve my results seems like a good option!

written 16 months ago by corend70
16 months ago by
h.mon24k wrote:

If you are certain the culprit is an incomplete annotation, you can use Cufflinks or StringTie (recommended) to do a reference annotation-based transcript assembly (RABT assembly), then use this extended transcript set for the kallisto / tximport / DESeq2 workflow.

It may be, however, that you have other problems, for example a high proportion of rRNA in your sequencing data. Did you check for other issues?


I don't know the proportion of rRNA in my data, but the libraries were prepared from purified polyA RNAs.

As you and the previous answer suggested, I will build a new GFF with Cufflinks; it seems to be the best option!

written 16 months ago by corend70