Question

De novo transcriptome assembly and differential expression analysis.

3.4 years ago

intern • 0

Hello all!

I am writing this post to lay out my concerns about the methodology I am following. My goal is to perform a de novo assembly (there is no reference transcriptome or genome yet for my target organism) from 2 different conditions with 5 samples in each condition, and then to run functional annotation and differential expression analysis.

So far, I have run several assemblies (after running FastQC, of course) and selected the best one in terms of BUSCO and TransRate scores. I then ran htseq-count on my BAM file, which contains all of the mapped reads.

Now I don't know whether it is better to perform functional annotation first or to run the differential expression analysis (using DESeq2) first. In fact, I have no idea what difference the order makes.

I have also noticed in my htseq-count files that many transcripts have only very few reads mapping to them. Should I remove these transcripts from my FASTA file, to save processing time and avoid processing meaningless data? Or is it maybe better to use the "good transcripts" identified by TransRate to build my final FASTA?

Finally, I have seen in different posts people running the differential analysis on htseq-count output at "gene level". I did it at transcript level, which seemed to be the only possibility (to do it at gene level you need a reference transcriptome, right?).

I hope it is not too many questions, and that I am clear enough.

Thanks for helping me see this more clearly!

RNA-Seq Assembly assembly gene • 918 views

ADD COMMENT • link 3.4 years ago by intern • 0

OK, thanks for your advice. I would also like to create a reference transcriptome for my target organism. So, if I understand correctly, I can filter out low-expression transcripts (according to TPM and size, for instance) for this reference transcriptome and annotate it.

But for differential expression it is better to keep all genes for statistical reasons (except maybe redundant transcripts). Why is that? I would have thought it is better to have a clean transcriptome, with no extremely lowly expressed transcripts and no very short transcripts.

Thanks again !

ADD REPLY • link 3.4 years ago by intern • 0

Personally, I would not take expression values into account when filtering to get to a reference transcriptome. There will for sure be valid transcripts that only show very low expression (that is biology). Filtering on size is something you could consider to remove some spurious hits, but here too there might/will be transcripts that are just short and are still true transcripts.

As you might understand by now, defining a reference transcriptome is kinda tricky if you only have a transcriptome assembly (and thus no genome to confirm things). Also, what you consider a reference transcriptome might not be what the next person thinks it is (e.g. do you keep long ncRNAs, or only protein-coding transcripts?).

It's very hard to define a rule-set to get to a/the reference transcriptome, so what most people will do is to get the set as clean as possible and work with that, keeping the limitations of all this in mind.
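If you do decide to try a size filter, here is a minimal Python sketch of what that could look like. It is only an illustration: the function names, the toy sequences and the `min_len` cutoff are made up for the example, and the right cutoff (if any) depends on your data. It splits the records into two lists rather than deleting anything, so the short transcripts can still be inspected afterwards.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, min_len):
    """Return (kept, short): records with len >= min_len, and the rest."""
    kept, short = [], []
    for header, seq in records:
        (kept if len(seq) >= min_len else short).append((header, seq))
    return kept, short


# Toy input; with a real assembly you would pass an open file handle instead.
example = """>tx1
ACGTACGTAC
>tx2
ACG
""".splitlines()

# min_len=5 only makes sense for this toy data (hypothetical cutoff).
kept, short = split_by_length(read_fasta(example), min_len=5)
print("kept:", [h for h, _ in kept])
print("short:", [h for h, _ in short])
```

Keeping the short set around makes it easy to check whether anything in it has a good annotation hit before you discard it.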

ADD REPLY • link 3.4 years ago by lieven.sterck 15k

score 1 · Answer 1 · 2020-11-18

I don't know if it is better to perform functional annotation or to run differential expression analysis

These are actually two different and independent analyses; neither depends on the other. You can do DEG analysis without having done functional annotation and vice versa. Doing both is the best approach, however: if your DEG analysis turns up some "interesting" genes, you will likely want to know what they are/do, and that is where functional annotation comes into play.

Should I remove these transcripts from my fasta file

Nah, that is not necessary. The processing time gained by removing (a likely small number of) genes will not be much; moreover, the statistics of DEG analysis work best if you take all genes into account (some time might need to be spent, though, on removing redundancy from your gene set).
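To get a feel for how many transcripts are actually affected without touching the FASTA, you could tally the counts across samples first. Below is a rough Python sketch, assuming the usual two-column, tab-separated htseq-count output with the `__`-prefixed summary lines at the end. The sample data and the `min_total` threshold are invented for the example, not a recommendation.

```python
def parse_htseq(lines):
    """Parse htseq-count output lines into {feature: count}."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        # Skip blanks and the summary rows (__no_feature, __ambiguous, ...).
        if not line or line.startswith("__"):
            continue
        feature, value = line.split("\t")
        counts[feature] = int(value)
    return counts


def total_counts(samples):
    """Sum counts per feature over a list of {feature: count} dicts."""
    totals = {}
    for sample in samples:
        for feature, n in sample.items():
            totals[feature] = totals.get(feature, 0) + n
    return totals


def low_count_features(totals, min_total):
    """Return the sorted features whose summed count is below min_total."""
    return sorted(f for f, n in totals.items() if n < min_total)


# Two made-up samples; real input would be the lines of each count file.
sample_a = "tx1\t120\ntx2\t0\ntx3\t4\n__no_feature\t50\n__ambiguous\t3\n".splitlines()
sample_b = "tx1\t98\ntx2\t2\ntx3\t7\n__no_feature\t41\n__ambiguous\t1\n".splitlines()

totals = total_counts([parse_htseq(sample_a), parse_htseq(sample_b)])
low = low_count_features(totals, min_total=10)  # hypothetical threshold
print(f"{len(low)} of {len(totals)} transcripts have a summed count below 10: {low}")
```

This only reports the low-count transcripts; whether to drop them from the count table before testing is a separate decision, and the assembly itself stays as it is.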

I did it on transcript level which seemed to be the only possibility (to do it on gene level you need a reference transcriptome, right?)

Well, yes, you need a reference transcriptome, but that is what you are assembling, no? OK, sure, it will not have the quality of a reference transcriptome that lots of time and effort has been put into, but still. The key to doing this analysis at gene level is to take care to remove as much redundancy as possible from your dataset.
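As a rough illustration of what "gene level" can mean for a de novo assembly: once you have some transcript-to-gene grouping (for example from clustering redundant transcripts, or from assembler IDs if your assembler encodes gene/isoform in them), the transcript counts can be summed per group. The Python sketch below assumes such a mapping already exists. The `tx2gene` dict, the IDs and the counts are all invented for the example.

```python
def sum_to_gene_level(transcript_counts, tx2gene):
    """Collapse {transcript: count} to {gene: count} using a tx -> gene map.

    Transcripts missing from the map are kept under their own ID, so no
    counts are silently dropped.
    """
    gene_counts = {}
    for tx, n in transcript_counts.items():
        gene = tx2gene.get(tx, tx)
        gene_counts[gene] = gene_counts.get(gene, 0) + n
    return gene_counts


# Hypothetical counts for one sample and a hypothetical grouping.
transcript_counts = {"tx1": 120, "tx2": 30, "tx3": 4, "tx4": 9}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts = sum_to_gene_level(transcript_counts, tx2gene)
print(gene_counts)
```

How good the gene-level table is depends entirely on how good the grouping is, which is why the redundancy removal matters so much here.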