Question: De novo transcriptome assembly and differential expression analysis.
intern0 (Switzerland) wrote 17 days ago:

Hello all!

I am writing this post to lay out some concerns about the methodology I am following. My goal is to perform a de novo assembly (there is no reference transcriptome or genome yet for my target organism) for 2 different conditions with 5 samples in each condition, and then to run functional annotation and differential expression analysis.

So far, I have run several assemblies (after running FastQC, of course) and selected the best one in terms of BUSCO and TransRate scores. I then ran htseq-count on my BAM file, which contains all of the mapped reads.

Now I don't know whether it is better to perform functional annotation first or to run differential expression analysis (using DESeq2) first. In fact, I have no idea what difference the order makes.

I have also noticed in my htseq-count files that many transcripts have only very few reads mapping to them. Should I remove these transcripts from my FASTA file, so that I gain processing time and avoid processing meaningless data? Or is it maybe better to use "the good transcripts" identified by TransRate to build my final FASTA?

Finally, I have seen in different posts people running the differential analysis on htseq-count output at "gene level". I did it at transcript level, which seemed to be the only possibility (to do it at gene level you need a reference transcriptome, right?).

I hope it is not too many questions, and that I am clear enough.

Thanks for helping me see this more clearly!

rna-seq assembly gene • 158 views
modified 15 days ago • written 17 days ago by intern0

OK, thanks for your advice. I would also like to create a reference transcriptome for my target organism. So, if I understand correctly, I can filter out low-expression transcripts (according to TPM and length, for instance) for this reference transcriptome and annotate it.

But for differential expression it is better to keep all genes for statistical reasons (except maybe redundant transcripts). Why is that? I would have thought it is better to have a clean transcriptome, with no transcripts expressed at extremely low levels and no very short transcripts.

Thanks again!

Reply • modified 15 days ago • written 15 days ago by intern0

Personally, I would not take expression values into account when filtering to get to a reference transcriptome. There will certainly be valid transcripts that show only very low expression (that is biology). Filtering on size is something you could consider to remove some spurious hits, but here too there might/will be transcripts that are simply short and are still true transcripts.

As you might understand by now, defining a reference transcriptome is kinda tricky if you only have a transcriptome assembly (and thus no genome to confirm things). Also, what you consider a reference transcriptome might not be what the next person thinks it is (e.g. do you keep long ncRNAs, or only protein-coding transcripts, ...).

It's very hard to define a rule-set to get to a/the reference transcriptome, so what most people will do is to get the set as clean as possible and work with that, keeping the limitations of all this in mind.
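
To make the TPM/length discussion concrete, here is a minimal sketch that computes TPM from raw counts and transcript lengths and then flags (rather than deletes) transcripts that fall under a cutoff, so you can inspect them before deciding anything. The function names, the `min_tpm=1.0` and `min_length=200` cutoffs, and the use of raw length instead of effective length are all assumptions made for illustration, not recommended values.

```python
def tpm(counts, lengths):
    """TPM_i = (count_i / length_i) / sum_j(count_j / length_j) * 1e6.

    Simplification (assumption): raw transcript length is used,
    not the effective length that quantification tools estimate.
    """
    rates = {t: counts[t] / lengths[t] for t in counts}
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in counts}
    return {t: rate / total * 1e6 for t, rate in rates.items()}


def flag_transcripts(counts, lengths, min_tpm=1.0, min_length=200):
    """Return {transcript: [reasons]}; an empty list means nothing was flagged.

    The cutoffs are illustrative placeholders only.
    """
    values = tpm(counts, lengths)
    flags = {}
    for t in counts:
        reasons = []
        if values[t] < min_tpm:
            reasons.append("low_tpm")
        if lengths[t] < min_length:
            reasons.append("short")
        flags[t] = reasons
    return flags


# Toy data with hypothetical transcript names.
counts = {"a": 1000, "b": 1000, "c": 0}
lengths = {"a": 1000, "b": 500, "c": 150}
for name, reasons in flag_transcripts(counts, lengths).items():
    print(name, reasons)
```

Keeping the flags separate from the FASTA means a lowly expressed but real transcript is never lost just because a threshold was set too strictly.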

Reply • written 15 days ago by lieven.sterck (9.1k)
lieven.sterck (9.1k; VIB, Ghent, Belgium) wrote 17 days ago:

"I don't know if it is better to perform functional annotation or to run differential expression analysis"

These are actually two different and independent analyses; they have nothing to do with each other. You can do DEG analysis without having done functional annotation and vice versa. Doing both is, however, the best approach: if your DEG analysis gives you some "interesting" genes, you will likely want to know what they are/do, and that is where functional annotation comes into play.

"Should I remove these transcripts from my fasta file"

Nah, that is not necessary. The processing time gained by removing a (likely small) number of genes will not be much; moreover, the statistics of DEG analysis work best if you take all genes into account (some time might need to be spent, though, on removing redundancy from your gene set).
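
If you want to check how many features are actually affected before deciding, a rough sketch like the following can summarise it. It relies on the htseq-count output format (one tab-separated `feature<TAB>count` pair per line, with summary rows such as `__no_feature` starting with two underscores). The helper names and the `min_total=10` cutoff are assumptions chosen purely for illustration.

```python
import io


def read_htseq_counts(handle):
    """Parse htseq-count output into {feature: count}."""
    counts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("__"):
            # Skip blank lines and the summary rows (e.g. __no_feature).
            continue
        feature, value = line.split("\t")
        counts[feature] = int(value)
    return counts


def low_count_fraction(samples, min_total=10):
    """Fraction (and sorted list) of features whose count summed over
    all samples is below min_total. The cutoff is an arbitrary example."""
    features = set()
    for sample in samples:
        features.update(sample)
    if not features:
        return 0.0, []
    low = sorted(
        f for f in features
        if sum(sample.get(f, 0) for sample in samples) < min_total
    )
    return len(low) / len(features), low


# Toy input standing in for two htseq-count files (hypothetical IDs).
sample_a = io.StringIO("t1\t120\nt2\t3\nt3\t0\n__no_feature\t50\n")
sample_b = io.StringIO("t1\t98\nt2\t4\nt3\t1\n__ambiguous\t7\n")
samples = [read_htseq_counts(sample_a), read_htseq_counts(sample_b)]

fraction, low = low_count_fraction(samples)
print(f"{fraction:.0%} of features are below the cutoff: {low}")
```

This only reports the numbers; it does not touch the FASTA or the count files.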

"I did it on transcript level which seemed to be the only possibility (to do it on gene level you need a reference transcriptome, right?)"

Well, yes, you need a reference transcriptome, but that is what you are assembling, no? OK, sure, it will not have the quality of a reference transcriptome into which lots of time and effort have been put, but still. The key to doing this analysis at gene level is that you take care to remove as much redundancy as possible from your dataset.
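
As a rough illustration of what "gene level" can mean for a de novo assembly, the sketch below sums transcript counts per gene. It assumes Trinity-style transcript IDs, where the gene ID is the transcript ID without its trailing `_i<N>` isoform suffix; the thread does not say which assembler was used, so treat that mapping (and the function names) as hypothetical and swap in your own transcript-to-gene function. Simple summing is also a simplification.

```python
import re
from collections import defaultdict

# Assumption: isoforms are marked by a trailing "_i<number>" (Trinity-style).
_ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def gene_id_from_transcript(transcript_id):
    """Hypothetical mapping: strip the trailing isoform suffix."""
    return _ISOFORM_SUFFIX.sub("", transcript_id)


def sum_to_gene_level(transcript_counts, to_gene=gene_id_from_transcript):
    """Collapse {transcript: count} into {gene: summed count}."""
    gene_counts = defaultdict(int)
    for transcript_id, count in transcript_counts.items():
        gene_counts[to_gene(transcript_id)] += count
    return dict(gene_counts)


# Toy counts with made-up IDs.
transcript_counts = {
    "TRINITY_DN10_c0_g1_i1": 40,
    "TRINITY_DN10_c0_g1_i2": 15,
    "TRINITY_DN10_c0_g2_i1": 7,
}
print(sum_to_gene_level(transcript_counts))
```

The resulting per-gene table has the same shape as a per-transcript one, so it can be fed to the same downstream DEG step.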

modified 15 days ago • written 17 days ago by lieven.sterck (9.1k)

OK, thanks. I don't know if it is also relevant, but I have 500,000,000 transcripts, whereas the reference transcriptomes of closely related species have only 250,000; that's why I am trying to reduce my data. But I think that after CD-HIT at 95% and filtering according to mapping quality, there are not many other options left except expression level.

Reply • written 11 days ago by intern0
Powered by Biostar version 2.3.0