Hello all!
I am writing this post to lay out my concerns about the methodology I am following. My goal is to perform a de novo transcriptome assembly (there is no reference transcriptome or genome yet for my target organism) with 2 conditions and 5 samples per condition, and then to run functional annotation and differential expression analysis.
So far, I have run several assemblies (after checking the reads with FastQC, of course) and selected the best one based on its BUSCO and TransRate scores. I then ran htseq-count on my BAM file, which contains all of the mapped reads.
Now I don't know whether it is better to perform functional annotation first or to run the differential expression analysis (using DESeq2) first. In fact, I have no idea what difference the order makes.
I have also noticed in my htseq-count files that many transcripts have only very few reads mapping to them. Should I remove these transcripts from my FASTA file, to save processing time and avoid processing meaningless data? Or would it be better to build my final FASTA from the "good transcripts" identified by TransRate?
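A minimal sketch of the low-count filtering described above, applied to the count table rather than to the FASTA file. The function name, the `min_total` threshold and the toy counts are made-up placeholders, not recommended values.

```python
def filter_low_counts(counts, min_total=10):
    """Keep transcripts whose summed count across samples reaches min_total.

    counts: dict mapping transcript ID -> list of per-sample read counts.
    min_total: hypothetical cutoff; it should be tuned to the data at hand.
    """
    return {tid: c for tid, c in counts.items() if sum(c) >= min_total}


# Toy example: three transcripts, four samples (invented numbers).
toy_counts = {
    "tr1": [120, 98, 143, 110],
    "tr2": [0, 1, 0, 2],
    "tr3": [5, 3, 4, 0],
}
kept = filter_low_counts(toy_counts)
print(sorted(kept))
```

Filtering the table this way leaves the assembly itself untouched, so the dropped transcripts are still available for annotation later.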
Finally, I have seen in different posts people running the differential analysis on htseq-count output at the "gene level". I did it at the transcript level, which seemed to be the only possibility (to do it at the gene level you need a reference transcriptome, right?).
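For illustration, gene-level counts can in principle be obtained by summing transcript counts per gene, provided some transcript-to-gene grouping exists. The sketch below assumes such a map is available (for example, one derived from the assembler's own clustering of isoforms); the IDs, the map and the function name are all hypothetical.

```python
from collections import defaultdict


def sum_to_gene_level(transcript_counts, transcript_to_gene):
    """Sum per-transcript counts into per-gene counts for one sample.

    transcript_counts: dict transcript ID -> read count.
    transcript_to_gene: dict transcript ID -> gene ID (hypothetical map).
    Transcripts missing from the map are kept under their own ID.
    """
    gene_counts = defaultdict(int)
    for tid, count in transcript_counts.items():
        gene_counts[transcript_to_gene.get(tid, tid)] += count
    return dict(gene_counts)


# Invented IDs, for illustration only.
tx_counts = {"g1_i1": 40, "g1_i2": 15, "g2_i1": 7, "orphan": 3}
tx2gene = {"g1_i1": "g1", "g1_i2": "g1", "g2_i1": "g2"}
gene_level = sum_to_gene_level(tx_counts, tx2gene)
print(gene_level)
```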
I hope this is not too many questions, and that I have been clear enough.
Thanks for helping me see this more clearly!
OK, thanks for your advice. I would also like to create a reference transcriptome for my target organism. So, if I understand correctly, I can filter out low-expression transcripts (based on TPM and size, for instance) for this reference transcriptome and then annotate it.
But for differential expression it is better to keep all genes for statistical reasons (except maybe redundant transcripts). Why is that? I would have thought it is better to have a clean transcriptome, with no extremely lowly expressed transcripts and no very short transcripts.
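For reference, the TPM values mentioned above can be computed from raw counts and transcript sizes roughly as follows. This is a simplified sketch that uses the full transcript length rather than an effective length, and the counts, lengths and function name are invented for the example.

```python
def counts_to_tpm(counts, lengths):
    """Convert raw counts to TPM for one sample.

    counts: dict transcript ID -> read count.
    lengths: dict transcript ID -> transcript length in nucleotides.
    Each count is divided by its length, then the rates are rescaled
    so that they sum to one million.
    """
    rates = {tid: counts[tid] / lengths[tid] for tid in counts}
    total = sum(rates.values())
    if total == 0:
        return {tid: 0.0 for tid in counts}
    return {tid: rate / total * 1e6 for tid, rate in rates.items()}


# Invented numbers: tr1 and tr2 have equal counts, but tr2 is twice as long.
toy_counts = {"tr1": 500, "tr2": 500, "tr3": 0}
toy_lengths = {"tr1": 1000, "tr2": 2000, "tr3": 300}
tpm = counts_to_tpm(toy_counts, toy_lengths)
for tid, value in tpm.items():
    print(tid, round(value, 1))
```

Because of the length normalisation, tr1 ends up with twice the TPM of tr2 despite identical read counts, which is why TPM and size are not independent filters.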
Thanks again!
Personally, I would not take expression values into account when filtering to get to a reference transcriptome. There will certainly be valid transcripts that only show very low expression (that is biology). Filtering on size is something you could consider to remove some spurious hits, but here too there will be transcripts that are simply short and are still true transcripts.
As you might understand by now, defining a reference transcriptome is kind of tricky if you only have a transcriptome assembly (and thus no genome to confirm things). Also, what you consider a reference transcriptome might not be what the next person thinks it is (e.g. do you keep long ncRNAs, or only protein-coding transcripts, ...?).
It's very hard to define a rule set that gets you to a/the reference transcriptome, so what most people do is get the set as clean as possible and work with that, keeping its limitations in mind.
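To make the size-only cleanup concrete, here is a minimal sketch that splits FASTA records into kept and flagged sets by length while ignoring expression. The 200 nt cutoff, the function names and the sequences are arbitrary placeholders; short records are flagged rather than deleted so they can still be reviewed, since some of them may be true transcripts.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, min_len=200):
    """Return (kept, flagged) lists; flagged records are shorter than min_len."""
    kept, flagged = [], []
    for header, seq in records:
        (kept if len(seq) >= min_len else flagged).append((header, seq))
    return kept, flagged


# Invented sequences; in practice the lines would come from an open FASTA file.
toy_fasta = [">tr1", "ACGT" * 60, ">tr2", "ACGTACGT", "ACGT", ">tr3", "A" * 200]
kept, flagged = split_by_length(read_fasta(toy_fasta))
print([h for h, _ in kept], [h for h, _ in flagged])
```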