Hi all,
For the first time, I am working on some de novo mRNAseq data already analyzed by a company (the same that perfomed the sequencing). The samples are from a species for which we don't have an annotated genome. They assembled the transcriptome with trinity (which resulted in more than 100.000 different transcripts) and then they performed differential expression analysis with DESeq. The transcripts were identified using NR, NT, Pfam, GO, KEGG databases and I have the information of the first 10 best hits for each (if present). I also have information on the coding potential of all of the transcripts (done with BLASTX and TransDecoder, resulting in 37.000 sequences with coding potential).
I also have some mRNA-seq from zebrafish, where reads were all aligned to the known genome (around 35.000 genes). The experiment is the same, so the objective is to compare the differentially expressed genes across the two species.
Now, 100.000 transcripts sounds like a lot, so I was wondering 3 things:
1) is this normal with de novo assemblies and if so what is it due to?
2) since many of these transcripts aligned (with 70-90% similarity) to either intronic or intergenic regions of other species' genomes, would it make sense to only consider the sequences with coding potential (BLASTX/transdecoder results) for differential expression analyses?
3) would it otherwise make sense to get rid of all transcripts with low E-value (BLAST)? So those that were not successfully aligned to any database hit? I've seen papers consider only those with e-value <e-5 and others <e-30
4) in general, what would be the cutoff of E-value up to which I can trust the BLAST result?
I hope I explained myself clearly enough - I'm a bioinformatics/RNAseq newbie but I'd like to learn :) Thank you!!!
Thank you for the observations. I did not fully realize the difference between transcripts and genes until you pointed it out - but then I guess that means in the zebrafish analyses, the different transcript variants are all counted together. While in the de novo, they are counted separately. Making it harder to compare the DE then, did I get it right?
Regarding your point about DE algorithms, my plan was to compare the DEseq results between the two species, so for instance see if gene X that is DE in zebrafish is also DE in the other species, as well as compare the GO and KEGG enrichment analyses for both.
In case I only compare full length transcripts for both species, would I run a separate DEG analysis only considering these genes, or would I start from the DEG analysis done with all available transcripts, then only look at those that are full length?
Thank you!!!
Ah! Sorry, I misunderstood your experimental design - I thought you were directly comparing expression in species X to expression in zebrafish, not DE in species X to DE in Zebra fish. In that case it should be fine.
As regards the transcript/gene issue.... Two appraoches you might consider are clustering together the species X transcripts into genes. This would be easy for those that aligned to Zebrafish genes, and I guess you don't care about those that don't! Alterntaively you could quantify zebrafish transcripts rather than genes. You can use Salmon, Kalisto or RSEM to do this.
That makes sense, thanks!
I will look into clustering the transcripts together into genes, however many aligned to genes from other (more closely related probably) species. Thanks for the input
Hi, I went through the company's pipeline again and realized that they are actually using Corset to cluster the transcripts into genes after trinity assembly. This means that my 100.000 transcripts are actually (likely?) genes.
This should make DE comparison with zebrafish easier. But it also means that I have many more "genes" identified. Would this affect the DE analysis? Many of these genes remain unidentified. 30% are annotated in NR and 10% in Pfam, 80% in NT (but this includes similarities with genomic sequences of other species...)
Would it make sense to get rid of the genes that are not annotated, or that show no coding potential before running DE analysis? Or should I keep them anyways and disregard them afterwards?
Depends on your aims, but if you are interested in how similar the transcriptional responses are between Zebrafish and your species of interest, then there isn't much point in looking at genes that arn't present in Zebrafish.