For the first time, I am working on some de novo mRNA-seq data that has already been analyzed by a company (the same one that performed the sequencing). The samples are from a species for which we don't have an annotated genome. They assembled the transcriptome with Trinity (which resulted in more than 100,000 different transcripts) and then performed differential expression analysis with DESeq. The transcripts were annotated using the NR, NT, Pfam, GO and KEGG databases, and I have the top 10 hits for each (where present). I also have information on the coding potential of all of the transcripts (done with BLASTX and TransDecoder, resulting in 37,000 sequences with coding potential).
I also have some mRNA-seq data from zebrafish, where all reads were aligned to the known genome (around 35,000 genes). The experiment is the same, so the objective is to compare the differentially expressed genes across the two species.
Now, 100,000 transcripts sounds like a lot, so I was wondering 4 things:
1) is this normal with de novo assemblies and if so what is it due to?
2) since many of these transcripts aligned (with 70-90% similarity) to either intronic or intergenic regions of other species' genomes, would it make sense to only consider the sequences with coding potential (BLASTX/TransDecoder results) for differential expression analysis?
3) would it otherwise make sense to get rid of all transcripts with a high (i.e. poor) BLAST E-value, meaning those that did not align convincingly to any database entry? I've seen papers consider only hits with E-value < 1e-5 and others only those with E-value < 1e-30
4) in general, up to what E-value cutoff can I trust a BLAST result?
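To make questions 2 to 4 concrete, here is a minimal sketch of the kind of filtering I have in mind. It assumes the BLAST hits are available in tabular format (`-outfmt 6`, where the 11th column is the E-value), which I have not confirmed for my data. The transcript IDs, hit names and function names are made up for illustration, and the two thresholds are simply the ones I have seen in papers, not recommendations.

```python
import csv
import io

# Hypothetical BLAST tabular output (-outfmt 6). Columns:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
# All IDs and values below are invented for illustration.
EXAMPLE_BLAST_TSV = """\
TRINITY_DN1_c0_g1_i1\thitA\t92.1\t300\t20\t1\t1\t300\t5\t304\t1e-40\t250
TRINITY_DN1_c0_g1_i1\thitB\t75.0\t120\t30\t0\t1\t120\t10\t129\t1e-6\t60
TRINITY_DN2_c0_g1_i1\thitC\t70.3\t90\t25\t2\t1\t90\t1\t90\t0.002\t35
TRINITY_DN3_c0_g1_i1\thitD\t81.5\t200\t35\t1\t1\t200\t3\t202\t1e-12\t110
"""


def best_evalue_per_transcript(handle):
    """Return {transcript_id: smallest E-value among its hits}."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        query, evalue = row[0], float(row[10])
        if query not in best or evalue < best[query]:
            best[query] = evalue
    return best


def keep_transcripts(best, max_evalue, coding_ids=None):
    """Keep transcripts whose best hit has E-value <= max_evalue.

    If coding_ids is given (e.g. IDs flagged as coding by TransDecoder),
    additionally restrict the result to that set.
    """
    kept = {q for q, e in best.items() if e <= max_evalue}
    if coding_ids is not None:
        kept &= set(coding_ids)
    return kept


best = best_evalue_per_transcript(io.StringIO(EXAMPLE_BLAST_TSV))
lenient = keep_transcripts(best, 1e-5)   # the looser cutoff I have seen
strict = keep_transcripts(best, 1e-30)   # the stricter cutoff I have seen
print(sorted(lenient))
print(sorted(strict))
```

With this toy input the looser cutoff keeps two transcripts and the stricter one keeps only one, which is exactly why I am unsure which threshold (if any) is appropriate before the differential expression step.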
I hope I explained myself clearly enough - I'm a bioinformatics/RNAseq newbie but I'd like to learn :) Thank you!!!