Hey, I'm currently having problems evaluating my Trinity assemblies, so I hope someone can help me here.
I am working with paired-end RNA-seq data (roughly 600 million reads in total) from Pseudomonas putida. I ran a de novo Trinity assembly with the default in silico normalisation (coverage = 50x) and RF read orientation. I also ran Trinity in genome-guided mode with the reference genome of a closely related P. putida strain. After the assemblies finished, I ran Trinity's basic statistics, which reported 8000 "Trinity genes" and 12000 transcripts. While the transcript count shouldn't be a problem, I am wondering about the gene count: I would have expected about 5000 genes for my strain.
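To double-check the gene and transcript counts independently of the statistics script, a minimal Python sketch like the following could tally them directly from the FASTA headers. It assumes Trinity-style sequence IDs of the form TRINITY_DN1000_c115_g5_i1, where everything before the trailing _i<number> is the "gene" and the full ID is the transcript (isoform). The function name and the file path in the usage note are hypothetical.

```python
import re
from collections import defaultdict

# Assumed Trinity-style ID: <prefix>_g<gene number>_i<isoform number>,
# e.g. TRINITY_DN1000_c115_g5_i1
_ID_RE = re.compile(r"^(?P<gene>\S+_g\d+)_i\d+$")


def count_genes_and_isoforms(fasta_lines):
    """Return {gene_id: number_of_isoforms} from an iterable of FASTA lines."""
    isoforms_per_gene = defaultdict(int)
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        parts = line[1:].split()
        if not parts:
            continue  # skip empty header lines
        seq_id = parts[0]
        match = _ID_RE.match(seq_id)
        # Fall back to the full ID if the header does not follow the assumed scheme
        gene = match.group("gene") if match else seq_id
        isoforms_per_gene[gene] += 1
    return dict(isoforms_per_gene)
```

Usage might look like this (the path is a placeholder): open the assembly with `open("Trinity.fasta")`, pass the file handle to `count_genes_and_isoforms`, then compare `len(result)` (genes) with `sum(result.values())` (transcripts).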
Before I used in silico normalisation, I assembled with a much higher coverage, and this resulted in even more genes (about 11000).
I'm wondering what causes this (the same regions/genes being assembled more than once?). I ran UCLUST on one assembly, and it reported a lot of sequences with high identity (>95%) to other sequences in the assembly.
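To see which contigs are near-duplicates of which, the clustering output could be summarised with a short sketch like the one below. It assumes the tab-separated .uc format with ten fields, where "H" records are hits, field 4 is the percent identity, field 9 is the query label and field 10 is the target (centroid) label. The function name and the 95% default are illustrative, not part of UCLUST.

```python
from collections import defaultdict


def redundant_members(uc_lines, min_identity=95.0):
    """Group hit sequences under their centroid from .uc-style lines.

    Assumes 10 tab-separated fields per record; only 'H' (hit) records
    with identity >= min_identity are kept.
    """
    members = defaultdict(list)
    for line in uc_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 or fields[0] != "H":
            continue  # seed (S) and cluster (C) records carry no identity value
        identity = float(fields[3])
        if identity >= min_identity:
            # fields[9] = centroid label, fields[8] = member (query) label
            members[fields[9]].append(fields[8])
    return dict(members)
```

Checking whether the members of a cluster share a Trinity gene ID with their centroid, or belong to different "genes", would show whether the redundancy sits within genes (isoforms) or across them.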
I'm planning to do a differential expression analysis, so how could this affect it?
Thank you very much for any help.