Hey, I currently have problems evaluating my Trinity assemblies, so I hope someone can help me here.
I am working with paired-end RNA-seq data (roughly 600 million reads in total) from Pseudomonas putida. I ran de novo Trinity with the default in silico normalisation (coverage = 50x) and RF read orientation. I also ran Trinity in genome-guided mode with a reference genome of a closely related P. putida strain. After the assemblies finished I ran Trinity's basic statistics, and it reports 8000 "Trinity genes" and 12000 transcripts. While the transcript count shouldn't be a problem, I am wondering about the gene count. I would have expected about 5000 genes for my strain.
Before I used in silico normalisation I had assembled with a much higher coverage, and this resulted in even more genes (about 11000).
I'm wondering what causes this (are the same regions/genes being assembled more than once?). I ran UCLUST on one assembly and it reported a lot of sequences with high identity (>95%) to other sequences in the assembly.
I'm planning to do a differential expression analysis, so how could this affect it?
Thank you very much for any help.
Greetings, Alexander
Since you don't expect any splicing, have you tried a SPAdes assembly of the data? Bacteria are gene dense, so you should get a good assembly, although there may be some gaps because this is RNA-seq data. You have far more data than a small genome needs, so use only a subset of the reads with SPAdes.
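One way to take a smaller amount of data is to subsample read pairs before assembly. The following is only a minimal sketch, assuming uncompressed, well-formed four-line FASTQ records with mates in the same order in both files; the function names are made up for illustration, and a dedicated tool such as seqtk can do the same job.

```python
import random
from itertools import islice

def read_fastq(handle):
    """Yield FASTQ records as 4-line tuples.

    Assumes `handle` is an open file (or other line iterator) with
    well-formed 4-line records.
    """
    while True:
        record = tuple(islice(handle, 4))
        if len(record) < 4:
            return
        yield record

def subsample_pairs(r1_handle, r2_handle, k, seed=0):
    """Reservoir-sample k read pairs, keeping the two mates together."""
    rng = random.Random(seed)
    reservoir = []
    pairs = zip(read_fastq(r1_handle), read_fastq(r2_handle))
    for i, pair in enumerate(pairs):
        if i < k:
            reservoir.append(pair)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = pair
    return reservoir
```

The sampled pairs can then be written back to two FASTQ files and used as input for the assembler. Only k pairs are held in memory, so this also works on very large read sets.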
You could also map the reads to known P. putida genome(s) (there must be some in GenBank) and create a consensus from that alignment.
There have to be Pseudomonas putida reference genomes you can use. A de novo assembly with SPAdes, compared against a trusted Pseudomonas genome using a tool like Mauve, can give you confidence in using your own assembled genome for your purpose.
But I am wondering what to do if you use this approach on a plant genome, several Gb in length, for which no reference genome is available. In that case, Trinity will give you several copies of transcripts with few differences, which could correspond either to isoforms or to different assemblies of the same transcript.
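To see how many such near-duplicate transcripts Trinity has already grouped under one "gene", the FASTA headers can be tallied. This is a rough sketch, assuming the usual Trinity ID pattern (for example TRINITY_DN1000_c115_g5_i1, where the part before the _i suffix is the Trinity gene); the function name is hypothetical.

```python
import re
from collections import Counter

# Assumed Trinity ID layout: <gene>_i<N>, e.g. TRINITY_DN1000_c115_g5_i1.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")

def isoforms_per_gene(fasta_lines):
    """Count transcripts per Trinity gene from the FASTA header lines."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            transcript_id = line[1:].split()[0]
            gene_id = ISOFORM_SUFFIX.sub("", transcript_id)
            counts[gene_id] += 1
    return counts
```

Genes with many "isoforms" in an organism without splicing are good candidates for redundant assemblies, although this count alone cannot tell the two cases apart.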
Alexander's is a very good question because, in this case, if you want to do a DE analysis, you will end up with counts mapped to several contigs, which are usually discarded in the downstream analysis.
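One possible alternative to discarding those contigs is to sum their counts per cluster before the DE analysis. The sketch below assumes you already have a contig-to-cluster mapping (for instance from a clustering step at high identity) and per-contig read counts; both dictionaries and the function name are hypothetical, and whether summing is appropriate depends on how your quantification tool handles multi-mapping reads.

```python
from collections import defaultdict

def sum_counts_by_cluster(contig_counts, contig_to_cluster):
    """Collapse per-contig counts into per-cluster counts.

    Contigs missing from the mapping are kept as their own cluster.
    """
    cluster_counts = defaultdict(int)
    for contig, count in contig_counts.items():
        cluster = contig_to_cluster.get(contig, contig)
        cluster_counts[cluster] += count
    return dict(cluster_counts)

# Example with made-up contig names and counts:
counts = {"contigA": 120, "contigB": 30, "contigC": 50}
clusters = {"contigA": "cluster1", "contigB": "cluster1"}
print(sum_counts_by_cluster(counts, clusters))
```

The resulting per-cluster table can be fed to the DE tool in place of the per-contig table, so redundant contigs contribute to one feature instead of being dropped.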