I'm embarking on a RNAseq project and am trying to really figure out what the output of the de novo assembly program is before mapping read counts. Briefly, I've used rlsim to simulated Illumina reads for a set of 100 known mRNA transcripts. I've then run a standard pipeline of trimming-filtering the simulated reads, and then generating the assembly using velvet-oases. I also compared the performance of using all reads with digitally-normalized reads using khmer. To evaluate the assembly, I BLASTed the oases assembled transcripts against the known transcripts.
Of the 100 known, I get 58 reconstructed from the assembly using all reads, and 70 from the assembly using diginorm reads. In both cases, about 90% of the length of the original transcript is mapped. Nice!
Here's the problem - for both approaches, I get many more transcripts (234 and 270 for the 'all' and 'diginorm' approaches respectively) than I should (started with 100 transcripts). Needless to say, this causes problems with mapping reads due to multiple alignments.
My solution is to use cd-hit-est which nicely collapses these to 78 and 102 transcripts, respectively.
But shouldn't this not be happening at all? How to I get oases to avoid making these incorrect isoforms in the first place?
What do I do with real data where I likely do have some true isoforms?
FYI - my scripts for this simulation are available at https://github.com/johnstantongeddes/sim-transcriptome though as always, it's a work in progress.