First, a little background: I have done a de novo transcriptome assembly of marine snail venom duct tissue using Illumina HiSeq reads. I have 289 million paired-end (PE) reads of 100 bp, which I digitally normalized prior to assembly. I used Trinity and Velvet/Oases as my assemblers, and have done extensive BLAST searches of the Trinity assembly to identify putative marine snail toxins and to get a high-level overview of GO and KEGG terms for the rest of my hits. I am reasonably happy with the results thus far but, as always with de novo assembly, I am looking for ways to be sure my transcripts are valid, as I have no reference transcriptome.
Now, luckily, I have been able to obtain some 2x300 MiSeq data, approx. 20 million reads, generated from the HiSeq library. So I am asking how I can make the best use of these reads to improve my assembly.
My current strategy is the following:
1. Do a separate assembly of the MiSeq reads with Trinity and perhaps Velvet/Oases (VO) to compare with the HiSeq assemblies. (Side question: when reads get longer, should this affect the choice of k-mer values for an assembler like VO?)
2. Map the MiSeq reads to the HiSeq assembly, especially to confirm the transcripts I have identified as putative toxins from the HiSeq assembly. (NB: these are short, disulfide-rich peptides; the average length of the precursor structure is 100 AA.)
3. BLAST the raw MiSeq reads against a database of toxins, on the theory that some of these reads will be long enough to cover most, if not all, of some toxin transcripts.
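To make the last step concrete, here is a rough sketch of how I might score the raw-read BLAST hits by how much of each toxin they span. It assumes the reads were searched against the toxin proteins with tabular output in the default column order (`-outfmt 6`), so that `sstart` and `send` are positions on the toxin sequence. The function name, the read and toxin IDs, and the `subject_lengths` dictionary are all made up for illustration; the toxin lengths would have to come from the database FASTA.

```python
def best_subject_coverage(blast_lines, subject_lengths):
    """For each read, return (toxin_id, fraction of the toxin spanned)
    for the hit that spans the largest fraction of its subject.

    blast_lines: iterable of tab-separated lines in default outfmt 6 order:
        qseqid sseqid pident length mismatch gapopen
        qstart qend sstart send evalue bitscore
    subject_lengths: dict mapping toxin id -> toxin length (hypothetical input)
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        # Span of the alignment on the subject (toxin), in subject coordinates.
        span = abs(send - sstart) + 1
        frac = span / subject_lengths[sseqid]
        if qseqid not in best or frac > best[qseqid][1]:
            best[qseqid] = (sseqid, frac)
    return best


# Made-up example hits, only to show the expected input shape.
example_hits = [
    "read1\ttoxA\t95.0\t90\t4\t0\t10\t279\t5\t94\t1e-40\t180",
    "read1\ttoxB\t80.0\t40\t8\t0\t1\t120\t1\t40\t1e-10\t60",
    "read2\ttoxB\t99.0\t80\t1\t0\t1\t240\t1\t80\t1e-50\t200",
]
example_lengths = {"toxA": 100, "toxB": 80}

for read_id, (toxin_id, frac) in best_subject_coverage(example_hits, example_lengths).items():
    print(read_id, toxin_id, round(frac, 2))
```

Reads whose best hit spans close to the full toxin would be the ones I would then compare against the corresponding HiSeq contigs. This only looks at the single best span per read and ignores identity and e-value, so it is a first filter rather than a validation.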
I am wondering what other strategies might be useful to pursue, or whether the above already makes the most sense. For example, should I combine the HiSeq and MiSeq reads prior to assembly? I am open to any thoughts on how I can take advantage of the depth of the HiSeq data along with the read length of the MiSeq data to obtain a better assembly.