Hello all, first time posting here. I am looking at doing some comparative genomics with a mammalian transcriptome (no genome available). We have an Illumina GAII 76 bp paired-end library. I have assembled the transcriptome with Trinity, BLASTed it against RefSeq, and annotated it using Blast2GO. I am looking for suggestions (or the best tool) to take me through the next step: I would like to align this transcriptome to multiple species (using MUSCLE or PRANK). I am wondering whether I need to build a non-redundant consensus transcriptome for downstream analyses, or whether there is another way. Any advice would be very much appreciated. Thank you in advance.
We have addressed exactly that question in a study that will be published soon. You can find a preprint of the paper on the home page of our program, PAGAN, at http://code.google.com/p/pagan-msa/wiki/PAGAN?tm=6.
Our approach was to use existing reference alignments and trees (e.g. Ensembl GeneTrees), infer the sequence history for the reference alignments and then "insert" new sequences/fragments into the reference alignments by aligning them against the most similar target sequences. Importantly, the target sequences can be either extant sequences or ancestral sequences, the latter being inferred using a phylogeny-aware algorithm similar to that of PRANK.
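To make "aligning against the most similar target" a bit more concrete, here is a toy sketch in Python. It is not PAGAN's algorithm: it only ranks candidate targets (extant or inferred ancestral sequences) by the fraction of the fragment's k-mers they contain, as a crude stand-in for similarity. The function names, the choice of k and the example sequences are all made up for illustration.

```python
def kmers(seq, k=4):
    """Return the set of all k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def most_similar_target(fragment, targets, k=4):
    """Pick the target that contains the largest fraction of the fragment's k-mers.

    `targets` maps a name (extant species or ancestral node) to its ungapped
    sequence. Returns (name, score) with score between 0 and 1.
    """
    frag_kmers = kmers(fragment, k)
    if not frag_kmers:
        raise ValueError("fragment is shorter than k")
    best_name, best_score = None, -1.0
    # Sorted iteration makes ties resolve deterministically (alphabetically).
    for name, seq in sorted(targets.items()):
        score = len(frag_kmers & kmers(seq, k)) / len(frag_kmers)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Hypothetical reference: two extant sequences and one inferred ancestor.
targets = {
    "human": "ATGGCCAAGTTGACCGATCTGAAG",
    "mouse": "ATGGCTAAGTTAACCGACCTGAAG",
    "ancestor_1": "ATGGCCAAGTTAACCGATCTGAAG",
}
name, score = most_similar_target("AAGTTAACCGATCTG", targets)
print(name, score)
```

In this made-up example the fragment matches the ancestral sequence better than either extant one, which is the situation where having ancestral targets pays off.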
A big advantage of using reference alignments (rather than single reference sequences) is the additional phylogenetic information that comes from multiple sequences; this is especially helpful if the new sequences come from a species that has no close relatives with genome sequences available. An additional advantage of using alignments of gene families (such as Ensembl GeneTrees) is that one can separate fragments coming from close paralogues: in addition to aligning the fragments to the reference alignment, PAGAN can connect fragments placed on the same paralogue into longer contigs.
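The idea of connecting fragments that land on the same paralogue can be sketched roughly as below. This is only an illustration of the bookkeeping, assuming each fragment has already been placed and has start/end columns in the reference alignment; it is not how PAGAN does it internally, and the tuple layout and names are hypothetical.

```python
from collections import defaultdict


def connect_fragments(placements):
    """Group placed fragments by paralogue and chain overlapping ones.

    `placements` is a list of (fragment_id, paralogue, start, end) tuples with
    0-based, end-exclusive alignment-column coordinates.
    Returns {paralogue: [(start, end, [fragment_ids]), ...]}.
    """
    by_paralogue = defaultdict(list)
    for frag_id, paralogue, start, end in placements:
        by_paralogue[paralogue].append((start, end, frag_id))

    contigs = {}
    for paralogue, frags in by_paralogue.items():
        merged = []
        for start, end, frag_id in sorted(frags):
            if merged and start <= merged[-1][1]:
                # Overlaps or abuts the previous contig: extend it.
                prev_start, prev_end, ids = merged[-1]
                merged[-1] = (prev_start, max(prev_end, end), ids + [frag_id])
            else:
                # Gap before this fragment: start a new contig.
                merged.append((start, end, [frag_id]))
        contigs[paralogue] = merged
    return contigs


# Hypothetical placements on two paralogues of one gene family.
placements = [
    ("f1", "geneA", 0, 120),
    ("f2", "geneA", 100, 260),
    ("f3", "geneB", 50, 200),
    ("f4", "geneA", 400, 520),
]
contigs = connect_fragments(placements)
print(contigs)
```

Fragments f1 and f2 overlap on geneA and are chained, while f4 stays separate because nothing bridges the gap, and f3 is kept apart because it was placed on the other paralogue.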
In fact, when we started the project we had in mind the use of RNA-seq data for comparative evolutionary analyses. As a result, the first version of PAGAN assumed that one always knows which species the data come from and that the phylogenetic positions of the fragments are constrained. Often that is not the case, so we later implemented the functions needed to search for the optimal placement. This seems to work fine, and PAGAN can also be used for metagenomic studies of fairly large datasets.
In our paper we focus on sequence placement (or alignment extension) and show that PAGAN copes well with fragments of very different lengths and degrees of evolutionary divergence. We tested PAGAN with DNA and protein data. To some extent it also supports translated alignment. Please contact me if you would like to know more about that.
I think you probably want to do a 'mini' assembly: align your de novo contigs to the proteins of a closely related, well-annotated taxon, then assemble your contigs around each protein. This helps in two ways: it joins contigs that come from the same gene using the protein as a 'reference' (i.e. it deals with the fragmentation of the de novo assembly), and it collapses redundant contigs for the same gene, which is exactly what you want. There might be existing pipelines to do this, but you can always glue tools like blastx, CAP3, etc. together with Perl to build your own workflow.
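As a rough sketch of the first glue step, the snippet below groups contigs by their best blastx hit. It uses Python rather than Perl, and assumes the blastx results are in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). The function name, the e-value cutoff and the example IDs are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def group_contigs_by_protein(blast_tab, max_evalue=1e-10):
    """Assign each contig to its best-scoring protein hit and group by protein.

    `blast_tab` is any iterable of tab-separated lines in 12-column format.
    Returns {protein_id: sorted list of contig ids}.
    """
    best = {}  # contig -> (protein, bitscore)
    for row in csv.reader(blast_tab, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        contig, protein = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # hit too weak to trust
        if contig not in best or bitscore > best[contig][1]:
            best[contig] = (protein, bitscore)

    groups = defaultdict(list)
    for contig, (protein, _) in best.items():
        groups[protein].append(contig)
    return {protein: sorted(ids) for protein, ids in groups.items()}


# Hypothetical blastx hits for four Trinity contigs.
example = io.StringIO(
    "comp1_c0_seq1\tNP_0001\t95.0\t200\t10\t0\t1\t600\t1\t200\t1e-100\t400\n"
    "comp1_c0_seq2\tNP_0001\t93.0\t150\t10\t0\t1\t450\t180\t330\t1e-70\t290\n"
    "comp2_c0_seq1\tNP_0001\t60.0\t80\t32\t0\t1\t240\t5\t85\t1e-12\t70\n"
    "comp2_c0_seq1\tNP_0002\t98.0\t300\t6\t0\t1\t900\t1\t300\t1e-150\t610\n"
    "comp3_c0_seq1\tNP_0003\t40.0\t30\t18\t0\t1\t90\t1\t30\t0.5\t25\n"
)
groups = group_contigs_by_protein(example)
print(groups)
```

Each group of contigs could then be written to its own FASTA file and handed to an assembler such as CAP3, which both joins fragments of the same gene and collapses redundant contigs.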