Question: De Novo Transcriptome And Comparative Genomics
7.3 years ago, Am wrote:

Hello all, first time posting here. I am looking at doing some comparative genomics with a mammalian transcriptome (no genome available). We have an Illumina GAII 76 bp paired-end library; I have assembled the transcriptome with Trinity, BLASTed it against RefSeq, and annotated it using Blast2GO. I am looking for suggestions (or the best tool) to take me through the next step. I would like to align this transcriptome (using MUSCLE or PRANK) to multiple species. I am wondering whether I need to make a non-redundant consensus transcriptome that can be used for downstream analyses, or whether there is another way. Any advice would be very much appreciated. Thank you in advance.

Tags: rna-seq, genomics, comparative

What do you mean by aligning to multiple species? Are these completely assembled transcriptomes? How many transcripts and reads do you have? That will influence which tools can be used.

Reply written 7.3 years ago by Joseph Hughes

Thanks for your response. I have ~16 million paired-end reads that have been assembled into ~82,000 contigs (N50 = 870). As for the completeness of the transcriptome, this was a de novo assembly, and I believe I have decent coverage. My end goal is to run PAML using the coding sequences of my newly assembled transcriptome, along with those of about 5 other species. To run PAML, the sequences for each species need to be aligned with a program such as MUSCLE or PRANK. My problem is that I may have multiple contigs for one coding sequence in my de novo assembled transcriptome, so alignment to the coding sequences of other species may get messy. I was thinking one way around this would be to assemble a consensus transcriptome with one contig per "gene" or coding sequence. But of course, if there is another route, I am open to suggestions.
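One simple way to get "one contig per gene" can be sketched as follows. This is only a rough illustration, not a recommended pipeline: it assumes the BLAST results against RefSeq are available in the standard 12-column tabular format (contig ID in column 1, subject ID in column 2, bit score in column 12) and that contig lengths are already known. The function names and the toy IDs are hypothetical. It keeps the best-scoring hit per contig, then the longest contig per subject.

```python
def best_hit_per_contig(blast_tab_lines):
    """Return {contig_id: (subject_id, bitscore)} keeping the top-scoring hit.

    Assumes standard 12-column BLAST tabular lines (tab-separated):
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for line in blast_tab_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        contig, subject, bitscore = fields[0], fields[1], float(fields[11])
        if contig not in best or bitscore > best[contig][1]:
            best[contig] = (subject, bitscore)
    return best


def representative_contigs(best_hits, contig_lengths):
    """Return {subject_id: contig_id}, choosing the longest contig per subject."""
    representative = {}
    for contig, (subject, _score) in best_hits.items():
        current = representative.get(subject)
        if current is None or contig_lengths[contig] > contig_lengths[current]:
            representative[subject] = contig
    return representative


# Toy input with made-up IDs: c1 and c2 both hit NP_1, c3 hits NP_3.
example_lines = [
    "c1\tNP_1\t90.0\t100\t5\t0\t1\t300\t1\t100\t1e-50\t200",
    "c1\tNP_2\t80.0\t100\t5\t0\t1\t300\t1\t100\t1e-30\t120",
    "c2\tNP_1\t95.0\t200\t5\t0\t1\t600\t1\t200\t1e-90\t400",
    "c3\tNP_3\t85.0\t150\t5\t0\t1\t450\t1\t150\t1e-40\t150",
]
example_lengths = {"c1": 350, "c2": 700, "c3": 500}
example_reps = representative_contigs(
    best_hit_per_contig(example_lines), example_lengths
)
```

Picking the longest contig throws away fragments that cover other parts of the same gene, so it trades completeness for simplicity.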

Reply written 7.3 years ago by Am
7.3 years ago, Ari wrote:

We have addressed exactly that question in a study that will be published soon. You can find a preprint of the paper on our program's (called PAGAN) home page at

Our approach was to use existing reference alignments and trees (e.g. Ensembl GeneTrees), infer the sequence history for the reference alignments and then "insert" new sequences/fragments into the reference alignments by aligning them against the most similar target sequences. Importantly, the target sequences can be either extant sequences or ancestral sequences, the latter being inferred using a phylogeny-aware algorithm similar to that of PRANK.

A big advantage of using reference alignments (and not single reference sequences) is the additional phylogenetic information coming from multiple sequences; this is especially helpful if the new sequences come from a species that has no close relatives with genome sequences available. An additional advantage of using alignments of gene families (such as Ensembl GeneTrees) is that one can separate fragments coming from close paralogues: in addition to aligning the fragments to the reference alignment, PAGAN can join fragments placed on the same paralogue into longer contigs.

In fact, when starting the project we had in mind the use of RNA-seq data for comparative evolutionary analyses. As a result, the first version of PAGAN assumed that one always knows which species the data come from and that the phylogenetic positions of the fragments are constrained. Often that is not the case, and we later implemented the necessary functions to search for the optimal placement. This seems to work fine, and PAGAN can also be used for metagenomic studies of fairly large datasets.

In our paper we focus on sequence placement (or alignment extension) and show that PAGAN copes well with fragments of very different lengths and degrees of evolutionary divergence. We tested PAGAN with DNA and protein data. To some extent it also supports translated alignment. Please contact me if you are interested in knowing more about that.


Thank you, this is EXACTLY what I am looking for. I would love to know more about it. A few quick questions after browsing the home page. Is there an option to use a fasta file (already assembled into contigs), or is it best to use fastq sequencing reads? Also, I actually have two de novo transcriptomes from different but closely related mammalian species. Is it still possible to use PAGAN? Maybe run one, and then run the other?

Thanks so much in advance.

Reply written 7.3 years ago by Am

(1) Yes, it makes sense to first assemble the reads and then align the fasta-formatted contigs. (2) It's fine to analyse two species in one go. If they are placed at different phylogenetic positions, it is no different from analysing them separately; even if they are sister species, the graph structure should be able to capture the differences between them.

Reply written 7.3 years ago by Ari
7.3 years ago, Vitis (New York) wrote:

I think you probably want to do a 'mini' assembly: align your de novo contigs to the proteins of a closely related, well-annotated taxon, then put your contigs together around each protein. This would help in two ways: it assembles the contigs that come from the same gene using the protein as a 'reference', i.e. it deals with the fragmentation of the de novo assembly, and it collapses the redundant contigs for the same gene, which is exactly what you want. There might be existing pipelines to do this, but you can always glue tools like BLASTX, CAP3, etc. together with Perl and have your own workflow.
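The grouping step of such a workflow might look roughly like the sketch below. It is not an existing pipeline. It assumes the contigs were already aligned to the reference proteins (for example with BLASTX) and that, for one protein, each hit has been reduced to a contig ID plus start and end coordinates on the protein, with start <= end. The function name, the `min_overlap` parameter, and the toy IDs are hypothetical. Contigs whose hits overlap on the protein end up in one cluster (candidates for merging or collapsing, e.g. with CAP3), while separate clusters are non-overlapping fragments of the same gene.

```python
def cluster_contigs_on_protein(hits, min_overlap=10):
    """Group contigs whose alignments overlap on one reference protein.

    hits: iterable of (contig_id, protein_start, protein_end) tuples,
          1-based inclusive protein coordinates with start <= end.
    min_overlap: minimum shared residues needed to join the current cluster
                 (an arbitrary illustrative threshold).
    Returns a list of clusters (lists of contig IDs), ordered along the protein.
    """
    ordered = sorted(hits, key=lambda h: (h[1], h[2]))
    clusters = []
    cluster_end = 0  # right-most protein position covered by the current cluster
    for contig, start, end in ordered:
        overlap = cluster_end - start + 1
        if clusters and overlap >= min_overlap:
            clusters[-1].append(contig)
            cluster_end = max(cluster_end, end)
        else:
            clusters.append([contig])
            cluster_end = end
    return clusters


# Toy hits for a single protein (made-up contig IDs and coordinates).
example_hits = [
    ("c3", 300, 400),
    ("c1", 1, 120),
    ("c4", 310, 390),
    ("c2", 100, 250),
]
example_clusters = cluster_contigs_on_protein(example_hits)
```

Each multi-contig cluster could then be passed to an overlap assembler, and the resulting pieces ordered by their position on the protein.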
