Question

How Can I Identify Orthologous Contigs Between Two De Novo Transcriptome Assemblies?

4

12.9 years ago

Ryan Thompson ★ 3.6k

I am doing de novo transcriptome assembly of RNA-Seq data from two closely-related diploid species (mammals) for the purpose of identifying genetic variations between the two species. In order to do this, I suppose I need to identify pairs of ortholog transcripts between the two assemblies, so that I can compare them. What is the best way to do this? Should I simply do all pairwise alignments and pick out the pairs that are best matches to each other? Are there tools available for this already?

Additionally, how does the presence of heterozygous SNPs affect the strategy? I am using Trinity for the transcriptome assembly, and my understanding is that when a transcript has a heterozygous SNP, Trinity will end up reporting two complete contigs that are identical except for the SNP. For example, if the transcript is "TTTTTTTTTT" and there is a heterozygous A/T at position 6, then Trinity would report "TTTTTTTTTT" and "TTTTTATTTT". This could complicate the identification of ortholog pairs by the "mutual best match" strategy described above.

transcriptome assembly denovo trinity orthologues • 5.7k views

updated 12.9 years ago by Vitis ★ 2.5k • written 12.9 years ago by Ryan Thompson ★ 3.6k

score 1 · Answer 1 · 2011-12-06

1

12.9 years ago

Damian Kao 16k

OrthoMCL is a popular package used to find orthologous groups.

score 1 · Answer 2 · 2011-12-06

I am not familiar with Trinity, but would suggest looking into relaxing its stringency for separating the A and T transcripts in your example so that these are reported as alleles of one transcript. That's one goal of the output, right?

If that cannot be done, then you need to be able to identify one member of the heterozygous transcript pair as such and remove it from the orthologous gene-finding step. You can do this with BLAST via mutual best hit, or with other tools such as OrthoMCL. If both organisms have (reference) genome sequence, then these assignments likely have been calculated already. Thus, when there is a heterozygous genotype, one transcript is set aside but labeled as an alternate allele/genotype/haplotype for a given gene/transcript and the other allele is used as query in the ortholog search.
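As a rough illustration of the mutual best hit idea, here is a minimal Python sketch. It assumes you have already run BLAST in both directions (assembly A against B, and B against A) with the default tabular output (`-outfmt 6`), where column 1 is the query, column 2 the subject, and column 12 the bit score. The function names and the toy contig IDs are made up for the example, and a real analysis would probably also want e-value and alignment-coverage filters.

```python
def parse_blast_tabular(lines):
    """Yield (query, subject, bitscore) from BLAST -outfmt 6 lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        yield fields[0], fields[1], float(fields[11])


def best_hits(rows):
    """Return {query: subject}, keeping the highest-bitscore hit per query."""
    best = {}
    for query, subject, bitscore in rows:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy hits as (query, subject, bitscore); contig names are hypothetical.
a_vs_b = [
    ("A_c1", "B_c7", 950.0),
    ("A_c1", "B_c9", 400.0),
    ("A_c2", "B_c9", 620.0),
    ("A_c3", "B_c7", 300.0),  # best hit is B_c7, but B_c7 prefers A_c1
]
b_vs_a = [
    ("B_c7", "A_c1", 948.0),
    ("B_c9", "A_c2", 615.0),
    ("B_c9", "A_c1", 398.0),
]

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

If one allele of each heterozygous pair has been set aside beforehand, as described above, the same function can be run on the reduced contig sets without changes.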

You will also need to develop a strategy to deal with the transcripts that have no orthologous match. Are these unique to one species? Is there really an ortholog that is simply not expressed or not detected in your data? Will you have situations where gene A in species A is orthologous to genes A1 and A2 in species B? In many of these cases, you may still be able to detect variants in the transcripts that do have a 1:1 ortholog relationship.
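To make those cases concrete, here is a small Python sketch that sorts the contigs of one assembly into "no match", 1:1, one-to-many (the gene A versus A1/A2 situation) and many-to-one. It assumes you already have a list of accepted (contig in species A, contig in species B) pairs from whatever ortholog search you used; the function name, labels, and toy IDs are invented for illustration.

```python
from collections import defaultdict


def classify_relationships(hit_pairs, all_a):
    """Label each species-A contig by the kind of match it has in species B."""
    a_to_b = defaultdict(set)
    b_to_a = defaultdict(set)
    for a, b in hit_pairs:
        a_to_b[a].add(b)
        b_to_a[b].add(a)

    labels = {}
    for a in all_a:
        partners = a_to_b.get(a, set())
        if not partners:
            labels[a] = "no_match"
        elif len(partners) > 1:
            labels[a] = "one_to_many"
        elif len(b_to_a[next(iter(partners))]) > 1:
            # The single partner is also matched by other species-A contigs.
            labels[a] = "many_to_one"
        else:
            labels[a] = "one_to_one"
    return labels


# Toy data: gene A matches A1 and A2; C and D both match C1; E has no match.
hit_pairs = [("A", "A1"), ("A", "A2"), ("B", "B1"), ("C", "C1"), ("D", "C1")]
labels = classify_relationships(hit_pairs, ["A", "B", "C", "D", "E"])
print(labels)
```

Only the contigs labeled `one_to_one` would go straight into variant calling; the other classes need a decision on your part.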

score 1 · Answer 3 · 2011-12-06

1

12.9 years ago

Vitis ★ 2.5k

In terms of heterozygosity, I think it's a much more complicated problem than orthologous gene identification. For genomic sequencing, you can gauge the level of heterozygosity by checking the k-mer frequency distribution. This is well implemented in Quake. But a transcriptome has inherently uneven coverage, which can span a very wide range, so this method does not really work for transcriptomes. In this situation, the best solution I can think of is to use contig assembly software such as Phrap or cap3 to assemble the de novo contigs, in the hope that the mismatches allowed there will capture the heterozygosity. And if your two species are close enough evolutionarily, you could do de novo assembly for just one of them, followed by contig assembly or even reference-proteome-based improvement, and then map the reads from the other organism onto the first one.
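For anyone unfamiliar with the k-mer frequency idea, here is a toy Python sketch of what a k-mer spectrum is. This is not how Quake does it, only an illustration; the function names and the reads are made up. With evenly covered genomic reads, k-mers spanning a heterozygous site tend to show up at roughly half the multiplicity of the rest, which is the signal that gets lost when coverage varies as widely as it does in RNA-Seq.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(reads, k):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(kmer_counts(reads, k).values())


# Toy reads: two identical reads plus one carrying a single-base difference.
reads = ["ACGTAC", "ACGTAC", "ACGAAC"]
spectrum = kmer_spectrum(reads, 4)
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

On real data you would plot this histogram and look at the peaks rather than print it.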

0

Phrap and cap3 are good choices to assemble contigs. You would not, in my mind, want to mix reads from different species for mRNA assembly and ortholog identification.

12.9 years ago by Larry_Parnell 16k

0

I didn't make myself clear. What I meant was to use de novo assembly and phrap/cap3 to assemble just ONE of them, making a solid reference transcriptome. Then, if the other one is close enough, it may be feasible to map its reads using the first one as the reference.