Question: How Can I Identify Orthologous Contigs Between Two De Novo Transcriptome Assemblies?
9.2 years ago by
Ryan Thompson3.5k
TSRI, La Jolla, CA
Ryan Thompson3.5k wrote:

I am doing de novo transcriptome assembly of RNA-Seq data from two closely-related diploid species (mammals) for the purpose of identifying genetic variations between the two species. In order to do this, I suppose I need to identify pairs of ortholog transcripts between the two assemblies, so that I can compare them. What is the best way to do this? Should I simply do all pairwise alignments and pick out the pairs that are best matches to each other? Are there tools available for this already?

Additionally, how does the presence of heterozygous SNPs affect the strategy? I am using Trinity for the transcriptome assembly, and my understanding is that when a transcript has a heterozygous SNP, Trinity ends up reporting two complete contigs that are identical except for the SNP. For example, if the transcript is "TTTTTTTTTT" and there is a heterozygous A/T at position 6, then Trinity would report "TTTTTTTTTT" and "TTTTTATTTT". This could complicate identifying ortholog pairs with the "mutual best match" strategy described above.
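
To make the concern concrete, here is a minimal sketch of how such near-identical contig pairs could be flagged before any ortholog search. It is not part of Trinity or any other tool; the function names and the `max_mismatches` parameter are made up for illustration, and it only compares contigs of equal length, which real allelic contigs will not always be. The all-pairs comparison is also only practical for small sets.

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def candidate_allele_pairs(contigs, max_mismatches=1):
    """Return (id1, id2, mismatches) for equal-length contigs that differ
    at between 1 and max_mismatches positions (hypothetical helper)."""
    pairs = []
    for (id1, s1), (id2, s2) in combinations(sorted(contigs.items()), 2):
        if len(s1) != len(s2):
            continue  # this simple sketch ignores length differences
        d = hamming(s1, s2)
        if 0 < d <= max_mismatches:
            pairs.append((id1, id2, d))
    return pairs


# The two contigs from the example above, plus an unrelated one.
contigs = {"c1": "TTTTTTTTTT", "c2": "TTTTTATTTT", "c3": "GGGGGCCCCC"}
print(candidate_allele_pairs(contigs))
```

Both members of a flagged pair would score almost identically against the other species, which is exactly what can break a strict mutual-best-match rule.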

9.2 years ago by
Damian Kao15k
Damian Kao15k wrote:

OrthoMCL is a popular package used to find orthologous groups.

9.2 years ago by
Boston, MA USA
Larry_Parnell16k wrote:

I am not familiar with Trinity, but would suggest looking into relaxing its stringency for separating the A and T transcripts in your example so that these are reported as alleles of one transcript. That's one goal of the output, right?

If that cannot be done, then you need to identify one member of each heterozygous transcript pair and remove it from the ortholog-finding step: set one transcript aside, labeled as an alternate allele/genotype/haplotype of the given gene/transcript, and use the other allele as the query in the ortholog search. The ortholog search itself can be done with BLAST via mutual best hit, or with other tools such as OrthoMCL. If both organisms have (reference) genome sequences, then these ortholog assignments have likely been calculated already.
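
As a rough illustration of the mutual-best-hit idea mentioned above, the sketch below takes two BLAST result tables in the default tabular layout (`-outfmt 6`: query ID in column 1, subject ID in column 2, bit score in column 12) and keeps the pairs that are each other's top hit. The function names and the toy data are assumptions for the example; it ranks by bit score only, keeps the first hit on ties, and applies no e-value or coverage filter, all of which you would probably want to tune.

```python
import csv
import io


def best_hits(blast_tab):
    """Map each query ID to its highest-bit-score subject ID.

    blast_tab is the text of a BLAST tabular (-outfmt 6) report.
    On ties, the first hit encountered is kept.
    """
    best = {}
    for row in csv.reader(io.StringIO(blast_tab), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a_id, b_id) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


def fake_row(query, subject, bitscore):
    """Build one toy outfmt-6 line; only the IDs and bit score matter here."""
    fields = [query, subject, "99.0", "100", "1", "0",
              "1", "100", "1", "100", "1e-50", str(bitscore)]
    return "\t".join(fields)


a_vs_b = "\n".join([fake_row("A1", "B1", 200), fake_row("A1", "B2", 150),
                    fake_row("A2", "B2", 180), fake_row("A3", "B1", 90)])
b_vs_a = "\n".join([fake_row("B1", "A1", 200), fake_row("B2", "A2", 180),
                    fake_row("B2", "A1", 150)])

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In the toy data, A3's best hit is B1, but B1's best hit is A1, so A3 is left without a partner.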

You will also need to develop a strategy to deal with the transcripts that have no orthologous match. Are these unique to one species? Is there really an ortholog that is simply not expressed or not detected in your data? Will you have situations of gene A in species A being orthologous to genes A1 and A2 in species B? In many of these cases, you may still be able to detect variants in the transcripts with the 1:1 ortholog relationship.
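
One possible way to sort contigs into these cases, sketched under the assumption that you already have a filtered list of cross-species hit pairs: link contigs that hit each other, take the connected groups, and label each group by how many contigs it has on each side. This is a crude stand-in for what dedicated tools such as OrthoMCL do far more carefully, and the function name and labels are invented for the example.

```python
def classify_relationships(ids_a, ids_b, hit_pairs):
    """Group contigs linked by hits and label each group.

    ids_a, ids_b: contig IDs for species A and species B.
    hit_pairs: (a_id, b_id) pairs that passed your chosen hit filter.
    Returns a list of (label, a_members, b_members) tuples.
    """
    # Build an undirected graph; nodes are tagged by species so that
    # identical IDs in the two assemblies cannot collide.
    graph = {}
    for a, b in hit_pairs:
        graph.setdefault(("A", a), set()).add(("B", b))
        graph.setdefault(("B", b), set()).add(("A", a))

    seen = set()
    groups = []
    nodes = [("A", a) for a in ids_a] + [("B", b) for b in ids_b]
    for start in nodes:
        if start in seen:
            continue
        # Depth-first walk to collect everything connected to `start`.
        stack, members = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            members.add(node)
            for nxt in graph.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        a_members = sorted(n for side, n in members if side == "A")
        b_members = sorted(n for side, n in members if side == "B")
        if not a_members or not b_members:
            label = "unmatched"
        elif len(a_members) == 1 and len(b_members) == 1:
            label = "one-to-one"
        elif len(a_members) == 1:
            label = "one-to-many"
        elif len(b_members) == 1:
            label = "many-to-one"
        else:
            label = "many-to-many"
        groups.append((label, a_members, b_members))
    return groups


ids_a = ["geneA", "geneC", "geneD"]
ids_b = ["geneA1", "geneA2", "geneC1"]
hits = [("geneA", "geneA1"), ("geneA", "geneA2"), ("geneC", "geneC1")]
for group in classify_relationships(ids_a, ids_b, hits):
    print(group)
```

Only the groups labeled one-to-one would go straight into variant comparison; the others need a decision of the kind described above.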

9.2 years ago by
New York
Vitis2.4k wrote:

In terms of the heterozygosity, I think it's a much more complicated problem than orthologous gene identification. With genomic sequencing, you can estimate the level of heterozygosity by checking the k-mer frequency distribution; this is well implemented in Quake. But a transcriptome has inherently uneven coverage, which can span a very wide range, so this method does not really work for transcriptomes. In this situation, the best solution I can think of is to use contig assembly software such as Phrap or CAP3 to assemble the de novo contigs, in the hope that the mismatches allowed there will capture the heterozygosity. And if your two species are close enough evolutionarily, you may just do the de novo assembly for one, followed by contig assembly or even reference-proteome-based improvement, and then map the reads from the other organism onto the first one.
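
For readers unfamiliar with the k-mer frequency idea, here is a minimal sketch of how such a distribution is computed. It is not how Quake is implemented; the function names are made up, reverse complements are not merged, and everything is held in memory, so it only suits toy input. With even genomic coverage, k-mers covering heterozygous sites tend to pile up at roughly half the main coverage peak; with transcriptome data the expression-driven spread in coverage blurs that signal, which is the limitation described above.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer in the reads, skipping k-mers that contain N."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def kmer_spectrum(reads, k):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    multiplicities = Counter(kmer_counts(reads, k).values())
    return dict(sorted(multiplicities.items()))


# Two toy reads that differ at one position.
reads = ["ACGTAC", "ACGTTC"]
print(kmer_spectrum(reads, 3))
```

In the toy input, the k-mers shared by both reads are seen twice, and the k-mers that overlap the differing position are seen once.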


Phrap and cap3 are good choices to assemble contigs. You would not, in my mind, want to mix reads from different species for mRNA assembly and ortholog identification.

written 9.2 years ago by Larry_Parnell16k

I didn't make myself clear. What I meant was to use de novo assembly and Phrap/CAP3 to assemble just ONE of them, making a solid reference transcriptome. Then, if the other one is close enough, maybe it's feasible to map its reads using the first one as the reference.

written 9.2 years ago by Vitis2.4k

Yes, a much clearer approach. I could agree to give that a try.

written 9.2 years ago by Larry_Parnell16k