How To Prove An Ortholog Does Not Exist In A Given Genome?
10.7 years ago
Nicojo ★ 1.1k

I am wondering what is the best way to prove that an ortholog is absent (or at least un-identifiable) from a given sequenced genome.

My initial thought goes to a tblastx search, where an absence of characteristic segments of the query sequence would indicate there is no identifiable ortholog.

Can you think of other (complementary or better) ways to show this?

orthologues sequence genome blast • 4.9k views

Generate some generic PCR primers! Bench science has its place at times :->


@Alastair: thanks for the suggestion! It is certainly one way to go in many cases. Unfortunately not in my case as the genes I'm interested in have never (even remotely) been found in any other species. So designing degenerate primers is even worse than looking for a needle in a haystack. Basically one of the referees on my manuscript wants proof they do not exist in my negative control data set...

10.7 years ago

In my view, there could be three different reasons why a normal sequence search, e.g. with BLASTP, against the annotated protein-coding genes in a genome would fail to identify an ortholog when there should be one:

  1. The region of the DNA within which the gene is located is missing in the genome assembly
  2. The gene finding pipeline failed to annotate the gene despite it being there
  3. The protein sequence has diverged too much for you to be able to identify the ortholog

To convincingly prove that a gene really does not have an ortholog in a given genome, you need to somehow justify that none of these three is likely to be the case.

To rule out point 1, you need to evaluate the genome assembly. Has the genome been assembled into complete chromosomes, or does it consist of a large number of contigs? In the former case, it is fairly unlikely that big chunks of DNA would be missing, but in the latter case you will have difficulty ruling out that the gene of interest simply was not sequenced.
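As a rough illustration of this assembly check, the sketch below computes the number of sequences, the total length, and the N50 from a list of contig or scaffold lengths. The function name `assembly_stats` and the example lengths are made up for illustration; real lengths would come from the assembly FASTA.

```python
def assembly_stats(lengths):
    """Return (number of sequences, total length, N50) for contig/scaffold lengths."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # N50: length of the sequence at which the cumulative sum
        # first reaches half of the total assembly size.
        if running * 2 >= total:
            return len(ordered), total, length


# Hypothetical lengths (bp) for a small, fragmented assembly.
count, total, n50 = assembly_stats([500, 400, 300, 200, 100])
print(count, total, n50)
```

Many sequences with a low N50 suggest a fragmented assembly, in which case a missing gene is weak evidence of true absence.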

To rule out point 2, you could do what you suggested yourself: run a translated BLAST search against the genomic DNA sequence (TBLASTN for a protein query, TBLASTX for a nucleotide query). If you do not find a good hit to your query sequence, it is fairly unlikely that the ortholog was missed because of errors in the gene finding process.
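A minimal sketch of how the result of such a search could be screened, assuming the search was run with BLAST+ tabular output (`-outfmt 6`, default columns). The function name, the thresholds (`max_evalue`, `min_align_len`), and the example lines are illustrative assumptions, not recommended values.

```python
def significant_hits(tabular_text, max_evalue=1e-5, min_align_len=30):
    """Filter BLAST tabular (-outfmt 6) lines by E-value and alignment length."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Default outfmt 6 columns: qseqid sseqid pident length mismatch
        # gapopen qstart qend sstart send evalue bitscore
        length = int(fields[3])
        evalue = float(fields[10])
        if evalue <= max_evalue and length >= min_align_len:
            hits.append((fields[0], fields[1], evalue, float(fields[11])))
    return hits


# Invented example output: one convincing hit and one short, weak hit.
example = "\n".join([
    "geneX\tcontig_12\t41.2\t180\t100\t3\t5\t184\t9001\t9540\t2e-30\t118",
    "geneX\tcontig_7\t28.0\t25\t18\t0\t60\t84\t300\t374\t4.1\t22.3",
])
print(significant_hits(example))
```

An empty list for every query, under thresholds you can justify, is the kind of negative result a referee can check.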

To rule out point 3, you need to show that the gene of interest does not mutate faster than other genes for which you are able to find an ortholog. You can do this by analyzing orthologs across a number of species and estimating mutation rates (or simply looking at percent identity). If your sequence does not appear to diverge faster than other sequences for which you managed to find orthologs, sequence divergence is unlikely to be a concern. As a side note, if divergence is an issue, you might want to look at gene synteny as a way to identify the ortholog.
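The "simply looking at percent identity" variant can be sketched as follows. It assumes you already have pairwise alignments (gaps written as `-`) between known orthologs in two reference species; the function names and the example values are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def fraction_less_conserved(gene_identity, background_identities):
    """Fraction of background genes with lower identity than the gene of interest."""
    lower = sum(1 for value in background_identities if value < gene_identity)
    return lower / len(background_identities)


# Toy alignment for the gene of interest, and invented background identities
# for genes whose orthologs were found in the target genome.
identity = percent_identity("MKT-LV", "MRTALV")
print(identity, fraction_less_conserved(identity, [95.0, 70.0, 60.0, 85.0]))
```

If a sizeable fraction of genes with detectable orthologs is less conserved than the gene of interest, fast divergence becomes a less plausible explanation for the missing hit.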

If all three reasons why an ortholog might not be found can be ruled out, you can reasonably argue that there appears to be no ortholog. Otherwise, you can only conclude that you could not identify one.


@Lars: thanks, very nice answer! Indeed, the completeness of the assembly is key to claiming the absence of an identifiable ortholog, and an important point to make. Otherwise, you seem to agree with me that BLAST would be the way to go... Can you think of any other tools or analyses that could be helpful?


My only other suggestions are what has already been suggested by others: 1) check various databases of orthologs/homologs (COG, eggNOG, TreeFam, OrthoMCL, HomoloGene), and 2) if you can find homologs but want to prove that these are not orthologs, you need to make a multiple sequence alignment (using MUSCLE, MAFFT, or T-Coffee) and use that as the basis for constructing a proper phylogenetic tree (e.g. using PhyML).


@Lars: Thanks, that's exactly what I will do! Cheers


Does that mean you will accept the answer? Hint, hint ;-)


I was waiting to see if some good alternative to blast showed up... Lucky you: it didn't ;)

10.7 years ago
Yannick Wurm ★ 2.3k

BLAST only works for well-conserved things. Let's say you couldn't find your gene by BLAST; maybe the next step is to try an HMM search: make a multiple alignment of the orthologs you know, create an HMM (or possibly download one that already exists in Pfam), and run HMMER (http://hmmer.janelia.org) against the 6-frame translation of your genome. Still nothing? Check ESTs. Check the unassembled reads.
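For reference, a 6-frame translation can be sketched in plain Python as below. This assumes the standard genetic code (NCBI table 1) and ignores introns, so it is only a starting point for eukaryotic genomes. The function names are illustrative; the resulting protein strings would still need to be written to FASTA before being searched with a profile HMM.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate DNA in frame 0; codons with ambiguous bases become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translation("ATGGCC")
print(frames["+1"], frames["-1"])
```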

Another idea: synteny is often conserved. Check if you find the genes to the left and right of where your ortholog should be. Then do the blastx and HMMER searches specifically in that region. (You reduce the search space and thus give the algorithms more "power" to detect something significant.) If your gene hasn't been "lost" for too long, you could find a degenerate version of it... which will help you prove the point that it has been functionally "lost".
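A minimal sketch of the synteny idea, assuming you have already located the orthologs of the two flanking genes in the target genome. The function name, the gene names, and the coordinates are hypothetical; positions are given as `(contig, start, end)`.

```python
def syntenic_window(flank_left, flank_right, target_positions):
    """Return (contig, start, end) of the region between two flanking genes
    in the target genome, or None if it cannot be defined."""
    left = target_positions.get(flank_left)
    right = target_positions.get(flank_right)
    # Both flanking orthologs must be found, and on the same contig.
    if left is None or right is None or left[0] != right[0]:
        return None
    # Order the two genes by start coordinate, whatever their orientation.
    first, second = sorted([left, right], key=lambda pos: pos[1])
    start, end = first[2], second[1]
    if start >= end:
        return None  # flanking genes overlap or abut: no room in between
    return first[0], start, end


# Hypothetical positions of the flanking orthologs in the target genome.
positions = {
    "geneA": ("chr2", 1000, 2500),
    "geneC": ("chr2", 9000, 11000),
}
print(syntenic_window("geneA", "geneC", positions))
```

The returned interval is the region to extract and search on its own with the translated BLAST and HMM searches.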


Some good ideas, however, the world is not quite so black and white. HMMs are not always better than BLAST. If you are trying to find orthologs between two genomes that are more closely related to each other than they are to anything else that has been sequenced, an HMM based on more distant homologs may perform considerably worse than a simple BLAST search. Also, a 6-frame translation of the genome is not a good replacement for BLASTX unless you are looking at bacterial genomes. BLASTX can stitch together multiple exons and hence may have much better power.


@Yannick: actually, we designed HMMs to recognize specific sub-groups of a protein family that is unique to two species. But one of the reviewers wants us to prove that this protein family is absent from other species... So this approach will not help me. But it may be useful to others.


@Yannick: regarding the Pfam HMM that exists for my protein family, it does not find anything in other species. The HMMs we've designed sort out the Pfam family into sub-groups. Our HMMs also pick up more sequences than the Pfam does.


@Yannick: regarding the synteny, this is an excellent approach indeed. Unfortunately, as mentioned above, my genes do not seem to have orthologues in other species. The surrounding genes don't either, so it is not possible to check for synteny. My problem is not "to find the genes" but to prove that they are nowhere to be found ;)


@Nicojo: sounds like a challenge to go further with what you currently have. Can you sequence a lane of HiSeq of a relative that may be half way in between your genome and the relative you've been looking through?


@Yannick: indeed, it is a challenge ;) Regarding the sequencing of closely related genomes, well, those are the ones I need to prove they're missing from! I left the question rather general without going into the details of my specifics so that the answers could help lots more people than just me... Could you edit your answer to add the sequencing of a close relative as a possibility? Could be a useful suggestion for others!

10.7 years ago

Good answers above. I can only add that things get a bit tricky in the situation of small (parts of) gene families. Let's say organism A has 3 very similar genes and organism B has 2. These will definitely be found by BLAST searches, but there may be confusion when applying the rules of orthologous pairs. Thus, matching the orthologs can be difficult, though not always. In this case (when the proteins are very similar and the next best hits are clearly not the orthologs), a good multiple alignment and phylogenetic analysis may help. It may turn out that these genes don't pass the ortholog rule of mutual best hit by sequence comparison such as BLAST. In that case, perhaps it is best to say that the group of 3 genes is orthologous to the group of 2. On the other hand, the phylogenetic analysis may tell you that there are one or two orthologous pairs in this organism A/organism B example.

Subsets of the cytochrome P450 super-family offer great examples of this.


@Larry: if I understand correctly you are saying that a certain number of the BLAST hits should be retrieved and aligned to the proteins of interest, distance matrix and tree built. From that you would make conclusions? Could you edit your answer to make your meaning clearer?


@Larry: sorry, I just reread my comment... it's not very clear. Let me know if you get what I mean or if you'd like me to clarify ;)

10.7 years ago
Dror ▴ 280

To verify that there is no ortholog, I would do the following:

First, note all the conserved features that you are looking for. Decide what you would expect to be in an ortholog. These criteria will be used for dismissing non-orthologs. Then:

  1. Run blastp with your sequence against the genome of the relevant organism. Then reciprocally check the best results against the original organism that you started with. If you get a lot of off-target results, this suggests that you can't find the ortholog.
  2. Next, run tblastx against all the resources, including ESTs, to make sure you can't find anything in the non-genomic databases.
  3. Check OrthoMCL: can you find an ortholog in their ortholog groups? Their database is pretty good for many proteins and many organisms.
  4. Can you relate this gene to other genes (in the same pathway, interacting partners, targets, etc.)? If you can, check whether you can find their orthologs. If the whole pathway is conserved, I would not so easily conclude that there is no ortholog.
  5. Finally, can you find orthologs in closely related species? If you can, it might be that the sequence annotation of your genome is not good enough and you can't find the ortholog due to a technical problem.
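The reciprocal check in step 1 can be sketched as below, assuming both searches have been reduced to `(query, subject, bitscore)` tuples. The function names and the example hits are hypothetical, and ranking by bit score is one common choice among several.

```python
def best_hits(hits):
    """Map each query to its best-scoring subject from (query, subject, bitscore) tuples."""
    best = {}
    for query, subject, bitscore in hits:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_hits, reverse_hits):
    """Return sorted (query, subject) pairs that are each other's best hit."""
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return sorted((q, s) for q, s in forward.items() if reverse.get(s) == q)


# Invented hits: organism A proteins vs organism B, and the reverse search.
a_vs_b = [("A1", "B1", 250.0), ("A1", "B2", 90.0), ("A2", "B1", 200.0)]
b_vs_a = [("B1", "A1", 248.0), ("B2", "A3", 300.0)]
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here A2 hits B1, but B1's best hit is A1, so A2 has no reciprocal partner. In small gene families this rule can fail even when orthologs exist, which is where a phylogenetic tree is the better arbiter.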

That said, I have been surprised several times to find orthologs missing, and I think that even very important genes can be lost. Good luck.


@Dror: thanks for your answer. BLAST seems to be a quite consensual answer. Adding the search for related genes can indeed prove useful in some cases (unfortunately not mine). OrthoMCL and similar tools can also prove useful (nothing found in my case though). Finally, of course your point #5 is a good control (although, again not in my situation).
