Question: Strategies for identifying as many homologs as possible
9 months ago, michael0 wrote:

I am working on a follow-up to a study done in 2006. In it, the authors attempted to identify "all homologs" of a gene (using blastp and tblastn) and they looked at its likely evolution. (The gene is very highly conserved across species). There were a number of interesting questions that the authors could not answer due to a lack of available sequences, which is the gap I am now trying to fill.

I started out with the goal of identifying as many homologs as I could find, which proved to be trivial with blastp. So far, using blastp alone, I have found 30% of the homologs the original paper found (as well as many more that they did not find in 2006). However, I am having worse luck with tblastn:

When I run tblastn against the representative genome database, I often hit a CPU limit error. When I do get results, they are problematic too. For example, the only human hit is the entire chromosome the gene is located on rather than the gene itself. This is a problem for alignment and tree construction, for two reasons:

  1. I need to somehow reduce the total chromosome sequence to just the sequence of the gene, and
  2. I have no idea how to deal with splice sites. For human this is known, but that is not necessarily the case for other homologs. Any ideas on how to solve this?

In the original paper the authors searched just about every database they could get their hands on, one of them being an EST database. I have tried using tblastn against the EST database at NCBI, which comes with a different set of difficulties:

  1. I get a lot of hits from the same organism (many of these are from different tissues). How can I make sure that only one hit per organism is returned, or otherwise exclude duplicates?
  2. One of my fears is that some of these results might not be completely assembled. Should I attempt to assemble them by hand (the original authors' approach)? Are there any commonly used software tools for this?
  3. Finally, the organism a hit belongs to is not consistently annotated (unlike in the protein nr database, where the organism name is in square brackets). Is there a way to download the sequences (ideally into Excel) with a separate column just for the organism name?

Finally, I am not clear on how people usually deal with multiple sequence alignments of nucleotide and amino acid sequences. Is it standard practice to translate nucleotides into AA before doing an alignment? This would require mRNA data, otherwise, how do people deal with splice sites?

I'm looking for ideas on how people usually deal with these situations, as well as any software tools that may help (e.g. I know that BLAST+ may solve my CPU limit problem, but I have not had the time to learn it yet and I'm not sure it is necessary at this point).

Thank you for your help!


I’d maybe try using profile HMMs for searching if you’re interested in finding the more remote homologs too

Reply written 9 months ago by jrj.healey
9 months ago, Michael Dondrup (Bergen, Norway) wrote:

Hi,

First, the number of sequences in the databases has increased drastically since 2006, so I wouldn't be surprised to find more and different results. I am surprised, though, that blastp recovered only about 30% of the hits the 2006 study found in total. If it is a highly conserved protein, it should be very well annotated in current genomes too. Also, the original article should contain all accessions used, so you could simply get the orthologs by accession. Could you tell us which study and gene you are working with?

Also, I don't think you need to be concerned about missing a few orthologs here and there. I'd rather have a set of manually curated orthologs and invest additional search effort in the underrepresented taxa.

Some ideas:

When I use tblastn against the representative genome database I often encounter a CPU limit error message.

  • Limit the search to some taxa, at least exclude bacteria (I assume your gene is present in eukaryotes because of splicing).
  • Definitely worth trying this query using local blast. You should be able to download the Blast database that is used by NCBI web-blast as well.
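As a rough sketch of what the local search could look like, the snippet below only assembles a tblastn command line; it does not run it. The file name `query_protein.fasta`, the database name `ref_euk_rep_genomes`, the e-value cutoff and the taxid 4751 (Fungi) are placeholders chosen for illustration, not values from this thread. The `-taxids` option assumes a recent BLAST+ with a version 5 database; depending on the BLAST+ version, a higher-level taxid may first need to be expanded to species-level IDs.

```python
import shlex

def build_tblastn_command(query_fasta, database, taxids=None, threads=4):
    """Return the argument list for a local tblastn run (nothing is executed)."""
    # Tabular output with explicit columns; staxids/ssciname carry the taxonomy.
    outfmt = ("6 qseqid sseqid pident length qlen sstart send "
              "evalue bitscore staxids ssciname")
    cmd = [
        "tblastn",
        "-query", query_fasta,
        "-db", database,
        "-evalue", "1e-5",          # illustrative cutoff, adjust to your gene
        "-outfmt", outfmt,
        "-num_threads", str(threads),
    ]
    if taxids:
        # Restrict the search to the given NCBI taxonomy IDs.
        cmd += ["-taxids", ",".join(str(t) for t in taxids)]
    return cmd

# Placeholder inputs: a protein query searched against a eukaryote-only database.
cmd = build_tblastn_command("query_protein.fasta", "ref_euk_rep_genomes", taxids=[4751])
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once BLAST+ and the database are installed locally.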

somehow reduce the total chromosome sequence to just the sequence of the gene, and I have no idea how to deal with splice sites. For human this is known, but that is not necessarily the case for other homologs. Any ideas on how to solve this?

  • Use the program exonerate to generate the spliced mRNA or the translated AA sequence from the genomic hit, using a good template (e.g. a well-annotated protein from a related species). This should work OK for sequences with high identity.
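Before handing a whole chromosome to a spliced aligner such as exonerate (for example with its protein2genome model), it can help to cut the subject down to a window around the tblastn HSPs. The sketch below is one way to do that, assuming all HSPs passed in belong to a single locus; the function names, the flank size and the coordinates are made up for illustration.

```python
def hit_window(hsps, subject_length, flank=5000):
    """Return a 1-based inclusive (start, end) window covering all HSPs plus flanks.

    hsps: list of (sstart, send) pairs as reported by tblastn;
    send is smaller than sstart for hits on the minus strand.
    """
    lows = [min(s, e) for s, e in hsps]
    highs = [max(s, e) for s, e in hsps]
    start = max(1, min(lows) - flank)                 # clamp at sequence start
    end = min(subject_length, max(highs) + flank)     # clamp at sequence end
    return start, end

def extract(sequence, start, end):
    """Slice a sequence string using 1-based inclusive coordinates."""
    return sequence[start - 1:end]

# Hypothetical HSPs: one on the plus strand, one reported in minus orientation.
window = hit_window([(12000, 12300), (15050, 14800)], subject_length=20000, flank=1000)
print(window)
```

If a chromosome carries several paralogs, the HSPs would need to be grouped by locus first, otherwise the window can span an unreasonably large region.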

In the original paper the authors searched just about every database they could get their hands on, one of them being an EST database.

  • ESTs are a somewhat dated concept now that we have NGS. Could you search transcriptome assemblies instead?

I get a lot of hits from the same organism (many of these are from different tissues), how can I make sure that only one hit per organism is returned/exclude duplicates?

  • Retrieve the taxon IDs together with your BLAST results, then sort by taxid and score so that you keep only the top-scoring hit(s) per taxid.
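A minimal sketch of that deduplication step, assuming tab-separated BLAST output whose columns are in the order given by `COLUMNS` (that column order and the example rows are assumptions for illustration, not real search results):

```python
import csv
import io

# Assumed column order of the tabular BLAST output.
COLUMNS = ["qseqid", "sseqid", "pident", "length",
           "evalue", "bitscore", "staxids", "ssciname"]

def best_hit_per_taxid(tabular_text):
    """Keep only the highest-bitscore row for each taxid."""
    best = {}
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        score = float(row["bitscore"])
        # staxids may list several IDs separated by ';'; use the first as key.
        taxid = row["staxids"].split(";")[0]
        if taxid not in best or score > float(best[taxid]["bitscore"]):
            best[taxid] = row
    return best

# Made-up example rows: two human ESTs and one mouse EST.
example = (
    "q1\tEST_A\t90.0\t200\t1e-50\t350\t9606\tHomo sapiens\n"
    "q1\tEST_B\t95.0\t210\t1e-60\t400\t9606\tHomo sapiens\n"
    "q1\tEST_C\t70.0\t180\t1e-30\t220\t10090\tMus musculus\n"
)
best = best_hit_per_taxid(example)
for taxid, row in sorted(best.items()):
    print(taxid, row["ssciname"], row["sseqid"], row["bitscore"])
```

Keeping a single best hit per taxid also discards paralogs within a species, so if gene copy number matters, a score threshold per taxid may be a better fit than a strict top-1 rule.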

One of my fears is that some of these results might not be completely assembled. Should I attempt to do this by hand (the original authors approach)? are there any commonly used software tools for this?

  • You could require a minimum length for the sequence (e.g. 80% of the full length). However, if coverage is limited for some taxa, I would rather accept a fragmented sequence than none.
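One way to encode this rule is sketched below: sequences reaching a minimum fraction of the reference length are kept, and a taxon with no such sequence falls back to its longest fragment. The 0.8 threshold comes from the suggestion above; the function name, the input layout and the example values are hypothetical.

```python
def select_by_length(candidates, reference_length, min_fraction=0.8):
    """Pick sequences per taxon by length.

    candidates: list of (taxon, seq_id, length) tuples.
    Returns a dict mapping each taxon to a list of selected seq_ids.
    """
    threshold = min_fraction * reference_length
    by_taxon = {}
    for taxon, seq_id, length in candidates:
        by_taxon.setdefault(taxon, []).append((length, seq_id))

    selected = {}
    for taxon, entries in by_taxon.items():
        full = sorted(seq_id for length, seq_id in entries if length >= threshold)
        if full:
            selected[taxon] = full
        else:
            # No near-full-length sequence: keep the longest fragment instead.
            selected[taxon] = [max(entries)[1]]
    return selected

# Made-up candidates against a 320-residue reference protein.
candidates = [
    ("taxon_A", "a1", 300), ("taxon_A", "a2", 100),
    ("taxon_B", "b1", 120), ("taxon_B", "b2", 90),
]
selected = select_by_length(candidates, reference_length=320)
print(selected)
```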

Finally, the organism a hit belongs to is not consistently annotated (as is the case with the protein nr database - where the organism name is in square brackets). Is there a way to download the sequences (ideally into excel) with a separate column just for the organism name?

  • That is solved by the taxids when you run local BLAST and define an output format. You may need to download the NCBI taxonomy database (taxdb) so that organism names can be reported.
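Once each hit carries a taxid and a scientific name, getting an Excel-friendly table with a dedicated organism column is mostly a matter of writing CSV. The sketch below assumes each hit is a dict with the keys `ssciname`, `staxids`, `sseqid` and `bitscore`; those keys and the example rows are assumptions for illustration.

```python
import csv
import io

def hits_to_csv(rows, handle):
    """Write hits as CSV with the organism name in its own column."""
    writer = csv.writer(handle)
    writer.writerow(["organism", "taxid", "accession", "bitscore"])
    for row in rows:
        writer.writerow([row["ssciname"], row["staxids"],
                         row["sseqid"], row["bitscore"]])

# Made-up hits, e.g. as parsed from tabular BLAST output.
rows = [
    {"ssciname": "Homo sapiens", "staxids": "9606", "sseqid": "EST_B", "bitscore": "400"},
    {"ssciname": "Mus musculus", "staxids": "10090", "sseqid": "EST_C", "bitscore": "220"},
]
buffer = io.StringIO()
hits_to_csv(rows, buffer)
print(buffer.getvalue())
```

To produce a file instead of an in-memory string, pass a handle opened with `open("hits.csv", "w", newline="")`; the resulting file opens directly in Excel.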

All of this is best solved with some scripting. Please let me know if this answer helped.


Hi Michael, thanks so much for your detailed answer!

I agree that I do not need to be too worried about missing a few orthologs here and there. That being said, I would like to have a table stating how many species within different subgroups have a copy of this gene (e.g. 28 Fungi have a homolog). The fact that my search is only finding around 30% of the homologs originally found makes me think that I've missed a lot of homologs in other species. I've also checked the accession numbers listed in the original paper. Many of the ones my search did not turn up are nucleotide accession numbers where the protein does not seem to have been identified (hence it makes sense that I can't find them).

Regarding the exonerate program: what would you consider to be a high enough percentage identity for this to work?

Finally, I'm assuming that the way to include taxon IDs in my results is via standalone BLAST?

Thank you again, this was already really helpful!

Reply written 9 months ago by michael0

There is no magic number for homology. It needs case-by-case definition based on the families you are interested in.

If a protein is identical in 4 out of 100 positions, but those 4 are all in the active site, is that the same enzyme or not? You cannot simply state a cutoff.

Reply written 9 months ago by jrj.healey