Question

Strategies for identifying as many homologs as possible

5.9 years ago

michael • 0

I am working on a follow-up to a study done in 2006. In it, the authors attempted to identify "all homologs" of a gene (using blastp and tblastn) and looked at its likely evolution. (The gene is very highly conserved across species.) There were a number of interesting questions that the authors could not answer due to a lack of available sequences, which is the gap I am now trying to fill.

I started out with the goal of identifying as many homologs as I could find, which proved to be trivial with blastp. So far, using blastp alone, I have found 30% of the homologs the original paper found (as well as many more that they did not find in 2006). However, I am having worse luck with tblastn:

When I use tblastn against the representative genome database I often encounter a CPU limit error message. Even when I do get results, they are problematic. For example, the only human hit is the entire chromosome the gene is located on rather than the gene itself. This is a problem for alignment and tree construction, since I need to

somehow reduce the total chromosome sequence to just the sequence of the gene, and
deal with splice sites, which I have no idea how to do. For human these are known, but this is not necessarily the case for other homologs. Any ideas on how to solve this?

In the original paper the authors searched just about every database they could get their hands on, one of them being an EST database. I have tried using tblastn against the EST database at NCBI, which raises a different set of difficulties:

I get a lot of hits from the same organism (many of these are from different tissues), how can I make sure that only one hit per organism is returned/exclude duplicates?
One of my fears is that some of these results might not be completely assembled. Should I attempt to do this by hand (the original authors' approach)? Are there any commonly used software tools for this?
Finally, the organism a hit belongs to is not consistently annotated (unlike in the protein nr database, where the organism name is in square brackets). Is there a way to download the sequences (ideally into Excel) with a separate column just for the organism name?

Finally, I am not clear on how people usually deal with multiple sequence alignments of nucleotide and amino acid sequences. Is it standard practice to translate nucleotides into AA before doing an alignment? This would require mRNA data, otherwise, how do people deal with splice sites?

I'm looking for some ideas on how people usually deal with these types of situations, as well as any potential software tools that may help (e.g. I know that BLAST+ may solve my CPU limit problem - I just have not had the time to learn it yet and I'm not sure it is necessary at this point).

Thank you for your help!

homology blast phylogenetics EST msa • 1.6k views

updated 5.9 years ago by Michael 54k • written 5.9 years ago by michael • 0

I’d maybe try using profile HMMs for searching if you’re interested in finding the more remote homologs too.

5.9 years ago by Joe 21k

score 3 · Answer 1 · 2018-06-08

Hi,

First, the number of sequences in the databases has increased drastically since 2006, so I wouldn't be surprised to find more and different results. I am surprised, though, that blastp recovered only about 30% of the hits the 2006 study found in total. If it is a highly conserved protein, it should also be well annotated in current genomes. Also, the original article should list all accessions used, so you could simply retrieve the orthologs by accession. Could you tell us which study and gene you are working with?

Also, I don't think you need to be concerned about missing a few orthologs here and there. I'd rather have a set of manually curated orthologs and invest additional search effort in the underrepresented taxa.

Some ideas:

When I use tblastn against the representative genome database I often encounter a CPU limit error message.

Limit the search to some taxa, at least exclude bacteria (I assume your gene is present in eukaryotes because of splicing).
It is definitely worth trying this query with a local BLAST installation. You should be able to download the BLAST database that is used by NCBI web BLAST as well.
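
As a rough sketch of what the local route could look like, the snippet below only assembles a `tblastn` command line in Python; it does not run anything. The database name `my_genomes_db` and the file names are placeholders, not real resources. The `-taxids` option assumes a recent BLAST+ release and a database that carries taxonomy information (older releases may need the taxid expanded to species level first), and the `sscinames` column assumes the BLAST taxonomy files are installed. Taxid 2759 is Eukaryota in NCBI Taxonomy.

```python
import shlex


def build_tblastn_command(query_fasta, db, out_path, taxids=None,
                          threads=4, evalue=1e-5):
    """Assemble (but do not run) a tblastn command as an argument list."""
    # Tabular output with coordinates and taxonomy columns for later scripting.
    outfmt = ("6 qseqid sseqid pident length qlen sstart send "
              "evalue bitscore staxids sscinames")
    cmd = [
        "tblastn",
        "-query", query_fasta,
        "-db", db,
        "-out", out_path,
        "-evalue", str(evalue),
        "-num_threads", str(threads),
        "-outfmt", outfmt,
    ]
    if taxids:
        # Restrict the search to the given taxa (assumes taxonomy-aware database).
        cmd += ["-taxids", ",".join(str(t) for t in taxids)]
    return cmd


# Placeholder file and database names, for illustration only.
cmd = build_tblastn_command("query_protein.fasta", "my_genomes_db",
                            "hits.tsv", taxids=[2759])
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run` once BLAST+ is installed.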

somehow reduce the total chromosome sequence to just the sequence of the gene, and I have no idea how to deal with splice sites. For the human this is obviously known, but this is not necessarily the case for other homologs. Any ideas on how to solve this?

Use the program exonerate to generate the spliced mRNA or the translated AA sequence, using a good template. This should work reasonably well for sequences with high identity.
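
Before handing a whole chromosome to a template-guided aligner, it can help to cut it down to the neighbourhood of the tblastn hits. The function below is a minimal sketch of that step, assuming you already have the subject start/end coordinates of the HSPs for one chromosome (for example the `sstart` and `send` columns of tabular output). The function name, the flank size, and the example coordinates are made up for illustration. It also assumes all HSPs belong to the same gene; distant paralogs on the same chromosome would inflate the region and need to be separated first.

```python
def hit_region(hsps, flank=1000, seq_len=None):
    """Return (start, end, strand) spanning all HSPs plus a flank.

    hsps: list of (sstart, send) pairs, 1-based, for ONE subject sequence.
    Assumption: a minus-strand hit is reported with sstart > send.
    """
    coords = [c for pair in hsps for c in pair]
    start = max(1, min(coords) - flank)   # do not run off the 5' end
    end = max(coords) + flank
    if seq_len is not None:
        end = min(end, seq_len)           # do not run off the 3' end
    strand = "minus" if hsps[0][0] > hsps[0][1] else "plus"
    return start, end, strand


# Made-up coordinates for two exon-sized HSPs on the same strand.
region = hit_region([(5000, 5300), (7000, 7200)], flank=1000)
print(region)
```

The resulting range can then be used to extract just that stretch of sequence as the target for the spliced alignment.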

In the original paper the authors searched just about every database they could get their hands on, one of them being an EST database.

ESTs are a somewhat dated concept now that we have NGS. Perhaps you could search transcriptome assemblies instead?

I get a lot of hits from the same organism (many of these are from different tissues), how can I make sure that only one hit per organism is returned/exclude duplicates?

Retrieve the taxon IDs together with your BLAST results, then sort by them so that you keep only the top-scoring hit(s) per taxid.
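
A minimal sketch of that idea in Python, assuming tab-separated BLAST output whose columns are in the order listed in `COLUMNS` below (that order is my choice for the example, set via a custom output format, not a BLAST default). The example rows are invented.

```python
import csv

# Assumed column order of the tabular BLAST output (hypothetical choice).
COLUMNS = ["qseqid", "sseqid", "pident", "length",
           "evalue", "bitscore", "staxids", "sscinames"]


def best_hit_per_taxid(lines):
    """Keep only the highest-bitscore hit for each taxid."""
    best = {}
    reader = csv.DictReader(lines, fieldnames=COLUMNS, delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        # staxids can hold several IDs separated by ';'; use the first one.
        taxid = row["staxids"].split(";")[0]
        score = float(row["bitscore"])
        if taxid not in best or score > float(best[taxid]["bitscore"]):
            best[taxid] = row
    return best


# Invented example rows: two human ESTs and one mouse EST.
example = [
    "q1\tEST_A\t90.0\t200\t1e-50\t350\t9606\tHomo sapiens",
    "q1\tEST_B\t95.0\t210\t1e-60\t400\t9606\tHomo sapiens",
    "q1\tEST_C\t80.0\t190\t1e-40\t300\t10090\tMus musculus",
]
best = best_hit_per_taxid(example)
for taxid, row in best.items():
    print(taxid, row["sseqid"], row["bitscore"])
```

In real use, `lines` would be the open results file rather than a list of strings.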

One of my fears is that some of these results might not be completely assembled. Should I attempt to do this by hand (the original authors' approach)? Are there any commonly used software tools for this?

You could require a minimum length for the sequence (e.g. 80% of the full length). However, if coverage is limited for some taxa, I would rather accept a fragmented sequence than none.
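
One way to express that rule in a script is sketched below: hits covering at least 80% of the query are kept, and for any taxon with no such hit the longest fragment is rescued. The dictionary keys (`taxid`, `aln_len`, `query_len`) and the example numbers are assumptions made for this illustration, and alignment length divided by query length is only a rough stand-in for true coverage.

```python
def filter_by_coverage(hits, min_cov=0.8):
    """Keep hits above min_cov; rescue the longest fragment for other taxa.

    hits: list of dicts with 'taxid', 'aln_len', 'query_len' (hypothetical keys).
    """
    kept = [h for h in hits if h["aln_len"] / h["query_len"] >= min_cov]
    covered = {h["taxid"] for h in kept}
    rescue = {}
    for h in hits:
        if h["taxid"] in covered:
            continue  # this taxon already has a near-full-length hit
        current = rescue.get(h["taxid"])
        if current is None or h["aln_len"] > current["aln_len"]:
            rescue[h["taxid"]] = h
    return kept + list(rescue.values())


# Invented example: one taxon with a full-length hit, one with fragments only.
hits = [
    {"taxid": "9606", "aln_len": 190, "query_len": 200},
    {"taxid": "9606", "aln_len": 50, "query_len": 200},
    {"taxid": "7955", "aln_len": 90, "query_len": 200},
    {"taxid": "7955", "aln_len": 120, "query_len": 200},
]
selected = filter_by_coverage(hits)
print(selected)
```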

Finally, the organism a hit belongs to is not consistently annotated (unlike in the protein nr database, where the organism name is in square brackets). Is there a way to download the sequences (ideally into Excel) with a separate column just for the organism name?

The taxids solve this when you run BLAST locally and define a custom output format. You may need to download the NCBI taxonomy database for this.
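
For the Excel part, a small sketch: the function below turns tab-separated BLAST output into a CSV file with a dedicated organism column, which Excel can open directly. It assumes the output was written with four columns in the order subject ID, taxid, scientific name, bit score (a hypothetical order chosen for this example), and the rows shown are invented.

```python
import csv
import io


def hits_to_csv(tsv_lines, out_handle):
    """Write subject ID, organism, taxid and bit score as CSV columns."""
    writer = csv.writer(out_handle)
    writer.writerow(["subject_id", "organism", "taxid", "bitscore"])
    for line in tsv_lines:
        # Assumed column order: sseqid, staxids, sscinames, bitscore.
        sseqid, taxid, organism, bitscore = line.rstrip("\n").split("\t")
        writer.writerow([sseqid, organism, taxid, bitscore])


# Invented example rows.
rows = [
    "EST_B\t9606\tHomo sapiens\t400",
    "EST_C\t10090\tMus musculus\t300",
]
buffer = io.StringIO()
hits_to_csv(rows, buffer)
print(buffer.getvalue())
```

With a real file, pass an open file handle (opened with `newline=""`) instead of the `StringIO` buffer.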

All of this is best solved with some scripting. Please let me know if this answer helped.