Retrieve homologs of a certain protein from a large database
6.2 years ago
biotech ▴ 540

My previous questions were too broad: "Automate protein family analysis" and "Scale up paralog synteny in 200 bacterial genomes".

My new first objective is to annotate the domains of one selected member of the protein family, called Member_1 here for clarity. It seems I have to retrieve all homologs of Member_1 in the database, then run the corresponding tool to annotate the domains.

The query is Member_1. Some caveats:

-The gene might have nucleotide variations, which has implications for a BLAST strategy

-The gene is a member of a large protein family, so I need to be accurate

Since I know the upstream and downstream genes of Member_1, I thought about:

-finding the 5' gene, Member_1 and the 3' gene in the target genomes

-printing the nucleotide sequence of the Member_1 targets, including the 5' and 3' genes

-predicting protein domains for each retrieved sequence.
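
The three steps above could be sketched roughly as follows. This is only a minimal illustration, assuming each genome has already been reduced to a list of genes with a family label and coordinates; the `Gene` record, the labels `geneA` and `geneB`, and both function names are hypothetical, and strand handling is ignored.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    label: str  # hypothetical family label from a prior homology search
    start: int  # 0-based start on the contig
    end: int    # end on the contig, exclusive


def syntenic_regions(genes: List[Gene], upstream: str, target: str,
                     downstream: str) -> List[Tuple[int, int]]:
    """Return (start, end) spans where target sits directly between both flanks.

    The flanks are accepted in either order, because the locus may lie on
    the reverse strand in some genomes.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    regions = []
    # Slide a window of three consecutive genes along the contig.
    for left, mid, right in zip(ordered, ordered[1:], ordered[2:]):
        if mid.label != target:
            continue
        flanks = (left.label, right.label)
        if flanks in ((upstream, downstream), (downstream, upstream)):
            regions.append((left.start, right.end))
    return regions


def extract(contig: str, region: Tuple[int, int]) -> str:
    """Slice the nucleotide sequence covering the 5' gene, target and 3' gene."""
    start, end = region
    return contig[start:end]
```

A copy of Member_1 that lacks the expected neighbours is simply not reported, so this only finds loci where the synteny is conserved.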

Here you can see the 13 members of the family (paralogs):


It is not really an answer but:

  • you can't always assume the gene will be located in the same place (sorry, I missed the term 'synteny')
  • is there a lot of selection pressure on these genes?
  • http://www.pantherdb.org/ : you won't get all the genomes you want but it can be a starting place to find orthologs
  • if there are a limited number of proteins, BLAST can be nice. However, in your case you should take a good look at HMMER (http://hmmer.janelia.org/). It can even detect remote homologs; all you need are some examples of the gene family you want.

You can detect candidate genes with HMMER. You can then do some phylogeny work and alignments to see how a particular gene evolves, whether it is conserved, or even whether part of your gene cluster is duplicated.
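
For a local run, the HMMER part could look roughly like the sketch below. It assumes HMMER 3 is installed and on the PATH; the file names used in the usage note are hypothetical placeholders, and the 1e-5 E-value cutoff is an arbitrary starting point rather than a recommendation.

```python
import shutil
import subprocess
from typing import List


def hmmbuild_cmd(alignment: str, hmm_out: str) -> List[str]:
    """Command that builds a profile HMM from a multiple sequence alignment."""
    return ["hmmbuild", hmm_out, alignment]


def hmmsearch_cmd(hmm: str, seqdb: str, tblout: str,
                  evalue: float = 1e-5) -> List[str]:
    """Command that searches a protein FASTA database with the profile.

    --tblout writes one line per target sequence; -E is the reporting
    E-value threshold.
    """
    return ["hmmsearch", "--tblout", tblout, "-E", str(evalue), hmm, seqdb]


def run(cmd: List[str]) -> None:
    """Run a command, failing early if the program is not installed."""
    if shutil.which(cmd[0]) is None:
        raise RuntimeError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)
```

Usage would be something like `run(hmmbuild_cmd("family.sto", "family.hmm"))` followed by `run(hmmsearch_cmd("family.hmm", "genomes.faa", "hits.tbl"))`.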


Thanks for suggesting HMMER. Since the proteins of the family are quite similar (in previous work we organized this complexity into 13 groups based on a certain 3' protein domain), I thought about doing a HMMER search with the nucleotide sequence, including the 5' and 3' genes as well. That way the best hit should belong to the true homolog, with a much more significant e-value than the paralogs. If I do the HMMER search with only the protein sequence, I will also get paralogs as targets, with similar e-values. Note that the nucleotide search strategy assumes the 5' and 3' genes are conserved, which might not always be true. What do you think? I would like to automate this task accurately, but it is difficult.


I am not sure whether you want to get all the proteins of your family or only a very restricted subset. The nucleotide search might not be the best option if it is the proteins that interest you. I would suggest that:

  • you build an alignment and an HMM of the proteins that interest you. Find some orthologs in pantherdb for starters, or use your own
  • head over to http://hmmer.janelia.org/search/hmmsearch. You can then search using the taxonomy IDs of your bacteria, and use the advanced parameters to get more information on your hits. You can try downloading all the results and taking the 20-30 best values for each organism.
    Scale up later by using your own sequence database and a command-line search for each organism; use the preceding search results to create a new HMM if needed.
  • try to classify the hits: are they part of your synteny group, and do they match your protein of interest? You can simply parse the GIs in the hmmscan results and then grab the 3' part of your candidates from Biomart. Doing long alignments or juggling 200 genome annotation files will be unpleasant.
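
Taking the best hits per organism from a local search could be sketched like this. It assumes the whitespace-separated per-target table that HMMER 3 writes with `--tblout` (target name in the first column, full-sequence E-value and bit score in the fifth and sixth); the `organism_of` mapping from target ID to organism is hypothetical and would have to come from your own sequence database.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    target: str
    evalue: float  # full-sequence E-value
    score: float   # full-sequence bit score


def parse_tblout(lines: Iterable[str]) -> List[Hit]:
    """Parse the lines of a --tblout file, skipping comments and blank lines."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # 0: target name, 4: full-sequence E-value, 5: full-sequence score
        hits.append(Hit(fields[0], float(fields[4]), float(fields[5])))
    return hits


def best_hits_per_organism(hits: Iterable[Hit], organism_of: Dict[str, str],
                           n: int = 20) -> Dict[str, List[Hit]]:
    """Keep the n highest-scoring hits for each organism."""
    by_org = defaultdict(list)
    for hit in hits:
        by_org[organism_of.get(hit.target, "unknown")].append(hit)
    return {
        org: sorted(group, key=lambda h: h.score, reverse=True)[:n]
        for org, group in by_org.items()
    }
```

Ranking uses the bit score rather than the E-value, so that hits whose E-value underflows to 0 are still ordered.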

If you try to grab the 'true' homolog in one go, you may miss things and jump to false conclusions. You can apply more restrictive conditions after a first pass. You also want to get some representatives of your proteins in quite a few representative genomes before doing a serious search. Find the paralogs anyway, to check the importance of your synteny group.

If you want, compare the e-value of being the correct protein to the e-value of being part of the correct protein family. Try to compare the 3' domains you get. If you want, you can also get the nearest features using bedtools, but you need to have the 200 annotations...


That seems feasible, but I got lost at step 3. Classifying the hits will be essential in order to compare them and perform evolutionary analysis by group (there are three groups based on the 3' end), so I will have to look for proteins with a very similar 3' end. Maybe Biomart is not necessary: building three more HMMs, searching, and dividing the proteins retrieved in step 2 into the three desired groups might be enough. What do you think? I'm not familiar with Biomart.
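
The idea of dividing hits with three group HMMs could be sketched as below, assuming each protein has already been scored against every group HMM. The group names, the `assign_groups` function and the 10-bit margin are hypothetical; the margin in particular would need tuning on real data.

```python
from typing import Dict, Optional


def assign_groups(scores: Dict[str, Dict[str, float]],
                  min_margin: float = 10.0) -> Dict[str, Optional[str]]:
    """Assign each protein to the group whose HMM gives the highest bit score.

    A protein is marked "ambiguous" when the best score does not beat the
    runner-up by at least min_margin bits, and None when it has no scores.
    """
    assignments: Dict[str, Optional[str]] = {}
    for protein, by_group in scores.items():
        ranked = sorted(by_group.items(), key=lambda kv: kv[1], reverse=True)
        if not ranked:
            assignments[protein] = None
            continue
        best_group, best_score = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
        if best_score - runner_up >= min_margin:
            assignments[protein] = best_group
        else:
            assignments[protein] = "ambiguous"
    return assignments
```

Keeping the ambiguous cases separate, instead of forcing them into a group, leaves them available for manual inspection.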


Well, you are more experienced than me with this protein. The only thing you really need to do is not discard all the paralogs too quickly. For step 3, you need to find more stringent properties that allow you to rank your hits. I can't really give you more advice than this: see how your methods perform on a small, known data set. Take a genome you have previously studied (preferably one not included in your HMM) and see what you can do with it.
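
One simple way to score such a test is precision and recall against the homologs you already know in that genome. This is a minimal sketch; the function name and the ID sets in the usage note are made up for illustration.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], known: Set[str]) -> Tuple[float, float]:
    """Compare predicted homolog IDs with a curated set for one genome.

    precision = fraction of predictions that are correct
    recall    = fraction of known homologs that were recovered
    """
    true_positives = len(predicted & known)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall
```

For example, `precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})` gives a precision of 0.5 and a recall of 2/3: low precision means paralogs are leaking in, low recall means the search is too strict.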


There is selection pressure on the whole protein sequence, but especially on the 3' end domain.
