7.9 years ago
biotech ▴ 560
My previous questions were too broad: Automate protein family analysis, Scale up paralog synteny in 200 bacterial genomes.
My new first objective is to annotate the domains of a selected member of a protein family, here called Member_1 for clarity. It seems I have to retrieve all homologs of Member_1 from the database, then run the corresponding tool to annotate their domains.
The query is Member_1. Some caveats:
- The gene might have nucleotide variations, which has implications for a BLAST strategy
- The gene is a member of a large protein family, so I need to be accurate
Since I know up- and down-stream genes for Member_1, I thought about:
- finding the 5', Member_1 and 3' genes in the target genomes.
- printing the nucleotide sequence of the Member_1 targets, including the 5' and 3' genes.
- predicting protein domains for each retrieved sequence.
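The first step above could be sketched roughly as follows. This is only an illustration under strong assumptions: each target genome has already been reduced to an ordered list of gene labels (for example from best-hit assignments), and the labels `up`, `Member_1` and `down` as well as the function name are hypothetical. It treats the contig as linear, so a locus spanning the origin of a circular chromosome would be missed.

```python
def find_syntenic_loci(gene_order, upstream, target, downstream):
    """Return indices where `target` is directly flanked by the expected
    neighbours, in either orientation (the locus may be on the reverse strand)."""
    hits = []
    # Skip the first and last gene: they have only one neighbour on a linear contig.
    for i in range(1, len(gene_order) - 1):
        if gene_order[i] != target:
            continue
        flanks = (gene_order[i - 1], gene_order[i + 1])
        if flanks == (upstream, downstream) or flanks == (downstream, upstream):
            hits.append(i)
    return hits


# Hypothetical gene order for one target genome: the second Member_1-like
# gene lacks the expected neighbours and is therefore not reported.
genome = ["geneA", "up", "Member_1", "down", "geneB", "Member_1", "geneC"]
print(find_syntenic_loci(genome, "up", "Member_1", "down"))
```

Candidates that match the label but fail the flank test are probably worth keeping in a separate list rather than discarding, since they may be paralogs or rearranged loci.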
Here you can see the 13 members of the family (paralogs):
It is not really an answer but:
you can't always assume the gene will be located in the same place... sorry, missed the term 'synteny'.
You can detect candidate genes with HMMER. You can then do some phylogeny work and alignments to see how a particular gene evolves, whether it is conserved, or even whether part of your gene cluster is duplicated.
Thanks for suggesting HMMER. Since the proteins of the family are quite similar (in previous work we organized this complexity by sorting them into 13 groups based on a certain 3' protein domain), I thought about doing a HMMER search with the nucleotide sequence, including the 5' and 3' genes as well. That way the best hit should belong to the true homolog, with a much more significant e-value than the paralogs. If I search with the protein sequence alone, I will also get paralogs as targets, with similar e-values. Note that the nucleotide search strategy assumes the 5' and 3' genes are conserved, which might not always be true. What do you think? I would like to automate this task accurately, but it is difficult.
I am not sure whether you want to get all the proteins of your family or only a very restricted subset. The nucleotide search might not be the best choice if proteins are what interest you. I would suggest the following:
1. Head over to http://hmmer.janelia.org/search/hmmsearch. You can search using the taxonomy ids of your bacteria and use the advanced parameters to get more information on your hits. Try downloading all the results and keeping the 20-30 best hits for each organism.
2. Scale up later by using your own sequence database and a command-line search for each organism; use the preceding search results to build a new HMM if needed.
3. If you try to grab the 'true' homolog at once, you may miss things and jump to false conclusions. You can apply more restrictive conditions after a first pass. You also want representatives of your proteins from quite a few representative genomes before doing a serious search. Find the paralogs anyway, to check how much your synteny group matters.
4. If you want, compare the e-value for being the correct protein with the e-value for being part of the correct protein family. Try to compare the 3' domains you get. You can also get the nearest features using bedtools, but you need to have the 200 annotations...
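For the "keep the best hits per organism" part, a rough sketch is given here. It assumes the command-line search was run with `hmmsearch --tblout`, whose table is whitespace-delimited with the target name in the first column, the full-sequence E-value in the fifth and the bit score in the sixth. The `organism_of` mapping from target name to organism is a hypothetical input you would have to build from your own database, and the example rows are made up.

```python
from collections import defaultdict


def parse_tblout(lines):
    """Yield (target_name, full_sequence_evalue, bit_score) from tblout lines."""
    for line in lines:
        # Comment and header lines in the table start with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        yield fields[0], float(fields[4]), float(fields[5])


def best_hits_per_organism(hits, organism_of, n=20):
    """Group hits by organism and keep the n targets with the lowest E-values."""
    by_org = defaultdict(list)
    for target, evalue, _score in hits:
        by_org[organism_of[target]].append((evalue, target))
    return {org: [t for _, t in sorted(rows)[:n]] for org, rows in by_org.items()}


# Made-up rows in tblout column order, for illustration only.
example = [
    "# target name  accession  query name ...",
    "protA - fam - 1e-50 170.1 0.1 2e-50 169.0 0.1 1.0 1 0 0 1 1 1 1 hypothetical",
    "protB - fam - 3e-10 40.2 0.0 5e-10 39.5 0.0 1.0 1 0 0 1 1 1 1 hypothetical",
    "protC - fam - 1e-80 260.0 0.2 1e-80 259.8 0.2 1.0 1 0 0 1 1 1 1 hypothetical",
]
organism_of = {"protA": "org1", "protB": "org1", "protC": "org2"}
top = best_hits_per_organism(parse_tblout(example), organism_of, n=1)
print(top)
```

Keeping a generous `n` on the first pass leaves the paralogs in the result, which is the point of step 3.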
Seems feasible, but I got lost in step 3. Classifying hits will be essential in order to compare them and perform evolutionary analysis by group (there are three groups based on the 3' end), so I will have to look for proteins with a very similar 3' end. Maybe Biomart is not necessary, and it would be feasible to just build three more HMMs, search with them, and divide the proteins retrieved in step 2 into the three desired groups. What do you think? I'm not familiar with Biomart.
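The "three more HMMs" idea could look something like this once each retrieved protein has a bit score against each group-specific HMM. It is a sketch only: the group names, the scores and the `min_margin` threshold are hypothetical and would need tuning on real data.

```python
def assign_group(scores, min_margin=10.0):
    """Assign a protein to the group whose HMM scores best.

    `scores` maps group name -> bit score. Returns None when there are no
    scores or when the best score does not beat the runner-up by at least
    `min_margin`, so that ambiguous proteins get a manual look.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_margin:
        return None
    return ranked[0][0]


# Hypothetical bit scores of two proteins against the three 3'-end HMMs.
print(assign_group({"group1": 120.0, "group2": 40.0, "group3": 35.0}))
print(assign_group({"group1": 50.0, "group2": 45.0, "group3": 12.0}))
```

Proteins that come back as None are the ones where the 3' end does not clearly separate the groups.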
Well, you are more experienced than me with this protein. The only thing you really need to do is avoid discarding all the paralogs too quickly. For step 3, you need to find more stringent properties that will let you rank your hits. I can't really give you advice beyond this: try to see how your methods perform on a small, known data set. Take a genome you have previously studied (preferably one not included in your HMM) and see what you can do with it.
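Checking a method against a previously studied genome can be reduced to comparing two sets of identifiers. A minimal sketch, assuming you have the hand-curated family members for that genome (`known`) and the ids your pipeline returned (`predicted`); all names here are hypothetical.

```python
def precision_recall(predicted, known):
    """Return (precision, recall) of predicted ids against a curated set."""
    predicted, known = set(predicted), set(known)
    true_positives = len(predicted & known)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall


# Hypothetical example: the pipeline finds two of four curated members
# and reports one protein that is not in the curated set.
precision, recall = precision_recall(["p1", "p2", "p9"], ["p1", "p2", "p3", "p4"])
print(precision, recall)
```

Low recall suggests the thresholds are too strict (true members or paralogs are being discarded), while low precision suggests the ranking criteria in step 3 need to be more stringent.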
There is selection pressure on the whole protein sequence, but especially on the 3' end domain.