Question: Pipeline for comparing gene orthologs
Hi! I started learning bioinformatics this year and have recently begun a PhD looking at genome assemblies of highly nutrient-efficient Australian plants. I want to compare the protein sequences of different nutrient transporters with those of other sequenced species. I'm wondering whether there is a pipeline that automates something like the steps below, or whether there are better ways to measure how divergent a protein family is relative to other species.

  1. BLAST search of the NCBI database for the desired protein family, starting from the protein name (a similarity threshold can be set, e.g. 80%).
  2. Record the number of protein paralogs and paralog similarity for species with their whole genome sequenced. Identify motifs for all the protein sequences (using MEME).
  3. Generate several consensus sequences for the protein sequences based on high similarity between proteins. This reduces the number of queries I have to run while still maintaining diversity to match against any protein sequences in my genome assembly that could be highly divergent.
  4. BLAST search consensus sequences in my genome assembly (allowing matches to multiple loci).
  5. BLAST a number of the top hits at each hit locus in my genome assembly back against NCBI to confirm they are orthologs of the desired protein family.
  6. Compare similarity, number, and conservation of motifs in extracted/translated protein sequences from my genome assembly to protein sequences of other species. Either lots of figures or quantitative values summarising all of these together.
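
Steps 4 and 5 together resemble a reciprocal best hit check. Below is a minimal Python sketch of that idea, assuming both searches were saved in BLAST tabular format (`-outfmt 6` with the default twelve columns). The function names, the `min_pident` parameter, and all sequence IDs and scores in the toy data are hypothetical and only there for illustration.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, min_pident=0.0):
    """Return {query: (subject, bitscore)}, keeping the top-bitscore hit per query."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        if float(hit["pident"]) < min_pident:
            continue  # optional identity cutoff (hypothetical parameter)
        score = float(hit["bitscore"])
        query = hit["qseqid"]
        if query not in best or score > best[query][1]:
            best[query] = (hit["sseqid"], score)
    return best


def reciprocal_best_hits(forward, reverse):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    return sorted((a, b) for a, (b, _) in forward.items()
                  if reverse.get(b, (None,))[0] == a)


# Toy data only: made-up IDs and scores, not real BLAST results.
# Forward search: reference proteins against the genome assembly (step 4).
forward_tsv = (
    "PHT1_ref\tlocus_A\t78.5\t520\t112\t3\t1\t520\t1\t518\t1e-150\t540\n"
    "PHT1_ref\tlocus_B\t61.0\t500\t195\t6\t5\t504\t3\t500\t1e-90\t330\n"
    "NRT2_ref\tlocus_C\t70.2\t480\t143\t4\t1\t480\t1\t478\t1e-120\t450\n"
)
# Reverse search: assembly hits back against the reference database (step 5).
reverse_tsv = (
    "locus_A\tPHT1_ref\t78.5\t520\t112\t3\t1\t518\t1\t520\t1e-150\t540\n"
    "locus_B\tPHT1_ref\t61.0\t500\t195\t6\t3\t500\t5\t504\t1e-90\t330\n"
    "locus_C\tAMT1_ref\t72.0\t470\t131\t2\t1\t470\t1\t470\t1e-125\t460\n"
    "locus_C\tNRT2_ref\t70.2\t480\t143\t4\t1\t478\t1\t480\t1e-120\t450\n"
)

forward = best_hits(io.StringIO(forward_tsv))
reverse = best_hits(io.StringIO(reverse_tsv))
pairs = reciprocal_best_hits(forward, reverse)
print(pairs)  # [('PHT1_ref', 'locus_A')]
```

Note that a strict reciprocal best hit keeps only one locus per query, so a second copy such as `locus_B` is dropped even though its own best hit is the same reference protein. Since step 4 deliberately allows matches at multiple loci, the reverse table alone (every locus whose best hit falls in the desired family) may be the more useful output for counting paralogs.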
