Hi! I started learning bioinformatics this year and have recently started a PhD looking at genome assemblies of highly nutrient-efficient Australian plants. I want to compare the protein sequences of different nutrient transporters with those of other species that have already been sequenced. I'm wondering if there is a pipeline that automates something similar to the steps below, or if there are better ways to measure how divergent a protein family is compared to other species.
- BLAST search in the NCBI database for the desired protein family, based on protein name (can set a similarity level, e.g. 80%).
- Record the number of protein paralogs and paralog similarity for species with their whole genome sequenced. Identify motifs for all the protein sequences (using MEME).
- Generate several consensus sequences for the protein sequences based on high similarity between proteins. This reduces the number of queries I have to run while still maintaining diversity to match against any protein sequences in my genome assembly that could be highly divergent.
- BLAST search the consensus sequences against my genome assembly (allowing matches to multiple loci).
- BLAST a number of the top hits at each hit locus in my genome assembly back against NCBI to confirm they are orthologs of the desired protein family.
- Compare similarity, number, and conservation of motifs in extracted/translated protein sequences from my genome assembly to protein sequences of other species. Either lots of figures or quantitative values summarising all of these together.
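To make the forward/back BLAST steps above more concrete, here is a minimal sketch of a reciprocal-best-hit check in Python. It is a common heuristic rather than an established pipeline, and it may be too strict for large multi-copy families. It assumes both searches were saved in BLAST+ tabular format (`-outfmt 6`, default column order) and that the hit loci have already been extracted and given their own IDs. The function names, the `min_pident` parameter, and all sequence IDs and scores in the example are hypothetical.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_lines, min_pident=0.0):
    """Return {query_id: (subject_id, bitscore)} with the top-scoring hit per query."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = dict(zip(COLUMNS, line.split("\t")))
        if float(fields["pident"]) < min_pident:
            continue  # below the chosen percent-identity threshold
        query, bitscore = fields["qseqid"], float(fields["bitscore"])
        if query not in best or bitscore > best[query][1]:
            best[query] = (fields["sseqid"], bitscore)
    return best


def reciprocal_best_hits(forward_lines, reverse_lines, min_pident=0.0):
    """Return sorted (reference_id, locus_id) pairs that are each other's best hit."""
    forward = best_hits(forward_lines, min_pident)  # reference -> assembly
    reverse = best_hits(reverse_lines, min_pident)  # assembly -> reference
    pairs = []
    for query, (subject, _score) in forward.items():
        if subject in reverse and reverse[subject][0] == query:
            pairs.append((query, subject))
    return sorted(pairs)


def row(*values):
    """Build one tab-separated line (helper for the toy example only)."""
    return "\t".join(str(v) for v in values)


# Made-up toy data: reference proteins searched against loci in the assembly ...
forward = [
    row("refA", "locus1", 85.0, 500, 75, 0, 1, 500, 1, 500, "1e-150", 900),
    row("refA", "locus2", 60.0, 480, 192, 2, 1, 480, 1, 480, "1e-60", 400),
    row("refB", "locus2", 90.0, 450, 45, 0, 1, 450, 1, 450, "1e-120", 700),
]
# ... and the extracted loci searched back against the reference set.
reverse = [
    row("locus1", "refA", 85.0, 500, 75, 0, 1, 500, 1, 500, "1e-150", 900),
    row("locus2", "refC", 92.0, 450, 36, 0, 1, 450, 1, 450, "1e-130", 800),
    row("locus2", "refB", 90.0, 450, 45, 0, 1, 450, 1, 450, "1e-120", 650),
]

print(reciprocal_best_hits(forward, reverse))  # [('refA', 'locus1')]
```

Here `locus2` is dropped because its best back-hit is `refC` rather than `refB`. For a family with many paralogs, a phylogeny-based approach would likely be more reliable than best-hit pairs alone.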