3 months ago
keki • 0
I have a multifasta file with several orthologs from a phylum of bacteria. They share a large central domain and differ in their N- and C-terminal extensions, which contain no recognizable motif or domain. I'd like to extract these extensions (tails) automatically, rather than going through the sequences one by one and cutting out the subsequences by hand, so that I can align them. I imagine I need the coordinates of the central domain, but I'm not sure how to retrieve them for each sequence in a multifasta file.
Thanks in advance :)
If the central domain is recognizable by BLAST, then you can basically take the inverse of the "hit" to retrieve the tail sequences from a BLAST database of these orthologs using
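For illustration only, and not the command this reply originally pointed to, here is a minimal Python sketch of the "inverse of the hit" idea. It assumes the domain was used as the query against the orthologs, that the hits were saved in BLAST tabular format (`-outfmt 6`, default column order, where columns 9 and 10 are `sstart` and `send`), and that the sequence IDs in the hits match the first word of each FASTA header. The function names `parse_fasta`, `domain_spans` and `extract_tails` are hypothetical helpers made up for this example.

```python
def parse_fasta(lines):
    """Read FASTA lines into {sequence_id: sequence}."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the ID (assumption).
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(p) for n, p in parts.items()}


def domain_spans(blast_lines):
    """Collect the domain span per subject from BLAST tabular lines.

    Coordinates are 1-based and inclusive, as BLAST reports them.
    Several HSPs on the same subject are merged into one outer span.
    """
    spans = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        sid = fields[1]
        start, end = int(fields[8]), int(fields[9])
        lo, hi = min(start, end), max(start, end)
        if sid in spans:
            lo, hi = min(lo, spans[sid][0]), max(hi, spans[sid][1])
        spans[sid] = (lo, hi)
    return spans


def extract_tails(seqs, spans):
    """Return {sequence_id: (n_tail, c_tail)} for sequences with a hit."""
    tails = {}
    for sid, seq in seqs.items():
        if sid not in spans:
            continue  # no domain hit, so no coordinates to invert
        lo, hi = spans[sid]
        # Everything before the hit is the N tail, everything after is the C tail.
        tails[sid] = (seq[:lo - 1], seq[hi:])
    return tails
```

Passing open file handles works for both inputs, e.g. `extract_tails(parse_fasta(open("orthologs.fasta")), domain_spans(open("hits.tsv")))`, where both file names are placeholders. Sequences without a hit are skipped, so it is worth checking which IDs are missing from the result.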
Is the domain in a relatively fixed location, and do you have GenBank accession numbers? If so, you could use Entrez Direct to fetch the sequence like
I would clearly start with a Multiple Sequence Alignment and then parse this output. I am not very familiar with the tools (e.g. ClustalO) or the respective output formats, but orthologs are what these tools were written for.
While this may be one approach, it may turn out to be tricky. We don't know how long the sequences are. If the domain is relatively small compared to the tails, then an MSA may not help. The coordinates can also get complicated in aligned FASTA, because gap characters shift the column positions relative to the original sequences.
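To make the coordinate problem concrete, here is a rough Python sketch of one way to parse an aligned FASTA that has already been read into a dict of equal-length gapped strings. It treats the stretch from the first to the last densely occupied column as the core, which is only a heuristic and an assumption of this example, then strips the gaps from whatever lies on either side. The names `core_columns` and `split_tails` and the `min_occupancy` threshold are hypothetical, and `-` is assumed to be the only gap character.

```python
def core_columns(aligned, min_occupancy=0.9):
    """Return (start, end) alignment columns, end exclusive, of the core.

    A column counts as dense when at least min_occupancy of the
    sequences have a residue (not a gap) in it. The core is taken to
    run from the first dense column to the last one.
    """
    seqs = list(aligned.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must all have the same length")
    dense = [
        col for col in range(length)
        if sum(s[col] != "-" for s in seqs) / len(seqs) >= min_occupancy
    ]
    if not dense:
        raise ValueError("no column reaches the occupancy threshold")
    return dense[0], dense[-1] + 1


def split_tails(aligned, start, end):
    """Cut each aligned sequence at the core columns and drop the gaps.

    Returns {name: (n_tail, c_tail)} in ungapped, original residues.
    """
    return {
        name: (seq[:start].replace("-", ""), seq[end:].replace("-", ""))
        for name, seq in aligned.items()
    }
```

Because the cut is made in alignment columns and the gaps are removed afterwards, there is no need to convert columns back to per-sequence positions. The caveat raised above still applies: if the tails are long and the aligner smears them across the core, the occupancy heuristic can place the boundaries badly, so the threshold needs checking by eye on a few sequences.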