Retrieve subsequences from a list of gene sequences
8 weeks ago
keki • 0

Hello everyone!

I have a multi-FASTA file with several orthologs from a phylum of bacteria. They share a large central domain and differ in their N- and C-terminal extensions, which contain no recognizable motif or domain. I'd like to extract these extensions (tails) quickly, without going through the sequences one by one and cutting out the subsequences by hand, so that I can align them. I imagine I need the coordinates of the central core domain, but I'm not sure how to retrieve them for each gene sequence in a multi-FASTA file.

Thanks in advance :)

fasta • 294 views

If the central domain is recognizable by BLAST, then you can basically take the inverse of the "hit": make a BLAST database from these sequences, use the hit coordinates to work out the flanking regions, and retrieve those regions with blastdbcmd.
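
To illustrate the "inverse of the hit" idea, here is a minimal Python sketch. It assumes you already have the hit coordinates as a three-column, tab-separated table of sequence ID, start and end (1-based, inclusive), for example from BLAST tabular output restricted to those columns. The function names `parse_fasta`, `parse_hits` and `split_tails` are made up for this example, and if a sequence has several hits the sketch simply takes the widest span, which may not be what you want.

```python
def parse_fasta(text):
    """Return {id: sequence} from FASTA text; the id is the first word of the header."""
    seqs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {k: "".join(v) for k, v in seqs.items()}


def parse_hits(text):
    """Return {id: (start, end)} from tab-separated 'id, start, end' lines.

    Coordinates are assumed 1-based and inclusive. Multiple hits on the
    same sequence are merged into the widest span (an assumption).
    """
    hits = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        seq_id, start, end = line.split("\t")[:3]
        lo, hi = sorted((int(start), int(end)))
        if seq_id in hits:
            lo = min(lo, hits[seq_id][0])
            hi = max(hi, hits[seq_id][1])
        hits[seq_id] = (lo, hi)
    return hits


def split_tails(seqs, hits):
    """Return {id: (n_tail, c_tail)}: everything before and after the hit."""
    tails = {}
    for seq_id, (start, end) in hits.items():
        if seq_id not in seqs:
            continue  # hit refers to a sequence not present in the FASTA
        seq = seqs[seq_id]
        # start/end are 1-based inclusive, Python slices are 0-based half-open
        tails[seq_id] = (seq[:start - 1], seq[end:])
    return tails


if __name__ == "__main__":
    fasta = ">orthoA toy example\nAAACC\nCCCGG\n"
    table = "orthoA\t4\t8\n"
    for seq_id, (n_tail, c_tail) in split_tails(parse_fasta(fasta), parse_hits(table)).items():
        print(seq_id, n_tail, c_tail)
```

The tails can then be written out as two new FASTA files (one for the N-terminal extensions, one for the C-terminal ones) and aligned separately.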

Is the domain in a relatively fixed location, and do you have GenBank accession numbers? If so, you could use Entrez Direct to fetch a subsequence, like this:

$ efetch -db nuccore -id NC_000913 -seq_start 240 -seq_stop 257 -format fasta
>NC_000913.3:240-257 Escherichia coli str. K-12 substr. MG1655, complete genome
TAACGGTGCGGGCTGACG


I would start with a multiple sequence alignment (MSA) and then parse its output. I am not very familiar with the tools (e.g. ClustalO) or their output formats, but aligning orthologs is what these tools were written for.


While this may be one approach, it may turn out to be tricky. We don't know how long the sequences are. If the domain is relatively small compared to the tails, an MSA may not help. The coordinates can also get complicated in an aligned FASTA, because gap characters shift the column positions relative to the original sequences.
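
To show what the coordinate problem in an aligned FASTA looks like, here is a small Python sketch. It assumes the alignment did place the core domain in a contiguous block of columns and that you have picked those columns yourself (0-based, half-open); `tails_from_alignment` and `ungapped_coords` are names invented for this example.

```python
def tails_from_alignment(aligned, core_start, core_end):
    """Return (n_tail, c_tail) with gaps removed.

    aligned is one row of the alignment ('-' marks a gap);
    core_start/core_end are 0-based, half-open alignment columns.
    """
    n_tail = aligned[:core_start].replace("-", "")
    c_tail = aligned[core_end:].replace("-", "")
    return n_tail, c_tail


def ungapped_coords(aligned, core_start, core_end):
    """Map the core columns to 1-based inclusive positions in the ungapped sequence.

    Returns None if this row has only gaps inside the core columns.
    """
    before = len(aligned[:core_start].replace("-", ""))
    inside = len(aligned[core_start:core_end].replace("-", ""))
    if inside == 0:
        return None
    return before + 1, before + inside


if __name__ == "__main__":
    row = "--MK-TCCCC-CGH--"
    print(tails_from_alignment(row, 6, 12))
    print(ungapped_coords(row, 6, 12))
```

The same columns map to different ungapped positions in every row, which is why the per-sequence coordinates have to be computed row by row rather than read directly off the alignment.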