Hi all,
I’m working on a gene family analysis and encountered an interesting situation that I’d appreciate your thoughts on.
I started with 8 query protein sequences from species A (all annotated as members of the same gene family) and used them to perform BLAST searches against the proteomes of species A, B, and C. This resulted in 47 hits from species A, 42 from species B, and 34 from species C. I then built a phylogenetic tree based on these 123 sequences, and as expected, the 8 original query sequences clustered into a distinct subtree comprising 31 sequences from across the three species.
Later, I discovered an additional copy in species A that shares the same gene family annotation but was not among the 47 BLAST hits. When I added this 9th sequence to the query set and re-ran the BLAST search, I still retrieved the same number of hits from species B and C, but species A now returned 49 sequences — indicating that the 9th sequence retrieved one new match (besides itself).
After rebuilding the phylogenetic tree with these 125 sequences, the 9 query sequences again formed a distinct subtree, which now includes 33 sequences; the 2 additional ones are the newly added sequence and its unique BLAST hit.
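For anyone who wants to reproduce the "which hits come only from the new query" check, here is a minimal sketch. It assumes the BLAST results were saved in the standard 12-column tabular format (`-outfmt 6`), where the first two columns are the query ID and the subject ID. The sequence IDs (`A_q1`, `A_q9`, `A_h10`, `A_h48`) and the numeric values in the sample rows are hypothetical placeholders, not my real data.

```python
import csv
import io


def subjects_by_query(tabular_text):
    """Map each query ID to the set of subject IDs it hit.

    Expects BLAST tabular text (-outfmt 6): tab-separated,
    column 1 = query ID, column 2 = subject ID.
    """
    hits = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, malformed, or comment lines
        hits.setdefault(row[0], set()).add(row[1])
    return hits


def unique_to_query(hits, new_query):
    """Return subjects hit by new_query and by no other query."""
    others = set().union(
        *(subjects for query, subjects in hits.items() if query != new_query)
    )
    return hits.get(new_query, set()) - others


# Hypothetical sample rows (IDs and scores are made up for illustration).
SAMPLE_ROWS = [
    ("A_q1", "A_q1", "100.0", "300", "0", "0", "1", "300", "1", "300", "0.0", "620"),
    ("A_q1", "A_h10", "45.2", "280", "150", "4", "5", "284", "9", "290", "2e-60", "210"),
    ("A_q9", "A_q9", "100.0", "310", "0", "0", "1", "310", "1", "310", "0.0", "640"),
    ("A_q9", "A_h48", "52.1", "295", "140", "3", "3", "297", "6", "301", "1e-75", "250"),
]
SAMPLE_TEXT = "\n".join("\t".join(row) for row in SAMPLE_ROWS)

hits = subjects_by_query(SAMPLE_TEXT)
new_only = unique_to_query(hits, "A_q9")
print(sorted(new_only))  # ['A_h48', 'A_q9']
```

With real output, the text would be read from the BLAST results file instead of `SAMPLE_TEXT`. In my case the equivalent check should return exactly two IDs: the 9th sequence itself and its single new match.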
Here’s the part I’m trying to understand:
- If the 9th sequence is so divergent that it wasn’t picked up by BLAST when using the original 8 queries, why does it still cluster within the same gene family subtree in the phylogenetic tree?
- Shouldn’t its divergence have caused it to fall outside that subtree?
It seems contradictory that the 9th sequence is both divergent enough to escape detection in a BLAST search, yet still similar enough to cluster tightly with the original family in a phylogenetic tree.
Thanks in advance!
Thank you for the response — your points make a lot of sense!
Yes, I did trim the alignment using trimAl with the -automated1 option before building the tree.
Regarding species divergence, species A diverged from species C and B approximately 35 and 45 million years ago, respectively.
I also checked the tree, and the branch leading to the two additional sequences is indeed longer than the others in the subclade (if I understand the tree correctly).
Thanks again for your insights — very helpful!