I am interested in mapping host responses to viral infection against a phylogenetic tree. In theory, the responses should cluster in agreement with the tree. I have a dozen or so isolates with a fairly well-resolved phylogeny, but that tree was built from a multiple sequence alignment (MSA) of nearly the whole genome.
The issue is that the responses I want to measure are likely driven by a specific subset of genes, not by the whole genome. A previous attempt, using fewer isolates and broader conditions, found no correlation between host response and the whole-genome alignment.
My concern is that the failure to map host responses onto the phylogeny of my isolates may not reflect the absence of such a relationship. With a phylogeny built from nearly whole-genome sequences, distances between key viral genes may be masked by genome-wide distances. That is, two isolates may be very similar in gene X but differ substantially elsewhere. If gene X is involved in the host response, then these seemingly distant viruses could look very similar when you measure the activity of gene X. This would weaken the link between the phylogeny and host response clustering.
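To illustrate the masking effect I am worried about, here is a minimal sketch with made-up sequences. The three isolates, the 40-column alignment, and the choice of the first 10 columns as "gene X" are all hypothetical, and the uncorrected p-distance is used only for simplicity, not because it is the right model for these viruses.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)

# Hypothetical toy alignment: columns 0-9 stand in for "gene X",
# columns 10-39 stand in for the rest of the genome.
GENE_X = slice(0, 10)
isolates = {
    "A": "ACGTACGTAC" + "A" * 30,
    "B": "ACGTACGTAC" + "T" * 15 + "A" * 15,  # identical to A in gene X, divergent elsewhere
    "C": "ACGTTCGAAC" + "A" * 30,             # differs from A only inside gene X
}

for other in ("B", "C"):
    whole = p_distance(isolates["A"], isolates[other])
    gene = p_distance(isolates["A"][GENE_X], isolates[other][GENE_X])
    print(f"A vs {other}: whole-genome = {whole:.3f}, gene X = {gene:.3f}")
```

In this toy case the whole-genome distance places A closer to C than to B, while the gene X distance places A and B as identical, which is exactly the kind of disagreement that could hide a real relationship with host response.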
With these viruses, many of the host-modulating proteins have been very well characterized. For every host protein of interest, I can readily identify which viral proteins affect it.
Would it be better to use the whole-genome phylogeny, or to build a new one from a putative list of target genes? I'm not sure it makes sense to cluster responses driven by a few viral genes against a phylogeny built from many viral genes.
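To make the comparison concrete, this is roughly how I imagine testing either option: a simple Mantel-style permutation test correlating a host-response distance matrix with a genetic distance matrix. This is only a sketch under my own assumptions. The matrices below are invented numbers for four hypothetical isolates, and the function names (`pearson`, `upper`, `mantel`) are mine, not from any package.

```python
import random
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def upper(d, order):
    """Upper-triangle entries of square matrix d after relabeling isolates by `order`."""
    return [d[order[i]][order[j]] for i, j in combinations(range(len(order)), 2)]

def mantel(d1, d2, n_perm=999, seed=0):
    """Return (r, one-sided p) by permuting the isolate labels of d1."""
    ident = list(range(len(d1)))
    y = upper(d2, ident)
    r_obs = pearson(upper(d1, ident), y)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        perm = ident[:]
        rng.shuffle(perm)
        if pearson(upper(d1, perm), y) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

# Invented distance matrices for four hypothetical isolates.
response = [[0, 1, 4, 5], [1, 0, 3, 4], [4, 3, 0, 2], [5, 4, 2, 0]]
gene_subset = [[0, .1, .4, .5], [.1, 0, .3, .4], [.4, .3, 0, .2], [.5, .4, .2, 0]]
whole_genome = [[0, .5, .2, .3], [.5, 0, .4, .1], [.2, .4, 0, .5], [.3, .1, .5, 0]]

r_gene, p_gene = mantel(gene_subset, response)
r_whole, p_whole = mantel(whole_genome, response)
print(f"gene subset:  r = {r_gene:.3f}, p = {p_gene:.3f}")
print(f"whole genome: r = {r_whole:.3f}, p = {p_whole:.3f}")
```

With only a dozen or so real isolates the number of distinct permutations is still large, but with the four toy isolates here the p-values are necessarily coarse, so the numbers are only meant to show the mechanics.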