The setup: I have a large number of sequences from a viral pathogen, along with the associated epidemiological data, collected during a major disease outbreak. I strongly suspect that the varied epidemiology seen across the outbreak (case severity and outcomes, transmission rate, etc.) is the result of changes in the viral sequence.
The question: So, how do I best correlate these epidemiological data with sequence data? In the crudest sense, how can I point at a SNP and say "this is associated with more severe cases"?
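To make the "crudest sense" concrete, here is a minimal sketch of the naive per-SNP test: build a 2x2 table (allele present or absent vs. severe or mild case) and run Fisher's exact test on it. The function names and the `(sequence, severe)` sample format are my own assumptions for illustration, and this version deliberately ignores phylogeny, which is exactly the weakness I'm worried about.

```python
from math import comb


def snp_table(samples, position, allele):
    """Count a 2x2 table for one SNP.

    samples: iterable of (aligned_sequence, severe) pairs (hypothetical format).
    Returns (a, b, c, d) where
      a = allele present, severe    b = allele present, mild
      c = allele absent,  severe    d = allele absent,  mild
    """
    a = b = c = d = 0
    for seq, severe in samples:
        has_allele = seq[position] == allele
        if has_allele and severe:
            a += 1
        elif has_allele:
            b += 1
        elif severe:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Toy, made-up samples: (sequence, severe?)
toy = [("ACGT", True), ("ACGT", True), ("ACGT", True), ("ACGT", False),
       ("ATGT", True), ("ATGT", False), ("ATGT", False), ("ATGT", False)]
table = snp_table(toy, position=1, allele="C")
p_value = fisher_exact_two_sided(*table)
```

Running this over every variable site would also need a multiple-testing correction, and none of it addresses the relatedness of the samples.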
I'm concerned about phylogenetic inertia, i.e. false correlations caused by shared evolutionary history. A given sequence change may correlate with increased fatality simply because it was fixed in the lineage that happened to infect a weakened group of hosts.
Some characteristics that are technically non-heritable, e.g. location, will behave as if they were heritable, because closely related viruses tend to be sampled from the same place.
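A toy illustration of that concern, using entirely made-up counts: within each lineage the SNP has no effect on severity, yet the pooled data show a strong association because one lineage is both enriched for the SNP and circulating in weaker hosts. Stratifying by lineage (a Mantel-Haenszel pooled odds ratio here, which is my choice for the sketch, not a recommendation) removes the artefact, but it breaks down when the SNP is fixed within a lineage, which is the worst case described above.

```python
def odds_ratio(a, b, c, d):
    """Crude odds ratio for one 2x2 table; None when it is undefined."""
    if b * c == 0:
        return None
    return (a * d) / (b * c)


def pool(strata):
    """Collapse per-lineage tables into one table, ignoring lineage."""
    return tuple(sum(cells) for cells in zip(*strata))


def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio across strata (lineages).

    Each stratum is (a, b, c, d):
      a = SNP, severe   b = SNP, mild   c = no SNP, severe   d = no SNP, mild
    Returns None when no stratum is informative (e.g. SNP fixed per lineage).
    """
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    if den == 0:
        return None
    return num / den


# Made-up counts. Lineage A infected a weakened host group (mostly severe),
# lineage B did not; within each lineage the SNP makes no difference.
lineage_a = (18, 2, 9, 1)
lineage_b = (1, 9, 2, 18)
strata = [lineage_a, lineage_b]

crude = odds_ratio(*pool(strata))      # looks like a strong SNP effect
adjusted = mantel_haenszel_or(strata)  # effect vanishes within lineages

# Worst case: the SNP is fixed in lineage A and absent from lineage B.
fixed = [(18, 2, 0, 0), (0, 0, 2, 18)]
crude_fixed = odds_ratio(*pool(fixed))
adjusted_fixed = mantel_haenszel_or(fixed)  # None: nothing to compare within a lineage
```

Stratifying like this also requires deciding up front where one lineage ends and the next begins, which a tree-aware method would not.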
Solutions I've considered:
Tools from GWAS or similar: apart from the possible overkill of using these on such a short genome, I don't know of any GWAS tools that deal with the inertia problem.
Comparative analysis with independent contrasts: would be the obvious choice if I were dealing solely with character data. I could hack a suitable dataset together, say by treating each SNP locus as a character, but it seems ugly. Also, the state of useful software here is not good.
Selection: will tell me which sites are under selection, but not what might be correlated with that selection.
Compare controls: is something I've done before, but in this case it seems that deciding what to control for is pre-emptively deciding what won't correlate.
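For the "treat each SNP locus as a character" hack mentioned under comparative analysis, this is roughly the dataset I have in mind: a minimal sketch that takes an alignment and codes every variable site as a binary character (0 = majority base, 1 = anything else). The function name, the dict-based alignment format, and the decision to collapse multi-allelic sites to binary are all my own assumptions; the resulting matrix would still need to be fed, along with a tree, into whatever comparative-method software is used.

```python
from collections import Counter


def snp_characters(alignment):
    """Code variable alignment columns as binary characters.

    alignment: dict mapping sample name -> aligned sequence (equal lengths).
    Returns {position: {name: 0, 1 or None}} where 0 is the majority base,
    1 is any other unambiguous base, and None marks gaps or ambiguity codes.
    Invariant columns are skipped.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("sequences must be aligned to the same length")

    characters = {}
    for pos in range(length):
        column = {n: alignment[n][pos].upper() for n in names}
        counts = Counter(base for base in column.values() if base in "ACGT")
        if len(counts) < 2:
            continue  # invariant (or uninformative) site
        major = counts.most_common(1)[0][0]
        characters[pos] = {
            n: (int(base != major) if base in "ACGT" else None)
            for n, base in column.items()
        }
    return characters


# Tiny made-up alignment
toy_alignment = {
    "s1": "ACGT",
    "s2": "ACGT",
    "s3": "ATGT",
    "s4": "ACGA",
    "s5": "ACNT",
}
chars = snp_characters(toy_alignment)
```

Positions are 0-based here, and sites where only one unambiguous base is observed are dropped as uninformative.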