Hi The_PyPanda,
First, a caveat: the information we most need in order to guide you to a successful conclusion is not provided in this post. We don't know, for instance, whether you have prioritized these variants, how you filtered them, what criteria you used if you did, and so forth. Without this information, we cannot really know whether you will obtain reasonable results; we can only assume (or hope) that these steps have been solid. If they haven't, the advice we give will be irrelevant at best and, at worst, will lead you down a bad road. In addition, there may be a magic bullet contained in that information: for example, we might be able to easily figure out why you are generating so many hits and then help you reassess the approach.
Finally, you say expression data is needed, but then you tell us you have generated DE pathways (somehow). How did you do this? The answer to that question may well be the magic bullet described above.
OK, caveats aside, let's assume you are good to go. I have a few distinct suggestions here:
1. Consider approaches to manage the pathway results obtained.
I don't doubt that you've come up with many pathways and hits, but most of those will be marginal associations, and most of those pathways will relate to one another. So it is worth asking: "How many coherent and non-redundant gene programs do these 1000+ pathways represent?" To answer this, you could try the following:
- before conducting pathway analysis, compute pairwise pathway similarity scores, then remove redundant pathways by eliminating one of each pair that has a similarity score > 0.9 or some other threshold (there are hundreds of pathways that barely differ from one another)
- after conducting pathway analysis, trim the pathways to only those that remain significant after family-wise error control
- using your own biological knowledge, manually curate your own pathways by including genes that pass criteria you have defined, then test these pathways (if you don't trust the huge number of "significant results")
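The first two suggestions above can be sketched in a few lines. This is only an illustration under stated assumptions: I use the Jaccard index as the similarity score and Holm's step-down procedure for family-wise error control, but other choices (overlap coefficient, kappa, plain Bonferroni) would work just as well. The function names, pathway names, gene names, and p-values are all made up for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two gene sets: shared genes / all genes."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_redundant(pathways, threshold=0.9):
    """Greedy filter: visit pathways from largest to smallest and keep one
    only if its similarity to every already-kept pathway is <= threshold."""
    kept = {}
    # Sort by size (descending), then by name so the result is deterministic.
    for name in sorted(pathways, key=lambda n: (-len(pathways[n]), n)):
        genes = set(pathways[name])
        if all(jaccard(genes, other) <= threshold for other in kept.values()):
            kept[name] = genes
    return kept


def holm(pvalues, alpha=0.05):
    """Holm step-down procedure: names of the tests that remain significant
    with the family-wise error rate controlled at alpha."""
    ordered = sorted(pvalues.items(), key=lambda kv: kv[1])
    m = len(ordered)
    significant = []
    for rank, (name, p) in enumerate(ordered):
        # The smallest p-value is compared to alpha/m, the next to alpha/(m-1), ...
        if p > alpha / (m - rank):
            break  # once one test fails, all larger p-values fail too
        significant.append(name)
    return significant


# Hypothetical toy data: "A" is contained in "A_big" (Jaccard = 10/11).
pathways = {
    "A_big": {f"g{i}" for i in range(1, 12)},
    "A": {f"g{i}" for i in range(1, 11)},
    "B": {"g20", "g21", "g22"},
}
kept = drop_redundant(pathways, threshold=0.9)

# Hypothetical enrichment p-values for three tested pathways.
pvals = {"A_big": 0.001, "B": 0.04, "C": 0.03}
survivors = holm(pvals, alpha=0.05)
```

The greedy filter keeps the larger pathway of a redundant pair, which is an arbitrary choice; you may prefer to keep the better-annotated one. Note also that the number of tests `m` should be the number of pathways you actually tested, which is one more reason to remove redundant pathways before testing rather than after.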
2. Consider approaches in the published literature
Many manuscripts have dealt with this problem, yet there is little agreement on how best to do it. You could, for example, consider elements of the approach taken in this manuscript, even though it does not concern a rare disease (RA has a prevalence of ~1%). As you've said, there are no hard and fast rules: these authors used an ensemble of many methods, then ordered variants by how many of the methods identified the gene/variant.
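The "order by how many methods agree" idea is easy to prototype. The sketch below is plain vote counting and is not the authors' actual pipeline; the method names and gene names are placeholders.

```python
from collections import Counter


def rank_by_support(calls_by_method):
    """Rank genes by the number of methods that nominate them."""
    votes = Counter()
    for genes in calls_by_method.values():
        votes.update(set(genes))  # set(): each method votes once per gene
    # Most-supported genes first; ties are broken alphabetically.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical output of three prioritization methods.
calls = {
    "method_a": ["GENE1", "GENE2", "GENE3"],
    "method_b": ["GENE2", "GENE3"],
    "method_c": ["GENE3", "GENE4"],
}
ranking = rank_by_support(calls)
```

Equal votes treat every method as equally trustworthy and independent, which is rarely true when several methods share the same underlying annotation, so treat the ranking as a rough ordering rather than a score.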
3. Use the published literature to reduce search space and decide which variants are most plausible
Because we don't know what the disease state is, I don't know offhand how rare this disease is or how well the pathophysiology is understood. Reading the published literature can help you decide:
- if you think the genes nominated by your filtering process are reasonable
- if you think the pathways nominated by the gene results are reasonable
4. Adapt a framework used for a related goal
You could consider the type of logic used, for example, to develop and justify gene burden testing, then adapt it to your question.
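To make the burden-testing logic concrete, here is a minimal sketch of one common form of it, adapted from single genes to a gene set: collapse each sample to "carrier" or "non-carrier" of at least one qualifying variant in the set, then compare carrier counts between cases and controls with a one-sided Fisher's exact test. It assumes you have controls and have already defined what a qualifying variant is; the sample IDs, gene names, and function names are hypothetical.

```python
from math import comb


def carriers_in_set(samples, gene_set):
    """Count samples with a qualifying variant in at least one gene of the set.
    `samples` maps sample ID -> set of genes hit by qualifying variants."""
    return sum(1 for genes_hit in samples.values() if gene_set & set(genes_hit))


def fisher_one_sided(case_carriers, n_cases, control_carriers, n_controls):
    """One-sided Fisher's exact test: probability of seeing at least
    `case_carriers` carriers among cases if carrier status were
    unrelated to case/control status (hypergeometric tail)."""
    total = n_cases + n_controls
    carriers = case_carriers + control_carriers
    upper = min(carriers, n_cases)
    tail = sum(
        comb(n_cases, k) * comb(n_controls, carriers - k)
        for k in range(case_carriers, upper + 1)
    )
    return tail / comb(total, carriers)


# Hypothetical toy cohort: 4 cases, 4 controls, one curated gene set.
cases = {"c1": {"GENE1"}, "c2": {"GENE2"}, "c3": {"GENE1", "GENE9"}, "c4": set()}
controls = {"k1": {"GENE9"}, "k2": set(), "k3": set(), "k4": set()}
gene_set = {"GENE1", "GENE2"}

p_value = fisher_one_sided(
    carriers_in_set(cases, gene_set), len(cases),
    carriers_in_set(controls, gene_set), len(controls),
)
```

Real burden tests usually also adjust for covariates such as ancestry and sequencing coverage, which this sketch ignores.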
Thank you LauferVA,
Your advice has been useful and I now have a lot to think about and read.
Also, I just wanted to clarify that I do not have differential expression data. Rather, I was stating that many tools unfortunately require it.
Can I ask whether you have methodology paper references for the pathway similarity scores and family-wise error control applied to gene lists?
Thank you once again