Question

Statistical Test Between Gene Enrichments

0

10.4 years ago

Adrian Pelin ★ 2.6k

Hello,

I have two lists of variants that affect different genes. If I run a KOG enrichment on each, I get the gene categories of the genes affected by that variant list.

Is there any way to test whether the two variant lists affect different gene categories?

Adrian

vcf • 2.3k views

ADD COMMENT • link updated 10.4 years ago by Charles Warden 8.2k • written 10.4 years ago by Adrian Pelin ★ 2.6k

score 2 · Answer 1 · 2013-11-27

2

10.4 years ago

Charles Warden 8.2k

For a given category, you can do a Fisher's exact test on the 2x2 table of in-category/out-of-category counts for list 1 versus in-category/out-of-category counts for list 2. You can apply that comparison across all categories and then apply a false discovery rate correction.

You can then check for overlap with your current category lists.

ADD COMMENT • link 10.4 years ago by Charles Warden 8.2k

0

That sounds great, thank you for your answer. Any advice on tools?

EDIT: Would this approach in R work?

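# 2x2 contingency table (example from the fisher.test help page)
Convictions <- matrix(c(2, 10, 15, 3),
                      nrow = 2,
                      dimnames = list(c("Dizygotic", "Monozygotic"),
                                      c("Convicted", "Not convicted")))
# Print the table to check the layout
Convictions

# One-sided Fisher's exact test (odds ratio less than 1)
fisher.test(Convictions, alternative = "less")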

ADD REPLY • link 10.4 years ago by Adrian Pelin ★ 2.6k

0

Yes, that is the right strategy. You'll also need to use apply or a for loop to cycle through your categories (unless you are only interested in a small number of them).
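A minimal sketch of such a loop, assuming two named vectors of per-category gene counts (`list1_counts`, `list2_counts`) and the total number of annotated genes analysed in each list (`n1`, `n2`). All of these names and numbers are hypothetical placeholders, not results from this thread:

```r
# Hypothetical per-category gene counts for each variant list
list1_counts <- c(A = 104, B = 56, C = 25, J = 170)
list2_counts <- c(A = 60,  B = 70, C = 22, J = 90)

# Hypothetical totals: all annotated genes analysed in each list
n1 <- 1000
n2 <- 800

# One Fisher's exact test per category:
# in-category/out-of-category for list 1 versus list 2
p_values <- sapply(names(list1_counts), function(k) {
  tab <- matrix(c(list1_counts[[k]], n1 - list1_counts[[k]],
                  list2_counts[[k]], n2 - list2_counts[[k]]),
                nrow = 2,
                dimnames = list(c("in_category", "not_in_category"),
                                c("list1", "list2")))
  fisher.test(tab)$p.value
})

# Benjamini-Hochberg false discovery rate correction across categories
fdr <- p.adjust(p_values, method = "BH")

# Collect the results, most significant categories first
results <- data.frame(category = names(p_values),
                      p_value  = p_values,
                      fdr      = fdr)
results[order(results$fdr), ]
```

`sapply` is used here instead of `apply` because the counts are held in vectors rather than a matrix; a `for` loop over `names(list1_counts)` would work equally well.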

ADD REPLY • link 10.4 years ago by Charles Warden 8.2k

0

Well, if column 1 is gene list 1 and column 2 is gene list 2, and the rows are the different functional categories of genes, then one test should be enough, right?

I am currently screening for these categories:

#KOG class      count   description
A       104     RNA processing and modification
B       56      Chromatin structure and dynamics
C       25      Energy production and conversion
D       120     Cell cycle control, cell division, chromosome partitioning
E       19      Amino acid transport and metabolism
F       19      Nucleotide transport and metabolism
G       40      Carbohydrate transport and metabolism
H       6       Coenzyme transport and metabolism
I       39      Lipid transport and metabolism
J       170     Translation, ribosomal structure and biogenesis
K       143     Transcription
L       114     Replication, recombination and repair
M       20      Cell wall/membrane/envelope biogenesis
N       1       Cell motility
O       168     Posttranslational modification, protein turnover, chaperones
P       16      Inorganic ion transport and metabolism
Q       10      Secondary metabolites biosynthesis, transport and catabolism
R       141     General function prediction only
S       56      Function unknown
T       85      Signal transduction mechanisms
U       111     Intracellular trafficking, secretion, and vesicular transport
V       8       Defense mechanisms
W       5       Extracellular structures
Y       9       Nuclear structure
Z       39      Cytoskeleton

ADD REPLY • link 10.4 years ago by Adrian Pelin ★ 2.6k

0

It depends upon your question.

If you were asking whether the overall distributions differ, then yes. However, I don't think this is what you want. For example, you would get a significant result if one of your gene lists was simply twice as large as the other. Likewise, the groups are probably not independent: I would bet "Chromatin structure and dynamics" has a lot of overlap with "Transcription".

For example, let's say the results above are for list #1. You also need to know the total number of genes used for the analysis (let's say that was 1000). For "RNA processing and modification", the counts for list #1 would be 104 and 896 (1000 - 104). You would then need to know the number of genes in this category for list #2 (let's call this X) and the total number of genes used for functional enrichment analysis for list #2 (let's call this Y).

The comparison for this category is 104-to-896 versus X-to-(Y-X). You should do this for all classes A-Z, unless you know that only a couple of them showed statistically significant enrichment for either list #1 or list #2. I would say a different proportion in list #1 versus list #2 doesn't matter if neither varies from the background frequency.
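To make the arithmetic concrete, here is a sketch for the "RNA processing and modification" category. The 104 and the assumed total of 1000 are the illustrative figures from this reply; the values given to `X` and `Y` are made-up placeholders and should be replaced with real counts for list #2:

```r
# List 1: 104 genes in the category out of an assumed 1000 analysed genes
in1  <- 104
out1 <- 1000 - in1   # 896

# List 2: placeholder values, replace with real counts
X <- 60     # genes in this category in list 2 (hypothetical)
Y <- 900    # total genes analysed for list 2 (hypothetical)

# 2x2 table: 104-to-896 versus X-to-(Y - X)
tab <- matrix(c(in1, out1, X, Y - X), nrow = 2,
              dimnames = list(c("in_category", "not_in_category"),
                              c("list1", "list2")))
tab

# Two-sided test: does the in-category proportion differ between the lists?
res <- fisher.test(tab)
res$p.value
res$estimate   # conditional maximum-likelihood estimate of the odds ratio
```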

ADD REPLY • link 10.4 years ago by Charles Warden 8.2k

0

What if I develop a script that, rather than counting the number of genes falling into each category of processes, counts the number of unique KOGs found in those genes? That way the comparison is fair, because we are looking at the number of unique KOGs.

Some KOGs fall into multiple classes, so I would discard those.
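A rough sketch of that counting step, assuming an annotation table with one row per gene-to-KOG-to-class assignment. The column names (`gene`, `kog`, `class`) and the toy rows are hypothetical:

```r
# Toy annotation table: one row per (gene, KOG, class) assignment
annot <- data.frame(
  gene  = c("g1", "g2", "g3", "g4", "g5", "g5"),
  kog   = c("KOG0001", "KOG0001", "KOG0002", "KOG0003", "KOG0004", "KOG0004"),
  class = c("A", "A", "J", "J", "K", "L"),
  stringsAsFactors = FALSE
)

# Keep one row per distinct KOG/class pair, so that several genes
# sharing the same KOG are counted only once
kog_class <- unique(annot[, c("kog", "class")])

# Discard KOGs that are assigned to more than one class
n_classes <- table(kog_class$kog)
single    <- kog_class[kog_class$kog %in% names(n_classes)[n_classes == 1], ]

# Number of unique KOGs per class
unique_kog_counts <- table(single$class)
unique_kog_counts
```

In this toy table, KOG0004 is dropped because it appears under both K and L. If these counts are then used in a Fisher's exact test, the totals for each list would presumably need to be counted in unique KOGs as well, so that both rows of the table use the same unit.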

ADD REPLY • link 10.4 years ago by Adrian Pelin ★ 2.6k