Question

Creating a Contingency Table (Gene Absence Presence)

0

Entering edit mode

4.2 years ago

biohacker_tobe ▴ 80

I currently have a dataframe that states particular gene clusters within genomes, this is defined as a well-formatted tab-delimited file, which looks basically like the dataframe below (Example):

    Gene Cluster     Genome
    ---------------------------------------
    GCF3372      Streptomyces_hygroscopicus
    GCF3371      Streptomyces_xiamenensis

The general idea based on this I want to measure the occurrence of a GCF per Genome, as well as the co-occurrence of a given GCF with others in various genomes

What type of visualizations also would anyone suggest for this type of data? The most what I thought best would be the use of a heatmap for this case!

gene R microbes dataframe genome • 3.4k views

ADD COMMENT • link 3.1 years ago by biohacker_tobe ▴ 80

0

Entering edit mode

Suitable data type can depend on how many genomes and how many gene clusters you have. Would a hash table or a dictionary work (Genome as key and GCFs as values)? I'd suggest Hash package in R or Dictionary in python

A binary matrix and Genomes as rows and GCFs as columns and two GCFs that co-occure in a genome would have value 1 for the row for that genome (The matrix would be sparse I assume). I'd suggest pandas package in python

ADD REPLY • link 4.2 years ago by Fatima ▴ 1000

0

Entering edit mode

Thanks for your input on this, I will try the revision of a binary matrix based on packages mentioned. I just want a binary matrix more or less so I can eventually plot a heatmap and conduct statistical analysis, even something simple such as chi squared to measure the proper correlation.

Thanks again @Fatima :D

ADD REPLY • link 4.2 years ago by biohacker_tobe ▴ 80

score 3 · Accepted Answer · 2020-02-02

3

Entering edit mode

4.2 years ago

stuart archer ▴ 140

There are a few options, you can convert to a 1/0 matrix using table and 'image' the matrix. The gplots library has a convenient heatmap.2 function that will also cluster the rows and columns.

I took some liberties with your data to include S.hygroscopicus twice so that clustering makes sense.

x <- data.frame(gc = c('GCF3372','GCF3450','GCF3371','GCF3371','GCF3371'), 
                strain = c('Streptomyces_hygroscopicus', 'Streptomyces_sp_Hm1069', 'Streptomyces_sp_MBT13', 'Streptomyces_xiamenensis','Streptomyces_hygroscopicus'))
y <- table(x)
image(y) # non-clustered
gplots::heatmap.2(y, margins=c(15,15)) # clustered
# for downstream stats; distances between GCs:
dGC <- dist(y)
# for downstream stats; distances between strains:
dStrains <- dist(t(y))

Often the margins settings and font size in heatmap.2 need tweaking depending on the setup of the device you are writing to, see: https://stackoverflow.com/questions/21427863/how-to-add-more-margin-to-a-heatmap-2-plot-with-the-png-device

Edit: there are various packages to make trees from distance matrices - if you do this you should emphasize that they do NOT necessarily represent evolutionary relationships.

ADD COMMENT • link 4.2 years ago by stuart archer ▴ 140

0

Entering edit mode

Hey @stuartarcher thanks for your input on this. This is exactly more or less what I was interested in seeing, sorry for such a novice question. I'm still getting familiar to how to manage this type of data. With this type of a heatmap, would my level of intensity in the heatmap be regulated by how absent or present a given cluster is in a genome or would this be just using

I wanted to follow up with another question now that you have mentioned the use of trees. I ran the BiG-SCAPE algorithm on my gene clusters to measure their phylogenetic relationship. This algorithm basically gives a measure of distance between gene clusters and is represented in a dataframe such as:

BGC1     BGC2    Distance
-----------------------------------------------
BGC31     BGC34     0.6
BGC34     BGC45     0.7
BGC34     BGC53     0.2
BGC53     BGC31     0.8

Would it be possible to construct a tree just based on this type of data? I have been able to create network visualizations from this data through Cytoscape but not possibly a tree. Any further suggestions?

Thanks once again fro your input :)

ADD REPLY • link 4.2 years ago by biohacker_tobe ▴ 80

0

Entering edit mode

Hi, so:

would my level of intensity in the heatmap be regulated by how absent or present a given cluster is in a genome

This is just 0 or 1, present or absent as far as I understand your data.

You can construct trees, but they are not indicative of evolutionary relationships. To do this you'd need to align the core genome genes (eg using roary), and make a tree from that (e.g. using RAXML).

You can still make trees from this data if you really want to though:

x <- data.frame(gc = c('GCF1','GCF2','GCF3','GCF3','GCF3', 'GCF1', 'GCF2','GCF3') , 
                         strain = c('SA', 'SB', 'SC', 'SD','SA', 'SD', 'SE','SE'))
y <- table(x)
d <- dist(t(y))
hc <- hclust(d)
plot(hc)

You can make fancier trees using the ape package:

ape::plot.phylo(x = ape::as.phylo(hc), type='unrooted')

The distance values between strains can be imaged as a heatmap too using reshape 2 and ggplot2:

library(ggplot2)
m <- as.matrix(d)
distDF <- reshape2::melt(m)
colnames(distDF) <- c('strain1', 'strain2', 'dist')

# order the strains using the hclust object, to place similar strains (mostly) near each other
distDF$strain1 <- factor(distDF$strain1, levels = hc$labels[order(hc$order)])
distDF$strain2 <- factor(distDF$strain2, levels = hc$labels[order(hc$order)])

# plot:
ggplot(distDF) + geom_raster(aes(x=strain1, y=strain2, fill=dist)) + theme_bw()

ADD REPLY • link 4.2 years ago by stuart archer ▴ 140

0

Entering edit mode

@stuart

When I was talking about constructing a tree, it was based on these distances as I mentioned above. I was just curious that if I could actually do clulstering from this based on distances between BGC types, in this case I'm not interested in the strains themselves:

BGC1     BGC2    Distance
-----------------------------------------------
BGC31     BGC34     0.6
BGC34     BGC45     0.7
BGC34     BGC53     0.2
BGC53     BGC31     0.8

Thanks again so much.

ADD REPLY • link 4.2 years ago by biohacker_tobe ▴ 80

0

Entering edit mode

@Stuart I have no way to thank you for your help on all of this, this is a big help for myself and to fine tune my project. I have been able to previously run an algorithm that measures the phylogenetic relationship between gene clusters, so the distances in this case are definitely based on how closely related these clusters are, for which I am able to actually measure the revolutionary relationships of these.

I was able to construct a binary table (1/0s) based upon the code you shared, getting something like this:

      S1 S2 S3 S4 S5 S6
-----------------------------
GCF1  0  0  1  1  0  0
GCF2  0  1  0  1  1  1 
GCF3  1  1  1  0  0  0  
GCF4  0  0  0  0  0  1

I wanted to do a statistical analysis for this, would it be possible to use something simple as a Chi Squared or a Pairwise analysis like Fisher's?

Thanks in advanced!

ADD REPLY • link 4.2 years ago by biohacker_tobe ▴ 80

0

Entering edit mode

The following tutorial might help you to decide if Chi-square test is a good test for your data or not as it has two useful sections (What is the Chi-square test for? and What is the Chi-square test NOT for?)

https://www.ling.upenn.edu/~clight/chisquared.htm

I came across the following paper some years ago. They used Chi-square test and protein-protein interactions to predict the sub-cellular localization of proteins. I know your data is different but I thought it might give you some ideas.

χ 2-score section:

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-S10-S20#Sec10

ADD REPLY • link 4.2 years ago by Fatima ▴ 1000

0

Entering edit mode

OK I think I understand a bit better. You have two different questions: 1) Making trees out of pre-calculated distances between your clusters that you generated with BiG-SCAPE, which have nothing to do with the presence/absence data.

For this I made what I think is a similar long-format table of pretend dist values, with 26 gene clusters (a-z):

df = as.data.frame(expand.grid(as.factor(letters), as.factor(letters)))
colnames(df)=c('GC1','GC2')
# get rid of reciprocal comparisons, this needs columns to be ordered *factors* with identical levels:
df=df[as.integer(df$GC1) > as.integer(df$GC2),] 
df$dist=runif(nrow(df))

Now turn your long-format table into wide-format and convert to a dist object:

df_wide <- reshape2::dcast(df, formula = GC1~GC2, value.var = 'dist')
rownames(df_wide) = df_wide$GC1
m <- as.matrix(df_wide[,-1])
d <- as.dist(m)
hc <- hclust(d)
plot(hc)

Note: use as.dist() NOT dist(). If your "m" matrix isn't symmetrical then not all combinations of GCFs were in your table, this is required for as.dist()

And 2), you have a question about a statistical test for the presence/absence table of GCs in strains, but I'm not sure exactly what hypothesis you want to test?

ADD REPLY • link 4.2 years ago by stuart archer ▴ 140

0

Entering edit mode

Sorry I didn't mean to have a lack of clarification in my questions in this regards. But it is as you have mentioned:

1) Unfortunately I am not able to make my matrix symmetric, for which I am not able to complete the tree with the dendrogram. Thank you though for the assistance on this.

2) In this case I am just trying to study the co-occurrence between GC pairs in a genome, supporting it both with visualizations (heatmap and networks). Basically I'm trying to find associations (GC pairs present together) or dissociations (GC pairs which are present apart, or avoid each other). This is where I want to support with statistical area, the most I was considering was with a more pairwise analysis, or revising literature I thought it would be adequate to use phi or chi squared however I'm not sure if this is the correct approach.

ADD REPLY • link 4.2 years ago by biohacker_tobe ▴ 80

1

Entering edit mode

Not sure about exactly how you'd build a contigency table here for chi2 or fishers. How about permuting the data and seeing whether the observed distance values fall outside the distribution for distance values from a 1000 random permutations

y=matrix(c(sample(c(T,F), 100, T)), nrow=5, ncol = 20)
y=rbind(y,y[5,])  # artificially make GC5 identical to GC6, this dist should be well below expected 
colnames(y)=paste0('S',1:20)
rownames(y)=paste0('GC', 1:6)
out <- vegan::permatswap(y, times = 1000, burnin = 20000, thin = 500)
rd <- c()
for (i in 1:length(out[[3]])){
  d=dist(out[[3]][[i]])
  rd=c(rd, as.vector(d))
}
par(mfrow=c(2,1))
hist(rd, breaks=50)
hist(as.vector(dist(y)), breaks=50) # observe that the dist=0 value (GC5 vs GC6 is well below any in the random distribution

Note: rather than taking all 15,000 dist values as your expected distribution as I have done above, it may be better to extract the 1000 distance values specifically from the GC pair of interest from the permuted data, and treat this as your background. Have to think about that one. Hopefully you could convert this to a Z-score for the real distance value.

ADD REPLY • link 4.2 years ago by stuart archer ▴ 140

0

Entering edit mode

I will check this out. It's just in my case there are no particular GC pairs of interest, I just want to look at the overall behavior for this? Thanks again stuart for taking time to answer my questions :)

ADD REPLY • link 4.2 years ago by biohacker_tobe ▴ 80

0

Entering edit mode

Hi I wanted to ask something about this heatmap you mentioned here, you mentioned I could do it on a matrix like this:

BGC1     BGC2    Distance
-----------------------------------------------
BGC31     BGC34     0.6
BGC34     BGC45     0.7
BGC34     BGC53     0.2
BGC53     BGC31     0.8

But can I convert this type of data into a table to be used accordingly, as it has three variables etc?

ADD REPLY • link 4.2 years ago by biohacker_tobe ▴ 80