How Can I Generate A Venn Diagram In R For A Given Csv File

12.2 years ago

Payal ▴ 40

Hello,

My aim is to find the names that are common between A and B, or between A and several other columns.

I have a CSV file which looks like this...

Names                    A             B           C          D   Sum
NPAS4                    1             1           1                3
FAM120C                  1             1                            2
ROBO4                                              1                1
PRODH                    1             1           1                3
AQP4                                               1          1     2

Now I want to create three kinds of Venn diagrams:

Venn diagram of the names common to all four attributes, e.g. (A, B, C, D)
Venn diagrams of the names common to each set of three attributes, e.g. [(A,B,C), (B,C,D), ...]
Venn diagrams of the names common to each pair of attributes, e.g. [(A,B), (B,C), ...]

I wrote the following code with the help of this website (http://www.ats.ucla.edu/stat/r/faq/venn.htm):

>library(limma)
>hsb2 <- read.csv("my.csv")
>attach(hsb2)
>hw <- (A = 1)
>hm <- (B = 1)
>hr <- (C = 1)
>c3 <- cbind(hw, hm, hr)
>a <- vennCounts(c3)
>a

The output looks like this:

      hw hm hr Counts
[1,]  0  0  0      0
[2,]  0  0  1      0
[3,]  0  1  0      0
[4,]  0  1  1      0
[5,]  1  0  0      0
[6,]  1  0  1      0
[7,]  1  1  0      0
[8,]  1  1  1      1
attr(,"class")
[1] "VennCounts"

Why is it not giving the actual counts, as in the Sum column of the CSV file?

Please help!

r statistics • 27k views

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 12.2 years ago by Payal ▴ 40

Dear Payal, if you want a Venn diagram of any kind, the '1's and '0's can make it confusing to create. If you instead convert your CSV file so that each '1' is replaced with the respective ID (e.g. NPAS4, FAM120C) and delete all the '0's, I can help you get what you want.
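
For what it's worth, here is a minimal sketch of that conversion in base R. The data frame is transcribed by hand from the example table in the question (blank cells entered as 0), and the names `membership`, `set_columns` and `name_lists` are made up for illustration.

```r
# Example data transcribed from the question; blank cells are treated as 0.
membership <- data.frame(
  Names = c("NPAS4", "FAM120C", "ROBO4", "PRODH", "AQP4"),
  A = c(1, 1, 0, 1, 0),
  B = c(1, 1, 0, 1, 0),
  C = c(1, 0, 1, 1, 1),
  D = c(0, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

set_columns <- c("A", "B", "C", "D")

# For each set column, keep the names whose entry is 1.
name_lists <- lapply(membership[set_columns], function(flag) {
  membership$Names[!is.na(flag) & flag == 1]
})

print(name_lists)

# Names shared by two sets, e.g. A and B.
common_ab <- intersect(name_lists$A, name_lists$B)
print(common_ab)
```

The resulting list of character vectors is the kind of input that list-based tools (for example Venny, mentioned further down the thread) expect.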

ADD REPLY • link 12.2 years ago by Naren ★ 1.0k

12.2 years ago

Alex Reynolds 36k

Because it's not always possible to use a Venn diagram (a circular one that could be made in R) to show overlaps between three or four sets, I'll suggest something a little different.

I came up with something I call an "Eulergrid" which shows a bar graph, where each bar is an element in the power set of intersected sets, and a grid of overlap cases underneath (e.g., for three sets: A, B, C, A ∩ B, B ∩ C, A ∩ C, A ∩ B ∩ C).

The bar graph shows the overlap cardinalities between set intersections contained in the power set. The grid shows the intersection between one or more sets, and is aligned to the value shown in the bar graph column. The bar graph is sorted by overlap cardinality, presented from left to right, from least to greatest cardinality. (I leave out visualizing the empty set, although strictly speaking this is also a valid subset.)

While an Eulergrid is admittedly less intuitive to read than a circular Venn diagram, it can always show all true overlaps between all the sets, and without adding distortion or visual errors from "impossible" Venn overlaps.

The R script used to make Eulergrids will scale up to however many sets you need to show intersections for, but the figure gets exponentially wider, because the number of possible intersections grows as a power of 2 (three sets have eight power set subsets, four sets have sixteen, five sets have thirty-two, etc.).
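
To make that growth concrete, here is a small sketch that lists the non-empty intersections for a given set of names using base R's `combn()`. The helper name `intersection_labels` is hypothetical and is not part of the script; it only mirrors the ordering the script uses (singles first, then pairs, then triples, and so on).

```r
# Hypothetical helper: list every non-empty intersection of the given sets,
# singles first, then pairs, triples, and so on.
intersection_labels <- function(set_names) {
  unlist(lapply(seq_along(set_names), function(k) {
    combn(set_names, k, FUN = paste, collapse = " ^ ")
  }))
}

labels3 <- intersection_labels(c("A", "B", "C"))
print(labels3)

# Number of bars an Eulergrid would need for 3, 4 and 5 sets: 2^n - 1,
# i.e. the power set minus the empty set.
for (n in 3:5) {
  cat(n, "sets:", length(intersection_labels(LETTERS[1:n])), "non-empty subsets\n")
}
```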

To demonstrate, here's an example of what an Eulergrid figure looks like:

[Image: example Eulergrid figure]

The green denotes the count for that subset. Yellow coloring, in the context of this figure, represents cell-specific cardinality, i.e. the counts that are unique to a single cell type or dataset.

As a way to read this, for example, 42% of the total element overlaps over these five cell types involve SKNSH in some way. Of all those overlaps, roughly half can be assigned to SKNSH alone.
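
Those two figures can be checked with a little arithmetic on the numbers from the eulergrid.pl example call in this answer. This sketch assumes that the fourth entries of `--setCardinalities` and `--ctsCounts` belong to SKNSH, since SKNSH is listed fourth in `--setNames`; the variable names are made up.

```r
# Values copied from the eulergrid.pl example call in this answer.
set_total   <- 689952   # --setTotal
sknsh_total <- 287731   # 4th entry of --setCardinalities (assumed to be SKNSH)
sknsh_only  <- 150753   # 4th entry of --ctsCounts (assumed to be SKNSH)

# Share of all elements that involve SKNSH in some way.
frac_involving_sknsh <- sknsh_total / set_total

# Share of the SKNSH elements that are found in SKNSH alone.
frac_sknsh_alone <- sknsh_only / sknsh_total

print(round(frac_involving_sknsh, 2))
print(round(frac_sknsh_alone, 2))
```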

Here's the R code for plotEulergrid.R:

	plotEulergrid <- function (plotTitle, offCellColor, onCellColor, setNames, setCardinalities, setTotal, setTotalWithout, outputFilename, showWholeSets, ctsCardinalities)
	{
	library(grDevices)
	library(gplots)

	showWholeSets <- as.numeric(showWholeSets)
	if (showWholeSets == 1) wholeColors <- c("red", "green", "blue", "darkgoldenrod2", "purple", "grey50", "gold3")

	setTotal <- as.numeric(setTotal)
	unadjSetTotal <- setTotal
	if (setTotal %% 2 == 1) setTotal <- setTotal + 1
	setTotalAnnotation <- "unique footprints"

	# double underscores in the title stand in for spaces
	plotTitle <- gsub("__", " ", plotTitle)

	setNameList <- strsplit(setNames, ",")
	lenNames <- length(setNameList[[1]])
	if (showWholeSets == 1) wholeSetInterval <- 1 / lenNames

	resolution <- 150
	outputFileWidth <- 8 * (lenNames / 2)
	outputFileHeight <- 12

	filenameComponents <- strsplit(outputFilename, "\\.")
	if (filenameComponents[[1]][length(filenameComponents[[1]])] == "ps") {
	postscript(outputFilename, height = outputFileHeight, width = outputFileWidth, paper = 'special', horizontal = F)
	} else {
	bitmap(file=outputFilename, type="png256", width=outputFileWidth, height=outputFileHeight, res=resolution)
	}

	setCardsList <- strsplit(setCardinalities, ",")
	setCardinalitiesList <- as.numeric(setCardsList[[1]])
	maxCardinality <- max(setCardinalitiesList)
	roundedMaxCardinality <- signif(maxCardinality, digits=4) + 5000

	ctsCardsList <- strsplit(ctsCardinalities, ",")
	ctsCardinalitiesShortList <- as.numeric(ctsCardsList[[1]])
	# always define the flag, so the checks below do not fail when the lengths differ
	showCtsCardinalities <- (lenNames == length(ctsCardsList[[1]]))
	if (showCtsCardinalities) {
	ctsCardinalitiesList <- setCardinalitiesList
	for (elementIndex in 1:length(setCardsList[[1]])) {
	if (elementIndex <= lenNames) ctsCardinalitiesList[elementIndex] <- ctsCardinalitiesShortList[elementIndex]
	else ctsCardinalitiesList[elementIndex] <- 0
	}
	}

	setIntersectionList <- NULL
	for (setIndex in 1:lenNames) {
	subset <- subsets(setNameList[[1]], setIndex)
	for (subsetIndex in 1:nrow(subset)) {
	str <- ""
	for (elementIndex in 1:ncol(subset)) {
	if (elementIndex == 1)
	str <- subset[subsetIndex, elementIndex]
	else if ((elementIndex > 1) && (elementIndex <= ncol(subset)))
	str <- paste(str, "^", subset[subsetIndex, elementIndex], sep=" ")
	}
	setIntersectionList <- append(setIntersectionList, str)
	}
	}

	boundSet <- cbind(setIntersectionList, setCardinalitiesList)
	boundSetPermutation <- order(as.numeric(boundSet[,2]), decreasing=F)
	sortedBoundSet <- boundSet[boundSetPermutation,]
	if (showCtsCardinalities) {
	ctsSet <- cbind(setIntersectionList, ctsCardinalitiesList)
	sortedCtsSet <- ctsSet[boundSetPermutation,]
	}
	lenSubsets <- length(setIntersectionList)

	# in grid, setNameList is the y-axis and boundSet|sortedBoundSet is the x-axis
	# in bars, height is value of boundSet|sortedBoundSet, proportional to setTotal value

	gridTop <- -0.2
	gridBottom <- -1.0
	gridLeft <- 0
	gridRight <- 1

	barTop <- 2.0
	barBottom <- 0
	barLeft <- 0
	barRight <- 1

	titleBottom <- barTop
	titleTop <- titleBottom + 0.5

	plotBottom <- gridBottom - 2.0
	plotTop <- titleTop
	plotLeft <- gridLeft - 0.2
	plotRight <- gridRight + 0.2

	allPlot <- plot(range(plotLeft, plotRight), range(plotBottom, plotTop), type="n", axes=F, main="", xlab="", ylab="", cex.main=1.0, mar=c(1,1,1,1))
	allPlotTitleText <- text(0.5, titleBottom + 0.25, labels=plotTitle, adj=0.5, font=2, cex=1.5, col="black")
	barPlotRect <- rect(barLeft, barBottom, barRight, barTop, col="gray80", border=NA)

	setTotal <- roundedMaxCardinality

	for (divIndex in 1:setTotal) {
	div <- divIndex * ((barTop - barBottom) / setTotal)
	if (divIndex == 1) firstDiv <- div/2

	x1 <- c(barLeft, barRight)
	x2 <- c(barBottom + div, barBottom + div)
	if (divIndex %% round(setTotal*0.333/2) == 0) horizGridPlotLines <- lines(x1, x2, col="white", lwd=0.5)
	}

	for (divIndex in 1:lenSubsets) {
	div <- divIndex * ((gridRight - gridLeft) / lenSubsets)
	if (divIndex == 1) firstDiv <- div/2

	x1 <- c(barLeft + div, barLeft + div)
	x2 <- c(barBottom, barTop)
	#vertBarPlotLines <- lines(x1, x2, col="white", lwd=0.5)
	}

	# too simplistic, need to apply inclusion-exclusion to get total elements that are unique to a "whole set"

	if (showWholeSets == 1) {
	wholeMatrix <- matrix(nrow=lenNames, ncol=2)
	for (divIndex in 1:lenNames) {
	if (divIndex == 1) prevDiv <- barLeft
	else prevDiv <- (divIndex - 1) * ((barRight - barLeft) / lenNames)
	div <- divIndex * ((barRight - barLeft) / lenNames)

	wholeSetTotal <- 0
	for (subsetIndex in 1:lenSubsets) {
	subsetLabel <- sortedBoundSet[subsetIndex, 1]
	subsetComponents <- strsplit(subsetLabel, "\\^")
	#print (paste(subsetLabel, length(subsetComponents[[1]]), sep=" "))
	setLabel <- setNameList[[1]][divIndex]
	if (length(grep(paste(setLabel," ",sep=""), paste(subsetLabel," ",sep=""))) > 0) {
	if (length(subsetComponents[[1]]) == 1) wholeSetTotal <- wholeSetTotal + as.numeric(sortedBoundSet[subsetIndex,2])
	else {
	if (length(subsetComponents[[1]]) %% 2 == 0) wholeSetTotal <- wholeSetTotal + as.numeric(sortedBoundSet[subsetIndex,2])
	else wholeSetTotal <- wholeSetTotal - as.numeric(sortedBoundSet[subsetIndex,2])
	}
	}
	}
	#quit("yes")
	#print (paste(prevDiv, div, setNameList[[1]][divIndex], wholeSetTotal, unadjSetTotal, wholeSetTotal/unadjSetTotal, sep=" "))

	wholeMatrix[divIndex, 1] = divIndex
	wholeMatrix[divIndex, 2] = wholeSetTotal / unadjSetTotal
	}
	reorderedWholeMatrix <- wholeMatrix[order(as.numeric(wholeMatrix[,2]), decreasing=F),]
	print (reorderedWholeMatrix)
	for (divIndex in 1:lenNames) {
	if (divIndex == 1) prevDiv <- barLeft
	else prevDiv <- (divIndex - 1) * ((barRight - barLeft) / lenNames)
	div <- divIndex * ((barRight - barLeft) / lenNames)
	xL <- prevDiv
	xR <- div
	yB <- barBottom
	yT <- reorderedWholeMatrix[divIndex,2] * barTop
	wholeSetColor <- wholeColors[reorderedWholeMatrix[divIndex,1]]
	print (paste("color[", reorderedWholeMatrix[divIndex,1], "] -", wholeSetColor, sep=" "))
	wholeSetRect <- rect(xL, yB, xR, yT, col=wholeColors[reorderedWholeMatrix[divIndex,1]], border="grey90")
	}
	}

	for (nameIndex in 1:lenNames) {
	nameDiv <- nameIndex * (gridTop - gridBottom) / lenNames
	if (nameIndex == 1) firstNameDiv <- nameDiv
	for (subsetIndex in 1:lenSubsets) {
	subsetDiv <- subsetIndex * (gridRight - gridLeft) / lenSubsets
	if (subsetIndex == 1) firstSubsetDiv <- subsetDiv
	subsetLabel <- sortedBoundSet[subsetIndex,1]
	nameLabel <- setNameList[[1]][nameIndex]

	# grid

	cellColor <- offCellColor
	print (paste(nameIndex, subsetIndex, nameLabel, subsetLabel, sep=" "))
	if (length(grep(paste(nameLabel," ",sep=""), paste(subsetLabel," ",sep=""))) > 0) cellColor <- onCellColor

	xL <- gridLeft + subsetDiv - firstSubsetDiv
	xR <- xL + firstSubsetDiv
	yB <- gridBottom + nameDiv - firstNameDiv
	yT <- yB + firstNameDiv

	setRect <- rect(xL, yB, xR, yT, col=cellColor, border=NA)

	# bar

	cellColor <- onCellColor
	subsetValue <- as.numeric(sortedBoundSet[subsetIndex,2])
	yB <- barBottom
	yT <- yB + barTop * (subsetValue / setTotal)

	setRect <- rect(xL, yB, xR, yT, col=cellColor, border="white", lwd=0.75)

	if (showCtsCardinalities) {
	ctsCellColor <- "yellow"
	ctsValue <- as.numeric(sortedCtsSet[subsetIndex,2])
	print (subsetIndex)
	print (ctsValue)
	if (ctsValue != 0) {
	yB <- barBottom
	yT <- yB + barTop * (ctsValue / setTotal)

	ctsRect <- rect(xL, yB, xR, yT, col=ctsCellColor, border="white", lwd=0.75)
	}
	}
	}
	}

	for (divIndex in 1:lenNames) {
	div <- divIndex * ((gridTop - gridBottom) / lenNames)
	if (divIndex == 1) firstDiv <- div/2

	x1 <- c(gridLeft, gridRight)
	x2 <- c(gridBottom + div, gridBottom + div)
	horizGridPlotLines <- lines(x1, x2, col="white", lwd=0.5)

	if (showWholeSets == 1) {
	horizGridPlotLabelLeft <- text(gridLeft - 0.05*(2/lenNames), (gridBottom + div) - firstDiv, labels=setNameList[[1]][divIndex], adj=1, cex=0.8, font=2, col=wholeColors[divIndex])
	horizGridPlotLabelRight <- text(gridRight + 0.05*(2/lenNames), (gridBottom + div) - firstDiv, labels=setNameList[[1]][divIndex], adj=0, cex=0.8, font=2, col=wholeColors[divIndex])
	}
	else {
	horizGridPlotLabelLeft <- text(gridLeft - 0.05*(2/lenNames), (gridBottom + div) - firstDiv, labels=setNameList[[1]][divIndex], adj=1, cex=0.8, font=2)
	horizGridPlotLabelRight <- text(gridRight + 0.05*(2/lenNames), (gridBottom + div) - firstDiv, labels=setNameList[[1]][divIndex], adj=0, cex=0.8, font=2)
	}
	}

	for (divIndex in 1:lenSubsets) {
	div <- divIndex * ((gridRight - gridLeft) / lenSubsets)
	if (divIndex == 1) firstDiv <- div/2

	x1 <- c(gridLeft + div, gridLeft + div)
	x2 <- c(gridBottom, gridTop)
	vertGridPlotLines <- lines(x1, x2, col="white", lwd=0.5)
	vertGridPlotLabel <- text(gridLeft + div - firstDiv, gridBottom - 0.1, labels=sortedBoundSet[divIndex,1], adj=0, cex=0.8*(4/lenNames), font=2, srt=270)
	}

	horizBarPlotLabel <- text(barLeft - 0.05*(2/lenNames), seq(barBottom,barTop,0.333), labels=as.character(round(setTotal*seq(barBottom,barTop,0.3333333)/2)), adj=1, cex=0.8, font=2)
	horizBarPlotTypeLabel <- text(barLeft - 0.085, (barTop - barBottom)/2.0, labels="fps count", adj=0.5, cex=0.8, font=2, srt=90)
	horizBarPlotPercentageLabel <- text(barRight + 0.05*(2/lenNames), seq(barBottom,barTop,0.333), labels=as.character(signif((setTotal/unadjSetTotal)*seq(barBottom,barTop,0.3333333)/2, digits=2)), adj=0, cex=0.8, font=2)
	horizBarPlotPercentageTypeLabel <- text(barRight + 0.075, (barTop - barBottom)/2.0, labels="fraction-of-total fps", adj=0.5, cex=0.8, font=2, srt=270)

	barPlotRect <- rect(barLeft, barBottom, barRight, barTop, col=NA, border="black")
	gridPlotRect <- rect(gridLeft, gridBottom, gridRight, gridTop, col=NA, border="black")
	dev.off()
	}

	subsets <- function(n, r) {
	if(is.numeric(n) & length(n) == 1) v <- 1:n else {
	v <- n
	n <- length(v)
	}
	subs <- function(n, r, v)
	if (r <= 0) NULL else
	if (r >= n) matrix(v[1:n], nrow = 1) else
	rbind(cbind(v[1], subs(n - 1, r - 1, v[-1])),subs(n - 1, r , v[-1]))
	subs(n, r, v)
	}

	#
	#
	#
	#
	# parse arguments
	#
	#
	#
	#
	args=(commandArgs())
	argsFlag=FALSE

	if(length(args)==0) {
	print ("Error: No arguments supplied!")
	quit("yes")
	} else {
	print(args)
	for(i in 1:length(args))
	{
	if (argsFlag)
	{
	eval(parse(text=args[[i]]))
	}
	if (! is.na(match("--args",args[i])))
	{
	argsFlag=TRUE
	}
	}
	}

	plotEulergrid(plotTitle, offCellColor, onCellColor, setNames, setCardinalities, setTotal, setTotalWithout, outputFilename, showWholeSets, ctsCounts)

Here's a Perl-based wrapper to this R script, called eulergrid.pl:

	#!/usr/bin/env perl

	use warnings;
	use strict;
	use Getopt::Long;

	# -------------------------------------------------------------------------------------------
	# options

	my ($plotTitle, $offCellColor, $onCellColor, $setNames, $setCardinalities, $setTotal, $setTotalWithout, $outputFilename, $showWholeSets, $rGraphScript, $ctsCounts);
	my $optResult = GetOptions ("plotTitle=s" => \$plotTitle, "offCellColor=s" => \$offCellColor, "onCellColor=s" => \$onCellColor, "setNames=s" => \$setNames, "setCardinalities=s" => \$setCardinalities, "setTotal=s" => \$setTotal, "setTotalWithout=s" => \$setTotalWithout, "outputFilename=s" => \$outputFilename, "showWholeSets=s" => \$showWholeSets, "rGraphScript=s" => \$rGraphScript, "ctsCounts=s" => \$ctsCounts);

	if (!$plotTitle) { die "specify --plotTitle=str\n"; }
	if (!$offCellColor) { $offCellColor = "red"; }
	if (!$onCellColor) { $onCellColor = "green"; }
	if (!$setNames) { die "specify --setNames=a1,a2,a3,...,aN\n"; }
	if (!$setCardinalities) { die "specify --setCardinalities=c1,c2,c3,...,cN,c1^c2,c1^c3,...,c1^c2^c3^...^cN\n"; }
	if (!$setTotal) { die "specify --setTotal=n\n"; }
	if (!$setTotalWithout) { $setTotalWithout = -1; }
	if (!$outputFilename) { die "specify --outputFilename=str.png|ps\n"; }
	if (!$ctsCounts) { die "specify --ctsCounts=cts1,cts2,...,ctsN\n"; }
	if (!$showWholeSets) { $showWholeSets = -1; } else { $showWholeSets = 1; }
	if (!$rGraphScript) { $rGraphScript = "/home/areynolds/proj/eulergrid/src/plotEulergrid.R"; }

	# -------------------------------------------------------------------------------------------
	# main

	my $rScriptSys = "R CMD BATCH --no-save --no-restore \"--args plotTitle=\\\"$plotTitle\\\" offCellColor=\\\"$offCellColor\\\" onCellColor=\\\"$onCellColor\\\" setNames=\\\"$setNames\\\" setCardinalities=\\\"$setCardinalities\\\" setTotal=\\\"$setTotal\\\" setTotalWithout=\\\"$setTotalWithout\\\" outputFilename=\\\"$outputFilename\\\" showWholeSets=\\\"$showWholeSets\\\" ctsCounts=\\\"$ctsCounts\\\"\" $rGraphScript runtime.log 2>&1";
	system ($rScriptSys) == 0 or die "R script failed $?";

Here's an example of calling the Perl wrapper on the command line, which was used to make the figure shown above:

$ ./eulergrid.pl \
    --setNames=GM06990,HepG2,K562,SKNSH,TH1 \
    --plotTitle="Footprint__overlaps__for__multiple__cell__lines\n(FDR__0.001)" \
    --setCardinalities=212350,233552,270586,287731,240701,93351,64049,89860,110579,62852,96806,89476,62075,64644,90129,30893,51178,53416,29083,32041,51033,28922,28279,48629,27407,22805,23548,39400,22418,21029,17172 \
    --setTotal=689952 \
    --outputFilename=results/footprintOverlaps/overlaps.fdr0p001.112409.png \
    --offCellColor="gray80" \
    --onCellColor="springgreen4" \
    --ctsCounts=65897,97624,173336,150753,91965

The option --ctsCounts refers to the yellow coloring I describe up above, representing "cell-type-specific" counts.
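
If you start from lists of names rather than precomputed counts, the three numeric options could be derived along these lines. This is only a sketch under my reading of the example above: `--setCardinalities` holds plain intersection sizes (singles, then pairs, triples, ...), `--ctsCounts` holds the number of elements found in exactly one set, and `--setTotal` is the number of distinct elements overall. The `sets` list is transcribed from the question's table, and the helper `intersection_size` is made up for illustration.

```r
# Set members transcribed from the question's example table.
sets <- list(
  A = c("NPAS4", "FAM120C", "PRODH"),
  B = c("NPAS4", "FAM120C", "PRODH"),
  C = c("NPAS4", "ROBO4", "PRODH", "AQP4"),
  D = c("AQP4")
)

# Size of the intersection of the named sets.
intersection_size <- function(set_names) {
  length(Reduce(intersect, sets[set_names]))
}

# Intersection sizes for singles, then pairs, triples, ... (2^n - 1 values).
set_cardinalities <- unlist(lapply(seq_along(sets), function(k) {
  combn(names(sets), k, FUN = intersection_size)
}))

# Elements that occur in exactly one set, counted per set.
all_members <- unlist(sets, use.names = FALSE)
seen_once <- names(which(table(all_members) == 1))
cts_counts <- vapply(sets, function(s) sum(s %in% seen_once), integer(1))

# Number of distinct elements across all sets.
set_total <- length(unique(all_members))

cat("--setCardinalities=", paste(set_cardinalities, collapse = ","), "\n", sep = "")
cat("--ctsCounts=", paste(cts_counts, collapse = ","), "\n", sep = "")
cat("--setTotal=", set_total, "\n", sep = "")
```

Note that the plotting script was written for counts in the hundreds of thousands, so a toy data set like this one is only useful for checking the order of the values, not for judging how the figure will look.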

Hopefully, this gives you some ideas or at least an understanding that Venn diagrams cannot always represent intersections between more than three sets (and sometimes not even between three sets).

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 12.2 years ago by Alex Reynolds 36k

This is awesome! I was going to suggest the R packages venneuler or VennDiagram from my past experience, but I plan to use your function now.

ADD REPLY • link 12.2 years ago by Josh Herr 5.8k

Impressive answer! I shall definitely give this a go as Venn diagrams are often not a satisfactory method of presentation; this gives a lot of detail.

ADD REPLY • link 12.2 years ago by Ian 6.1k

Hey Alex, really elegant! Could you please feed us with some sample data for your R script? I would like to play with this a bit. Thanks! Marcin

ADD REPLY • link 12.1 years ago by marcin.bazyliszek • 0

Thanks, Marcin. Please see the example provided to get a feel for how it works.

ADD REPLY • link 12.1 years ago by Alex Reynolds 36k

12.2 years ago

henryvuong ▴ 810

I like using Venny the most. It's very intuitive. I tried it with your example data.

[Image: Venny diagram of the example data]

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 12.2 years ago by henryvuong ▴ 810

It's very attractive, but note that it doesn't show counts in an area-proportional manner. For example, there are empty set intersections (with count 0) that have greater area than 2-hit set intersections. This is a problem with using Euler ("Venn") diagrams to present this type of data.

ADD REPLY • link 12.1 years ago by Alex Reynolds 36k

Sir, how can we add more lists in Venny? I tried, but the diagram was not generated. Please help me.

ADD REPLY • link 6.3 years ago by ar.silambu92 • 0

As far as I know, it can handle only four lists.

ADD REPLY • link 4.5 years ago by Bioinformatician_in_trouble ▴ 30

12.2 years ago

zx8754 12k

Change the following lines:

hw <- (A = 1)
hm <- (B = 1)
hr <- (C = 1)

to:

hw <- (A == 1)
hm <- (B == 1)
hr <- (C == 1)
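
The reason this matters: `(A = 1)` assigns the single value 1 instead of comparing each row, so `cbind()` produces a one-row matrix of 1s, which matches the lone count in row [8,] of the output in the question. Below is a minimal sketch of the corrected approach. The data frame is transcribed from the question; treating the blank cells as `NA` is an assumption about how `read.csv` parses the file, and the `is.na()` guard and the base-R tally are additions for illustration. The limma call only runs if the package is installed.

```r
# Example data transcribed from the question; blank CSV cells are assumed to be NA.
hsb2 <- data.frame(
  Names = c("NPAS4", "FAM120C", "ROBO4", "PRODH", "AQP4"),
  A = c(1, 1, NA, 1, NA),
  B = c(1, 1, NA, 1, NA),
  C = c(1, NA, 1, 1, 1)
)

# `==` compares every row; the is.na() guard turns blank cells into FALSE.
hw <- !is.na(hsb2$A) & hsb2$A == 1
hm <- !is.na(hsb2$B) & hsb2$B == 1
hr <- !is.na(hsb2$C) & hsb2$C == 1
c3 <- cbind(hw, hm, hr)

# Base-R tally of each membership pattern, written as hw/hm/hr digits.
pattern_counts <- table(apply(c3 * 1, 1, paste, collapse = ""))
print(pattern_counts)

# With limma installed, the same matrix can be passed to vennCounts().
if (requireNamespace("limma", quietly = TRUE)) {
  print(limma::vennCounts(c3))
}
```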

ADD COMMENT • link 12.2 years ago by zx8754 12k