Question: How to calculate the overlap of peptides between different categories to create a Venn diagram
ishackm wrote:

Hi all,

I have the following dataset:

``````  TGEClass.known         TGEClass.uknown
1             GVVEVTHDLQK             GVVEVTHDLQK
3       SALQSINEWAAQTTDGK       SALQSINEWAAQTTDGK
4  AVLSAEQLRDEEVHAGLGELLR  AVLSAEQLRDEEVHAGLGELL
``````

I would like to calculate the number of peptides that are present in both categories and the number that are present in only one.

I have tried the `vennCounts` function from limma, but it only accepts numerical values:

``````a <- vennCounts(c3)
a
     hw hm hr Counts
[1,]  0  0  0    113
[2,]  0  0  1     18
[3,]  0  1  0      8
[4,]  0  1  1      8
[5,]  1  0  0     12
[6,]  1  0  1      8
[7,]  1  1  0     11
[8,]  1  1  1     22
``````

How can I convert my peptide dataset into a table like the one above so that I can make a Venn diagram? I have searched everywhere I can but have not found a solution.

I would really appreciate it if someone could help me solve this problem.

Many Thanks,

Ishack

SMK wrote:

Hi Ishack,

Try this:

``````df <-
data.frame(
TGEClass.known = c(
"GVVEVTHDLQK",
"SALQSINEWAAQTTDGK",
"AVLSAEQLRDEEVHAGLGELLR"
),
TGEClass.uknown = c(
"GVVEVTHDLQK",
"SALQSINEWAAQTTDGK",
"AVLSAEQLRDEEVHAGLGELL"
)
)

# Present in both TGEClass.known and TGEClass.uknown
length(intersect(df\$TGEClass.known, df\$TGEClass.uknown))

# TGEClass.known only
length(setdiff(df\$TGEClass.known, df\$TGEClass.uknown))

# TGEClass.uknown only
length(setdiff(df\$TGEClass.uknown, df\$TGEClass.known))
``````

Hi SMK, thanks very much for your answer. How can I get a table like this automatically? Doing it manually takes quite long.

``````hw hm hr Counts
[1,]  0  0  0    113
[2,]  0  0  1     18
[3,]  0  1  0      8
[4,]  0  1  1      8
[5,]  1  0  0     12
[6,]  1  0  1      8
[7,]  1  1  0     11
[8,]  1  1  1     22
``````

What are `hw`, `hm`, and `hr`?

Sorry, those were meant to say TGEClass.uknown and TGEClass.known. Please ignore hw, hm and hr; I want a table like that for TGEClass.known and TGEClass.uknown.

Perhaps:

``````> df.venn <- data.frame(
+   TGEClass.known = c(1, 1, 0),
+   TGEClass.unknown = c(1, 0, 1),
+   Counts = c(length(
+     intersect(df\$TGEClass.known, df\$TGEClass.uknown)
+   ), length(
+     setdiff(df\$TGEClass.known, df\$TGEClass.uknown)
+   ), length(
+     setdiff(df\$TGEClass.uknown, df\$TGEClass.known)
+   ))
+ )
> df.venn
  TGEClass.known TGEClass.unknown Counts
1              1                1      3
2              1                0      1
3              0                1      1
> as.matrix(df.venn)
     TGEClass.known TGEClass.unknown Counts
[1,]              1                1      3
[2,]              1                0      1
[3,]              0                1      1
``````

Hi SMK, thanks a lot, that's what I was looking for. Just one final question, if you don't mind.

I have a lot of data frames like the one above, but each one has a different number of categories and different category names. Would it be possible to run `intersect` and `setdiff` across all the columns automatically?

Got an idea from the function: `venn`, here demonstrating 2 sets and 3 sets:

``````> library(gplots)
> # Two sets
> df1 <-
+   data.frame(
+     TGEClass.known = c(
+       "GVVEVTHDLQK",
+       "SALQSINEWAAQTTDGK",
+       "AVLSAEQLRDEEVHAGLGELLR"
+     ),
+     TGEClass.uknown = c(
+       "GVVEVTHDLQK",
+       "SALQSINEWAAQTTDGK",
+       "AVLSAEQLRDEEVHAGLGELL"
+     )
+   )
> venn.tab1 <- venn(as.list(df1), show.plot = FALSE)
> attr(venn.tab1, "intersections") <- NULL
> attr(venn.tab1, "class") <- NULL
> print(venn.tab1)
   num TGEClass.known TGEClass.uknown
00   0              0               0
01   1              0               1
10   1              1               0
11   3              1               1
> # Three sets
> df2 <-
+   data.frame(
+     TGEClass.set1 = c(
+       "GVVEVTHDLQK",
+       "SALQSINEWAAQTTDGK",
+       "AVLSAEQLRDEEVHAGLGELLR"
+     ),
+     TGEClass.set2 = c(
+       "GVVEVTHDLQK",
+       "SALQSINEWAAQTTDGK",
+       "AVLSAEQLRDEEVHAGLGELL"
+     ),
+     TGEClass.set3 = c(
+       "GVVEVTHDLQK",
+       "SALQSINEWAAQTTDGKK",
+       "AVLSAEQLRDEEVHAGLGELLR"
+     )
+   )
> venn.tab2 <- venn(as.list(df2), show.plot = FALSE)
> attr(venn.tab2, "intersections") <- NULL
> attr(venn.tab2, "class") <- NULL
> print(venn.tab2)
    num TGEClass.set1 TGEClass.set2 TGEClass.set3
000   0             0             0             0
001   1             0             0             1
010   1             0             1             0
011   0             0             1             1
100   0             1             0             0
101   1             1             0             1
110   1             1             1             0
111   2             1             1             1
``````

Hi SMK, Unfortunately, I found just now that I can't do a Venn diagram for more than 5 categories.

Can you help me create a df that looks like this please?

``````TGE-Class     Count
T1              1
T2              1
Both            6
``````

Thanks very much

``````> library(gplots)
> df <-
+   data.frame(
+     T1 = c(
+       "GVVEVTHDLQK",
+       "SALQSINEWAAQTTDGK",
+       "SALQSINEWAAQTTDGLL",
+       "SALQSINEWAAQTTDGTT",
+       "SALQSINEWAAQTTDGQQ",
+       "AVLSAEQLRDEEVHAGLGELLR"
+     ),
+     T2 = c(
+       "GVVEVTHDLQK",
+       "SALQSINEWAAQTTDGK",
+       "SALQSINEWAAQTTDGLL",
+       "SALQSINEWAAQTTDGTT",
+       "SALQSINEWAAQTTDGQQ",
+       "AVLSAEQLRDEEVHAGLGELL"
+     )
+   )
> venn.tab <- venn(as.list(df), show.plot = FALSE)
> t(t(unlist(lapply(attr(venn.tab, "intersections"), length))))
      [,1]
T1       1
T2       1
T1:T2    6
``````
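For data frames with an arbitrary number of category columns, the same per-combination counts can also be derived in base R without gplots. The following is a minimal sketch, not taken from the answers above: `count_overlaps`, the `TGE.Class` column name, and the short peptide strings are hypothetical and only for illustration. It assumes each column holds peptide strings and that blank cells and `NA` should be ignored.

```r
# Hypothetical helper: count unique peptides per combination of categories.
count_overlaps <- function(df) {
  # One set of unique, non-empty peptides per column
  sets <- lapply(df, function(x) {
    x <- as.character(x)
    unique(x[!is.na(x) & nzchar(x)])
  })
  universe <- unique(unlist(sets, use.names = FALSE))
  # Logical membership matrix: rows = peptides, columns = categories
  member <- vapply(sets, function(s) universe %in% s,
                   logical(length(universe)))
  member <- matrix(member, nrow = length(universe), ncol = length(sets),
                   dimnames = list(universe, names(sets)))
  # Label each peptide by the categories that contain it, e.g. "T1:T2"
  label <- apply(member, 1, function(r) {
    paste(colnames(member)[r], collapse = ":")
  })
  counts <- table(label)
  data.frame(TGE.Class = names(counts), Count = as.vector(counts),
             stringsAsFactors = FALSE)
}

# Made-up toy data with three categories and two blank cells
toy <- data.frame(
  T1 = c("AAAK", "CCCK", "GGGK", "DDDR"),
  T2 = c("AAAK", "CCCK", "GGGK", "EEER"),
  T3 = c("AAAK", "FFFK", "", ""),
  stringsAsFactors = FALSE
)
res <- count_overlaps(toy)
print(res)
```

For the toy data this reports two peptides for `T1:T2` and one for every other combination that occurs. Because it only tabulates labels, it is not limited to five categories.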

Hi SMK,

Thanks very much for your quick response, I have been trying all day to fix this. You are a life saver!

Hi SMK, sorry for the late reply. Is there a way to see the number of unique peptides in each category when there are blank cells in the columns?

Unfortunately, the `length` code counts the blank cells as unique peptides.

Hi ishackm,

You can remove the empty elements from the list before you call `venn`:

``````l <- as.list(df)
l <- lapply(l, function(x) { x[!x == ""] })
venn.tab <- venn(l, show.plot = FALSE)
``````
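Blank cells read from a CSV may also show up as `NA` or carry stray spaces, depending on how the file was written and read. A slightly more defensive variant of the filter above, as a sketch only: `clean_sets` is a hypothetical helper name and the example data are made up.

```r
# Hypothetical helper: turn data-frame columns into clean peptide sets.
clean_sets <- function(df) {
  lapply(as.list(df), function(x) {
    x <- trimws(as.character(x))      # drop surrounding whitespace
    unique(x[!is.na(x) & x != ""])    # drop NA and blank cells, then duplicates
  })
}

# Made-up data: T2 has one blank cell and one NA
toy <- data.frame(
  T1 = c("AAAK", "CCCK", "DDDR"),
  T2 = c("AAAK", " ", NA),
  stringsAsFactors = FALSE
)
l <- clean_sets(toy)
print(lengths(l))
```

The resulting list `l` can then be passed to `venn(l, show.plot = FALSE)` in the same way as above.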

Hi SMK, thank you again for your quick response. Much appreciated.

zx8754 wrote:

Convert to TRUE/FALSE, then use limma venn counts:

``````# example data
df <-data.frame(
TGEClass.known = c(
"GVVEVTHDLQK",
"SALQSINEWAAQTTDGK",
"AVLSAEQLRDEEVHAGLGELLR"
),
TGEClass.uknown = c(
"GVVEVTHDLQK",
"SALQSINEWAAQTTDGK",
"AVLSAEQLRDEEVHAGLGELL"
), stringsAsFactors = FALSE
)

library(data.table)

x <- dcast(cbind(stack(as.list(df)), x = TRUE),
values ~ ind,
value.var = "x",
fill = FALSE)[, -1]

limma::vennCounts(x)
#   TGEClass.known TGEClass.uknown Counts
# 1              0               0      0
# 2              0               1      1
# 3              1               0      1
# 4              1               1      3

limma::vennDiagram(x)
``````

Hi, I ran the code you gave me but it is giving me an error:

``````    df = read.csv("FN1.csv")
FN1 = as.vector(df)

library(data.table)

x <- dcast(cbind(stack(as.list(FN1)), x = TRUE),
values ~ ind,
value.var = "x",
fill = FALSE)[, -1]
limma::vennCounts

(x)

Error in stack.default(as.list(FN1)) :
at least one vector element is required
``````

What am I doing wrong here, please?

You need to share your example CSV: `FN1.csv`, so that we can reproduce the problem.
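One plausible cause of this particular message, offered as a guess without having seen the file: in R versions before 4.0, `read.csv` converts text columns to factors by default, and `stack` only keeps list elements for which `is.vector` is true, so a data frame made up entirely of factor columns leaves it with nothing to stack. The sketch below reproduces the situation with made-up data (the names `fac`, `chr`, and `long` are illustrative) and shows a conversion to character that avoids it.

```r
# Made-up data with factor columns, as read.csv produced by default before R 4.0
fac <- data.frame(
  T2 = c("AAAK", "CCCK"),
  T3 = c("AAAK", "DDDR"),
  stringsAsFactors = TRUE
)

# stack() ignores non-vector elements; with only factors, nothing is left
msg <- tryCatch(
  {
    stack(as.list(fac))
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Convert every column to character while keeping the data-frame structure
chr <- fac
chr[] <- lapply(chr, as.character)
long <- stack(as.list(chr))
print(long)
```

Reading the file with `read.csv("FN1.csv", stringsAsFactors = FALSE)` should have the same effect as the conversion step.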

This is the CSV I am using:

``````T2  T3
QHDMGHMMR   QHDMGHMMR
RPGGEPSPEGTTGQSYNQYSQR  RPGGEPSPEGTTGQSYNQYSQR
KTDELPQLVTLPHPNLHGPEILDVPSTVQK  KTDELPQLVTLPHPNLHGPEILDVPSTVQK
HRPRPYPPNVGEEIQIGHIPR   HRPRPYPPNVGEEIQIGHIPR
QHDMGHMMR   QHDMGHMMR
DQCIVDDITYNVNDTFHK  DQCIVDDITYNVNDTFHK
YYRITYGETGGNSPVQEFTVPGSK    YYRITYGETGGNSPVQEFTVPGSK
``````

The code:

``````test = read.csv("test.csv", stringsAsFactors = FALSE)

library(gplots)
# example data

library(data.table)

x <- dcast(cbind(stack(as.list(df2)), x = TRUE),
values ~ ind,
value.var = "x",
fill = FALSE)[, -1]

limma::vennCounts(x)
limma::vennDiagram(x)
``````

The error:

``````Aggregation function missing: defaulting to length
Error in vapply(indices, fun, .default) : values must be type 'logical',
but FUN(X[[1]]) result is type 'integer'
``````

How can I fix this please?

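Yes, TRUE/FALSE does not work with your data, most likely because the "Aggregation function missing" message means dcast is counting rows (QHDMGHMMR appears twice in each column), which yields integers and clashes with the logical fill. Replace TRUE/FALSE with 1/0 in dcast; see the example below: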

``````# example data
df <- read.table(text = "
T2  T3
QHDMGHMMR   QHDMGHMMR
RPGGEPSPEGTTGQSYNQYSQR  RPGGEPSPEGTTGQSYNQYSQR
KTDELPQLVTLPHPNLHGPEILDVPSTVQK  KTDELPQLVTLPHPNLHGPEILDVPSTVQK
HRPRPYPPNVGEEIQIGHIPR   HRPRPYPPNVGEEIQIGHIPR
QHDMGHMMR   QHDMGHMMR
DQCIVDDITYNVNDTFHK  DQCIVDDITYNVNDTFHK
YYRITYGETGGNSPVQEFTVPGSK    YYRITYGETGGNSPVQEFTVPGSK", stringsAsFactors = FALSE, header = TRUE)

library(data.table)

x <- dcast(cbind(stack(as.list(df)), x = 1),
values ~ ind,
value.var = "x",
fill = 0)[, -1]

limma::vennCounts(x)

#   T2 T3 Counts
# 1  0  0      0
# 2  0  1      0
# 3  1  0      0
# 4  1  1      6
# attr(,"class")
# [1] "VennCounts"
``````
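An alternative that sidesteps the aggregation step entirely, sketched in base R only: cross-tabulate the stacked values and test for presence, so a peptide listed several times in a column still counts once. The variable names and the short peptide strings below are made up for illustration.

```r
# Made-up example with a duplicated peptide in each column
df <- data.frame(
  T2 = c("AAAK", "CCCK", "AAAK", "DDDR"),
  T3 = c("AAAK", "CCCK", "AAAK", "EEER"),
  stringsAsFactors = FALSE
)

long <- stack(as.list(df))            # columns: values (peptide), ind (category)
tab  <- table(long$values, long$ind)  # occurrences per peptide and category
x    <- (unclass(tab) > 0) * 1L       # 0/1 presence matrix, duplicates collapsed

print(x)
print(colSums(x))                     # unique peptides per category
print(sum(rowSums(x) == ncol(x)))     # peptides present in every category
```

A 0/1 matrix of this shape is the kind of input `limma::vennCounts` is documented to accept, so `limma::vennCounts(x)` should work on it if limma is installed.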

Thanks very much for your quick response

lieven.sterck wrote:

If you are looking for exact matches (so no peptide can be a subset of another), you can use your lists as they are as input for DrawVenn, an online tool for drawing Venn diagrams.