How to calculate the overlap of peptides between different categories to create a Venn diagram
5.4 years ago
ishackm ▴ 110

Hi all,

I have the following dataset:

          TGEClass.known         TGEClass.uknown
1             GVVEVTHDLQK             GVVEVTHDLQK
2           LFYADHPFIFLVR           LFYADHPFIFLVR
3       SALQSINEWAAQTTDGK       SALQSINEWAAQTTDGK
4  AVLSAEQLRDEEVHAGLGELLR  AVLSAEQLRDEEVHAGLGELL

I would like to calculate the number of peptides that are present in both categories and the number that are present in only one.

I have tried the vennCounts function from limma, but it only accepts numeric values:

a <- vennCounts(c3)
a
     hw hm hr Counts
[1,]  0  0  0    113
[2,]  0  0  1     18
[3,]  0  1  0      8
[4,]  0  1  1      8
[5,]  1  0  0     12
[6,]  1  0  1      8
[7,]  1  1  0     11
[8,]  1  1  1     22

How can I convert my peptide dataset into a table like the one above so that I can make a Venn diagram? I have searched everywhere I can but have not found a solution.

I would really appreciate it if someone could help me solve this problem.

Many Thanks,

Ishack

ven diagram peptide venn count r • 4.3k views
5.4 years ago
AK ★ 2.2k

Hi Ishack,

Try this:

df <-
  data.frame(
    TGEClass.known = c(
      "GVVEVTHDLQK",
      "LFYADHPFIFLVR",
      "SALQSINEWAAQTTDGK",
      "AVLSAEQLRDEEVHAGLGELLR"
    ),
    TGEClass.uknown = c(
      "GVVEVTHDLQK",
      "LFYADHPFIFLVR",
      "SALQSINEWAAQTTDGK",
      "AVLSAEQLRDEEVHAGLGELL"
    )
  )


# Present in both TGEClass.known and TGEClass.uknown
length(intersect(df$TGEClass.known, df$TGEClass.uknown))

# TGEClass.known only
length(setdiff(df$TGEClass.known, df$TGEClass.uknown))

# TGEClass.uknown only
length(setdiff(df$TGEClass.uknown, df$TGEClass.known))

Hi SMK, thanks very much for your answer. How can I get a table like this automatically? It takes quite long to do it manually.

     hw hm hr Counts
[1,]  0  0  0    113
[2,]  0  0  1     18
[3,]  0  1  0      8
[4,]  0  1  1      8
[5,]  1  0  0     12
[6,]  1  0  1      8
[7,]  1  1  0     11
[8,]  1  1  1     22

What are hw, hm, and hr?


Sorry, those were meant to say TGEClass.uknown and TGEClass.known. Please ignore hw, hm and hr; I want a table like that for TGEClass.known and TGEClass.uknown.


Perhaps:

> df.venn <- data.frame(
+   TGEClass.known = c(1, 1, 0),
+   TGEClass.unknown = c(1, 0, 1),
+   Counts = c(length(
+     intersect(df$TGEClass.known, df$TGEClass.uknown)
+   ), length(
+     setdiff(df$TGEClass.known, df$TGEClass.uknown)
+   ), length(
+     setdiff(df$TGEClass.uknown, df$TGEClass.known)
+   ))
+ )
> df.venn
  TGEClass.known TGEClass.unknown Counts
1              1                1      3
2              1                0      1
3              0                1      1
> as.matrix(df.venn)
     TGEClass.known TGEClass.unknown Counts
[1,]              1                1      3
[2,]              1                0      1
[3,]              0                1      1

Hi SMK, thanks a lot, that's what I was looking for. Just one final question, if you don't mind.

I have a lot of data frames like the one above, but each one has a different number of categories and different category names. Would it be possible to run intersect and setdiff across all the columns automatically?


I got an idea from the venn function in gplots. Here it is demonstrated with two sets and with three sets:

> library(gplots)
> # Two sets
> df1 <-
+   data.frame(
+     TGEClass.known = c(
+       "GVVEVTHDLQK",
+       "LFYADHPFIFLVR",
+       "SALQSINEWAAQTTDGK",
+       "AVLSAEQLRDEEVHAGLGELLR"
+     ),
+     TGEClass.uknown = c(
+       "GVVEVTHDLQK",
+       "LFYADHPFIFLVR",
+       "SALQSINEWAAQTTDGK",
+       "AVLSAEQLRDEEVHAGLGELL"
+     )
+   )
> venn.tab1 <- venn(as.list(df1), show.plot = FALSE)
> attr(venn.tab1, "intersections") <- NULL
> attr(venn.tab1, "class") <- NULL
> print(venn.tab1)
   num TGEClass.known TGEClass.uknown
00   0              0               0
01   1              0               1
10   1              1               0
11   3              1               1
> # Three sets
> df2 <-
+   data.frame(
+     TGEClass.set1 = c(
+       "GVVEVTHDLQK",
+       "LFYADHPFIFLVR",
+       "SALQSINEWAAQTTDGK",
+       "AVLSAEQLRDEEVHAGLGELLR"
+     ),
+     TGEClass.set2 = c(
+       "GVVEVTHDLQK",
+       "LFYADHPFIFLVR",
+       "SALQSINEWAAQTTDGK",
+       "AVLSAEQLRDEEVHAGLGELL"
+     ),
+     TGEClass.set3 = c(
+       "GVVEVTHDLQK",
+       "LFYADHPFIFLVR",
+       "SALQSINEWAAQTTDGKK",
+       "AVLSAEQLRDEEVHAGLGELLR"
+     )
+   )
> venn.tab2 <- venn(as.list(df2), show.plot = FALSE)
> attr(venn.tab2, "intersections") <- NULL
> attr(venn.tab2, "class") <- NULL
> print(venn.tab2)
    num TGEClass.set1 TGEClass.set2 TGEClass.set3
000   0             0             0             0
001   1             0             0             1
010   1             0             1             0
011   0             0             1             1
100   0             1             0             0
101   1             1             0             1
110   1             1             1             0
111   2             1             1             1
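
For data frames with many columns, a pairwise overlap matrix may be easier to read than a full Venn table. The following is only a base R sketch, not something from gplots or limma: pairwise_overlap is a hypothetical helper name, and it assumes that blank or NA cells mean "no peptide" and that peptides are compared as exact strings.

```r
# Hypothetical helper (not part of gplots or limma): number of shared
# peptides for every pair of columns in a data frame.
pairwise_overlap <- function(df) {
  # Unique, non-blank peptides per column
  sets <- lapply(df, function(x) {
    x <- as.character(x)
    unique(x[!is.na(x) & x != ""])
  })
  n <- length(sets)
  m <- matrix(0L, nrow = n, ncol = n,
              dimnames = list(names(sets), names(sets)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # Size of the intersection of column i and column j
      m[i, j] <- length(intersect(sets[[i]], sets[[j]]))
    }
  }
  m
}

# Same three-set example as above
df2 <- data.frame(
  TGEClass.set1 = c("GVVEVTHDLQK", "LFYADHPFIFLVR",
                    "SALQSINEWAAQTTDGK", "AVLSAEQLRDEEVHAGLGELLR"),
  TGEClass.set2 = c("GVVEVTHDLQK", "LFYADHPFIFLVR",
                    "SALQSINEWAAQTTDGK", "AVLSAEQLRDEEVHAGLGELL"),
  TGEClass.set3 = c("GVVEVTHDLQK", "LFYADHPFIFLVR",
                    "SALQSINEWAAQTTDGKK", "AVLSAEQLRDEEVHAGLGELLR"),
  stringsAsFactors = FALSE
)

overlap <- pairwise_overlap(df2)
print(overlap)
```

The diagonal holds the number of unique peptides in each column, and the off-diagonal cells hold the pairwise intersections. Because the helper only loops over column names, it could be applied to a list of data frames with lapply regardless of how many categories each one has.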

Hi SMK, unfortunately I have just found that I can't draw a Venn diagram for more than 5 categories.

Can you help me create a df that looks like this please?

TGE-Class     Count
T1              1
T2              1
Both            6

Thanks very much

> library(gplots)
> df <-
+   data.frame(
+     T1 = c(
+       "GVVEVTHDLQK",
+       "LFYADHPFIFLVR",
+       "SALQSINEWAAQTTDGK",
+       "SALQSINEWAAQTTDGLL",
+       "SALQSINEWAAQTTDGTT",
+       "SALQSINEWAAQTTDGQQ",
+       "AVLSAEQLRDEEVHAGLGELLR"
+     ),
+     T2 = c(
+       "GVVEVTHDLQK",
+       "LFYADHPFIFLVR",
+       "SALQSINEWAAQTTDGK",
+       "SALQSINEWAAQTTDGLL",
+       "SALQSINEWAAQTTDGTT",
+       "SALQSINEWAAQTTDGQQ",
+       "AVLSAEQLRDEEVHAGLGELL"
+     )
+   )
> venn.tab <- venn(as.list(df), show.plot = FALSE)
> t(t(unlist(lapply(attr(venn.tab, "intersections"), length))))
      [,1]
T1       1
T2       1
T1:T2    6
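
If the five-set limit is a problem, a similar count table can be sketched in base R without gplots. This is an illustration under assumptions: region_counts is a hypothetical helper name, the output column names TGE.Class and Count are chosen to resemble the requested table, and blank or NA cells are treated as missing.

```r
# Hypothetical helper: label each peptide by the exact combination of
# categories it occurs in, then count peptides per combination.
region_counts <- function(df) {
  # Unique, non-blank peptides per column
  sets <- lapply(df, function(x) {
    x <- as.character(x)
    unique(x[!is.na(x) & x != ""])
  })
  long <- stack(sets)  # long format with columns: values, ind
  # For every peptide, join the names of the categories that contain it
  region <- tapply(as.character(long$ind), long$values,
                   paste, collapse = ":")
  counts <- table(region)
  data.frame(TGE.Class = names(counts),
             Count = as.integer(counts),
             stringsAsFactors = FALSE)
}

df <- data.frame(
  T1 = c("GVVEVTHDLQK", "LFYADHPFIFLVR", "SALQSINEWAAQTTDGK",
         "SALQSINEWAAQTTDGLL", "SALQSINEWAAQTTDGTT",
         "SALQSINEWAAQTTDGQQ", "AVLSAEQLRDEEVHAGLGELLR"),
  T2 = c("GVVEVTHDLQK", "LFYADHPFIFLVR", "SALQSINEWAAQTTDGK",
         "SALQSINEWAAQTTDGLL", "SALQSINEWAAQTTDGTT",
         "SALQSINEWAAQTTDGQQ", "AVLSAEQLRDEEVHAGLGELL"),
  stringsAsFactors = FALSE
)

res <- region_counts(df)
print(res)
```

Only combinations that actually occur are listed, so the table stays short even with many categories. A row labelled T1:T2 corresponds to "Both" in the requested layout.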

Hi SMK,

Thanks very much for your quick response, I have been trying all day to fix this. You are a life saver!


Hi SMK, sorry for the late follow-up. Is there a way to see the number of unique peptides in each category when there are blanks in the columns, please?

Unfortunately, the length code counts the blank cells as unique peptides.


Hi ishackm,

You can remove the empty elements from the list before you call venn:

l <- as.list(df)
# Drop blank cells (and NAs) so they are not counted as peptides
l <- lapply(l, function(x) x[!is.na(x) & x != ""])
venn.tab <- venn(l, show.plot = FALSE)
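
Blank cells can also be turned into NA when the file is read, which makes them easy to drop later. The sketch below uses made-up CSV content (csv_text is purely illustrative) and assumes a comma-separated file with one peptide column per category.

```r
# Made-up CSV content with blank cells, for illustration only
csv_text <- "T1,T2
GVVEVTHDLQK,GVVEVTHDLQK
LFYADHPFIFLVR,
,SALQSINEWAAQTTDGK"

# Read empty cells as NA instead of ""
df <- read.csv(text = csv_text, na.strings = c("", "NA"),
               stringsAsFactors = FALSE)

# Unique, non-missing peptides per column
sets <- lapply(df, function(x) {
  x <- trimws(as.character(x))
  unique(x[!is.na(x) & x != ""])
})

n_unique <- lengths(sets)                         # peptides per category
n_shared <- length(intersect(sets$T1, sets$T2))   # peptides in both
print(n_unique)
print(n_shared)
```

The cleaned list sets can then be passed to venn in place of as.list(df).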

Hi SMK, thank you again for your quick response. Much appreciated.


Cool, glad it helps!

5.4 years ago
zx8754 12k

Convert the peptides to a TRUE/FALSE membership table, then use vennCounts from limma:

# example data
df <- data.frame(
  TGEClass.known = c(
    "GVVEVTHDLQK",
    "LFYADHPFIFLVR",
    "SALQSINEWAAQTTDGK",
    "AVLSAEQLRDEEVHAGLGELLR"
  ),
  TGEClass.uknown = c(
    "GVVEVTHDLQK",
    "LFYADHPFIFLVR",
    "SALQSINEWAAQTTDGK",
    "AVLSAEQLRDEEVHAGLGELL"
  ), stringsAsFactors = FALSE
)

library(data.table)

x <- dcast(cbind(stack(as.list(df)), x = TRUE), 
           values ~ ind, 
           value.var = "x", 
           fill = FALSE)[, -1]    

limma::vennCounts(x)
#   TGEClass.known TGEClass.uknown Counts
# 1              0               0      0
# 2              0               1      1
# 3              1               0      1
# 4              1               1      3

limma::vennDiagram(x)

Hi, I ran the code you gave me but it is giving me an error:

    df = read.csv("FN1.csv")
    FN1 = as.vector(df)



    library(data.table)

    x <- dcast(cbind(stack(as.list(FN1)), x = TRUE), 
               values ~ ind, 
               value.var = "x", 
               fill = FALSE)[, -1]    
    limma::vennCounts(x)

Error in stack.default(as.list(FN1)) : 
  at least one vector element is required

What am I doing wrong here, please?


You need to share your example CSV: FN1.csv, so that we can reproduce the problem.


Sorry for the late reply,

this is the csv I am using:

T2  T3
QHDMGHMMR   QHDMGHMMR
RPGGEPSPEGTTGQSYNQYSQR  RPGGEPSPEGTTGQSYNQYSQR
KTDELPQLVTLPHPNLHGPEILDVPSTVQK  KTDELPQLVTLPHPNLHGPEILDVPSTVQK
HRPRPYPPNVGEEIQIGHIPR   HRPRPYPPNVGEEIQIGHIPR
QHDMGHMMR   QHDMGHMMR
DQCIVDDITYNVNDTFHK  DQCIVDDITYNVNDTFHK
YYRITYGETGGNSPVQEFTVPGSK    YYRITYGETGGNSPVQEFTVPGSK

The code:

test = read.csv("test.csv", stringsAsFactors = FALSE)


library(gplots)
# example data



library(data.table)

x <- dcast(cbind(stack(as.list(df2)), x = TRUE), 
           values ~ ind, 
           value.var = "x", 
           fill = FALSE)[, -1]    

limma::vennCounts(x)
limma::vennDiagram(x)

The error:

Aggregation function missing: defaulting to length
Error in vapply(indices, fun, .default) : values must be type 'logical',
 but FUN(X[[1]]) result is type 'integer'

How can I fix this please?


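Yes. TRUE/FALSE does not work here because your columns contain a repeated peptide (QHDMGHMMR appears twice), so dcast falls back to length as the aggregation function and returns integer counts, which clash with the logical fill value. Replace TRUE/FALSE with 1/0 in dcast, as in the example below: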

# example data
df <- read.table(text = "
T2  T3
QHDMGHMMR   QHDMGHMMR
RPGGEPSPEGTTGQSYNQYSQR  RPGGEPSPEGTTGQSYNQYSQR
KTDELPQLVTLPHPNLHGPEILDVPSTVQK  KTDELPQLVTLPHPNLHGPEILDVPSTVQK
HRPRPYPPNVGEEIQIGHIPR   HRPRPYPPNVGEEIQIGHIPR
QHDMGHMMR   QHDMGHMMR
DQCIVDDITYNVNDTFHK  DQCIVDDITYNVNDTFHK
YYRITYGETGGNSPVQEFTVPGSK    YYRITYGETGGNSPVQEFTVPGSK", stringsAsFactors = FALSE, header = TRUE)

library(data.table)

x <- dcast(cbind(stack(as.list(df)), x = 1), 
           values ~ ind, 
           value.var = "x", 
           fill = 0)[, -1]

limma::vennCounts(x)

#   T2 T3 Counts
# 1  0  0      0
# 2  0  1      0
# 3  1  0      0
# 4  1  1      6
# attr(,"class")
# [1] "VennCounts"
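
Another way to avoid the "Aggregation function missing" message is to de-duplicate each column before building the membership table. The following base R sketch does this without dcast; the variable names sets, long and membership are illustrative, and passing the result on to limma::vennCounts is an assumption based on the TRUE/FALSE usage shown earlier in this thread.

```r
# Same data as above, including the repeated peptide QHDMGHMMR
peptides <- c("QHDMGHMMR", "RPGGEPSPEGTTGQSYNQYSQR",
              "KTDELPQLVTLPHPNLHGPEILDVPSTVQK", "HRPRPYPPNVGEEIQIGHIPR",
              "QHDMGHMMR", "DQCIVDDITYNVNDTFHK",
              "YYRITYGETGGNSPVQEFTVPGSK")
df <- data.frame(T2 = peptides, T3 = peptides, stringsAsFactors = FALSE)

# De-duplicate each column so every peptide/category pair occurs once
sets <- lapply(df, unique)

# Long format (values, ind), then a peptide-by-category logical matrix
long <- stack(sets)
membership <- unclass(table(long$values, long$ind)) > 0

print(colSums(membership))  # unique peptides per category
# membership could then be passed to limma::vennCounts(membership)
```

Because each peptide appears at most once per column after unique, no aggregation is needed and the table stays logical.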

Thanks very much for your quick response


Hi, unfortunately I have just found that I can't draw a Venn diagram for more than 5 categories.

Can you help me create a df that looks like this please?

TGE-Class     Count
T1              1
T2              1
Both            6

Thanks very much

5.4 years ago

If you are looking for exact matches (so no peptide can be a subset of another), you can use your lists as they are as input for DrawVenn, an online tool for drawing Venn diagrams.


Thanks to all for your help and support. This is exactly what I was looking for.
