getting intersect between two lists of genes in R
1
2
Entering edit mode
6.6 years ago
Angel ★ 4.1k

friends,

i have a microarray dataset AGI IDs in row and samples in column and txt file of transcription factor (AGI IDs), my supervisor asked me to find and extract the expression of only genes existed in my transcription factor list in another word i should extract the rows those are transcription factor. as already i need your help

thank you

myposts R gene • 28k views
4
Entering edit mode
6.6 years ago

Take a look at ?intersect.

> v1 <- c("GATA1", "CTCF", "HOXB1")
> v2 <- c("C3", "ATF", "SOX1", "CTCF", "OCT3")
> intersect(v1, v2)
[1] "CTCF"
> 

If you are trying to build a subset of a data frame, you can use %in%. See the examples here.

0
Entering edit mode

sorry Alex,

I did like below but the input file contains the same rows as my microarray dataset while I need only the rows shared between two files

> setwd("E:/Affy data Col-0 priming/detailed_results")
> v1 <- c(mycounts)
> v2 <- c(S)
> intersect(v1, v2)
list()
> list()
list()
> write.table(mycounts, file = "final_list.txt", sep = "\t", col.names = TRUE, row.names = FALSE)
>

0
Entering edit mode

What you are trying to pass to intersect is not what that function takes as input. Run ?intersect to see what it takes as input. Then look again at tables mycounts and S and decide what rows or columns you need to extract from those tables to pass to intersect.

0
Entering edit mode

thank you, but mycounts has 22000 rows and S has 100 rows, then I need 100 rows from mycounts which the same with S

0
Entering edit mode

You need to extract whichever row or column from mycounts contains your genes of interest. You need to do the same for S. Then pass those to intersect.

0
Entering edit mode

sorry, I donno how to extract...then I should extract 22000 rows from one file and 100 rows from another one then pass those to intersect.

0
Entering edit mode

Perhaps you are trying to take a subset of one table, based on values in a second table. It's not clear to me what you are trying to do. Maybe you need to use the %in% operator. Perhaps read the edit to my answer and go through the linked examples, as well as search online about this operator.

0
Entering edit mode

Alex,

I am only trying to extract 100 genes from 22000 genes. 100 genes are in S file and 22000 genes are in my microarray dataset

anyway thank you

1
Entering edit mode

Alex gave you the correct hint. You only need to select the corresponding column IDs:

intersect(v1$id, v2$probe_id)


In case the columns are named id and probe_id

0
Entering edit mode

thank you

I did like below

mycounts <- read.table("RMA.txt", sep="\t", header=TRUE)
> v1 <- c(mycounts)
> v2 <- c(S)
> intersect(v1$identifier, v2$IDs)
NULL
>


what is the NULL??

1
Entering edit mode

Try to leave out the c(mycounts), this will mess up your data...

intersect(mycounts$identifier,S$IDs)

0
Entering edit mode

thank you,

> mycounts <- read.table("RMA.txt", sep="\t", header=TRUE)
> intersect(mycounts$identifier,S$IDs)
NULL
> write.table(mycounts, file = "RMA111.txt", dec = ".", sep = "\t", quote = FALSE)


I open the output file but it has 21000 row while I need only the 100 genes which I name with S variable

2
Entering edit mode

In case of reading S, you state header=F. Which means you have S$V1 representing the first column. Please have a look at your data (e.g. with head(S)) and what data type is used. ADD REPLY 0 Entering edit mode sorry, the result was the same > head(S) IDs 1 AT1G53540 2 AT4G10250 3 AT5G12020 4 AT4G27670 5 AT5G59720 6 AT1G18970  > setwd("E:/Affy data Col-0 priming") > raw_rma_normalized_expression_values <- read.delim("E:/Affy data Col-0 priming/raw_rma_normalized_expression_values.txt", header=FALSE) > View(raw_rma_normalized_expression_values) > excluding <- read.csv("E:/Affy data Col-0 priming/excluding.txt", sep="") > View(excluding) > mycounts <- read.table("RMA.txt", sep="\t", header=TRUE) > S <- read.table("excluding.txt", sep="", header=T) > intersect(mycounts$identifier,S$IDs) character(0) > write.table(mycounts, file = "RMA111.txt", dec = ".", sep = "\t", quote = FALSE) > mycounts <- read.table("RMA.txt", sep="\t", header=TRUE) > S <- read.table("excluding.txt", sep="", header=F) > S <- read.table("excluding.txt", sep="", header=T) > v1 <- c(mycounts) > v2 <- c(S) > subset(mycounts, identifier %in% S$V1)
Error in match(x, table, nomatch = 0L) : object 'identifier' not found
> intersect(mycounts$identifier,S$IDs)
character(0)
> subset(mycounts, identifier %in% S$V1) Error in match(x, table, nomatch = 0L) : object 'identifier' not found >  ADD REPLY 0 Entering edit mode subset(mycounts, identifier %in% S$V1)

0
Entering edit mode
subset(mycounts, identifier %in% S$V1) Error in match(x, table, nomatch = 0L) : object 'identifier' not found  ADD REPLY 0 Entering edit mode identifier: name of the column containing IDs in your mycounts dataset S$V1: the first column of your S dataset

0
Entering edit mode

Yes but told Error in match(x, table, nomatch = 0L) : object 'identifier' not found

1
Entering edit mode

Just substitute identifier with the name of the id column in your mycounts data frame. The error you get means that there is no column called identifier in your mycounts dataframe. It would be easier to help you posted at least the first lines of the datasets you want to intersect.

0
Entering edit mode

sorry, look at the below please

> head(mycounts)
Identifier GSM128877.CEL GSM131224.CEL GSM128879.CEL GSM131223.CEL GSM131226.CEL
1  AT1G01010      5.376319      5.781676      5.205693      5.753107      6.853383
2  AT1G01030      5.042002      5.449904      4.823182      5.340101      4.481017
3  AT1G01040      5.715077      7.453879      5.263682      7.668929      7.782487
4  AT1G01050      9.355328      9.922944      9.380094      9.816947     10.250628
5  AT1G01060      9.167021     11.080291      9.009276     10.990459      7.695344
6  AT1G01070      5.580228      6.058605      5.448064      5.858959      7.468354


then what is the solution?

1
Entering edit mode
subset(mycounts, Identifier %in% S$V1) # upper-case Identifier  ADD REPLY 0 Entering edit mode thank you, no error anymore but the output file contains 22000 yet while my purpose here is extracting only 100 rows (the same excluding gene list) from those 22000 rows in mycounts. maybe intersection means different and I should apply another word for my question. something like removing all rows except the 100 genes in excluding file. ADD REPLY 0 Entering edit mode How many rows there are in dataset S? Can you post the first lines? ADD REPLY 0 Entering edit mode totally 102 lines ADD REPLY 0 Entering edit mode The code above should extract only the 102 genes in the mycounts dataset. See this example: > mycounts = read.table('mycounts.csv') > mycounts Identifier GSM128877.CEL GSM131224.CEL GSM128879.CEL GSM131223.CEL GSM131226.CEL 1 AT1G01010 5.376319 5.781676 5.205693 5.753107 6.853383 2 AT1G01030 5.042002 5.449904 4.823182 5.340101 4.481017 3 AT1G01040 5.715077 7.453879 5.263682 7.668929 7.782487 4 AT1G01050 9.355328 9.922944 9.380094 9.816947 10.250628 5 AT1G01060 9.167021 11.080291 9.009276 10.990459 7.695344 6 AT1G01070 5.580228 6.058605 5.448064 5.858959 7.468354 > S = read.table('genes_to_keep.csv', header=F) > S V1 1 AT1G01030 2 AT1G01060 > subset(mycounts, Identifier %in% S$V1)
Identifier GSM128877.CEL GSM131224.CEL GSM128879.CEL GSM131223.CEL GSM131226.CEL
2  AT1G01030      5.042002      5.449904      4.823182      5.340101      4.481017
5  AT1G01060      9.167021     11.080291      9.009276     10.990459      7.695344


I am not sure if this is what you are asking, but you can do the inverse filter by using the ! operator:

> subset(mycounts, ! Identifier %in% S$V1) Identifier GSM128877.CEL GSM131224.CEL GSM128879.CEL GSM131223.CEL GSM131226.CEL 1 AT1G01010 5.376319 5.781676 5.205693 5.753107 6.853383 3 AT1G01040 5.715077 7.453879 5.263682 7.668929 7.782487 4 AT1G01050 9.355328 9.922944 9.380094 9.816947 10.250628 6 AT1G01070 5.580228 6.058605 5.448064 5.858959 7.468354  ADD REPLY 0 Entering edit mode Note that subset will only save the filtered dataset on the screen. To save it to a file, you need to assign it to a variable and save it: > mycounts.filtered = subset(mycounts, Identifier %in% S$V1)
> write.table(mycounts.filtered, file = "RMA111.txt", dec = ".", sep = "\t", quote = FALSE)