Question: Identifying repeated genes from multiple lists
5.6 years ago
Megatron10
Megatron wrote:

Hello,

I have 25 lists of genes.

Each list contains anywhere from 1 to 50 genes.

I want to find which genes show up most frequently across these 25 lists.

Can anyone help?

What I have tried in R:

``Reduce(intersect, list(a,b,c))``

However, when I input all 25 lists, it usually gives me an empty result because no single gene appears on all 25 lists.

My aim is a list of genes ranked by how often they appear across these 25 lists.

Thanks

R gene • 1.3k views
`cat list.* | sort | uniq -c | sort -n | tail `

?

Can you be a little more specific? I'm new at R.

I loaded the lists into my Global Environment, so I'm not sure whether I have to `cat` or `sort` them.

When I ran `unique(list1, list2, list3, ..., list25)`, R said "hash table is full". I'm also not sure what `-c` means.

Thanks for your patience; this is probably very easy for you.

If it helps, here is my data specifically that I enter into the R Console:

list1 <- c(65,84,137,159,164,209,209,221,330)

list10 <- c(3,7,25,28,44,44,46,54,58,66,69,85,88,109,129,155,155,168,187,187,187,190,191,196,204,208,233,247,262,275,288,316,333,347,350,356)

list11 <- c(12,33,52,61,63,67,75,79,81,82,87,95,99,101,108,114,121,130,132,138,144,147,147,148,165,171,173,178,182,189,197,202,220,229,234,236,238,240,246,247,259,262,263,274,276,280,280,290,298,308,312,326,329,331,335,337,339,341)

list13 <- c(17,36,71,73,74,91,96,123,150,205,211,213,255,277,307,318,339,342,358)

list14 <- c(4,4,5,5,7,15,20,29,31,62,78,80,104,109,117,127,130,132,161,179,184,188,192,194,195,200,202,206,218,230,232,235,242,245,257,257,259,261,281,292,293,302,304,306,310,311,324,327,336,345,354)

list15 <- c(50,103,121,136,156,174,187,247,251,253,258,310,319,336,343)

list16 <- c(11,109,128,140,172,181,188,201,207,247,247,265,279,344,356,358)

list17 <- c(10,21,59,199,299)

list18 <- c(53,57,63,90,165,176,198,243,315,338,351)

list19 <- c(6,9,23,35,53,94,106,107,113,118,124,126,146,146,203,216,237,244,248,266,268,285,286,289,296,298,300,300,314,340)

list2 <- c(20,35,39,49,79,105,111,116,119,130,141,143,147,147,151,159,160,167,174,180,212,214,239,250,252,256,267,271,291,301,305,307,318,322,351)

list20 <- c(320)

list21 <- c(346)

list3 <- c(2,13,38,55,70,81,88,98,115,133,133,153,154,162,169,183,212,274,340,348,349,355)

list4 <- c(270,278,354)

list5 <- c(32,135,196,297)

list6 <- c(290,316,317)

list7 <- c(14,26,34,41,42,76,132,163,186,222,225,231,232,239,269,272,303,313,334,352,353,356,357)

list8 <- c(4,8,16,30,40,43,47,56,97,98,110,122,130,149,185,217,236,236,282,321)

list9 <- c(1,11,16,18,19,20,22,24,27,32,37,45,48,51,60,63,64,68,69,70,72,77,83,86,89,91,92,93,100,102,104,112,116,120,122,123,125,131,134,139,142,145,152,157,158,162,164,166,170,170,171,175,177,183,193,210,215,219,223,224,226,227,228,232,241,247,249,250,254,260,264,272,273,280,280,280,283,284,287,294,295,298,301,309,310,313,320,323,324,325,328,332,356)
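Since the vectors are already in the workspace, one possible way to gather them without typing all the names is `mget()` combined with `ls()`. This is only a sketch: it assumes every list variable is named `list` followed by digits, and it uses three of the vectors posted above as stand-ins for the full set. It also shows that `unique()` works on a single combined vector, which is likely why passing many vectors as separate arguments failed.

```r
# Three of the lists posted above, standing in for the full set
list4  <- c(270, 278, 354)
list20 <- c(320)
list21 <- c(346)

# The environment these variables live in (the Global Environment
# when run at the console)
env <- environment()

# Collect every variable whose name is "list" followed by digits
# (assumed naming scheme) into one named list
list_names <- ls(envir = env, pattern = "^list[0-9]+$")
gene_lists <- mget(list_names, envir = env)

# unique() takes one vector, so flatten the lists first
all_genes <- unlist(gene_lists, use.names = FALSE)
distinct_genes <- unique(all_genes)
```

`gene_lists` can then be passed to whichever counting approach is used, instead of writing out `list1, list2, ...` by hand.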

5.6 years ago
ethan.kaufman380
ethan.kaufman wrote:

`sort(table(c(list1, list2, ..., list25)), decreasing=TRUE)`
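Note that some of the posted lists contain the same gene more than once (for example, 209 appears twice in `list1`), so a plain `table()` over the combined vector counts total occurrences rather than the number of lists a gene appears in. A minimal sketch of the per-list variant, using small made-up vectors and a hypothetical `gene_lists` container:

```r
# Made-up example data; in practice gene_lists would hold all of the lists
gene_lists <- list(
  list1 = c(65, 84, 209, 209, 247),
  list2 = c(20, 35, 147, 147, 247),
  list3 = c(2, 13, 84, 247)
)

# Drop duplicates inside each list so a gene counts at most once per list
per_list <- lapply(gene_lists, unique)

# Tabulate across lists and sort from most to least frequent
counts <- sort(table(unlist(per_list)), decreasing = TRUE)
head(counts)
```

Here gene 247 is counted three times (once per list) and gene 209 only once, even though it is repeated within `list1`.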

5.6 years ago
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds wrote:

What is the format of your gene list?

If it is just a text file split by newlines, then you could easily do this on the command line with `awk`:

```
$ cat geneListA.txt geneListB.txt ... geneListN.txt \
    | awk '
        {
            geneCounts[$0]++;
        }
        END {
            for (geneName in geneCounts) {
                print geneName"\t"geneCounts[geneName];
            }
        }' - \
    > unsortedCounts.txt
```

The file `unsortedCounts.txt` is an unsorted two-column file containing the gene name and its count across files `geneListA.txt` through `geneListN.txt`.

To sort this by counts, just pipe the output of the `awk` statement to GNU `sort` and do a (descending) numeric sort on the second column:

```
$ cat geneListA.txt geneListB.txt ... geneListN.txt \
    | awk '
        {
            geneCounts[$0]++;
        }
        END {
            for (geneName in geneCounts) {
                print geneName"\t"geneCounts[geneName];
            }
        }' - \
    | sort -k2,2nr - \
    > sortedCounts.txt
```
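For anyone without `awk` and `sort` at hand, roughly the same pipeline can be sketched in R. This assumes the gene lists are plain text files with one gene per line; the file names, the temporary directory, and the gene symbols below are made up purely so the example is self-contained.

```r
# Made-up input: two newline-delimited gene lists in a temporary directory
tmp_dir <- tempfile("genelists")
dir.create(tmp_dir)
writeLines(c("BRCA1", "TP53", "EGFR"), file.path(tmp_dir, "geneListA.txt"))
writeLines(c("TP53", "EGFR", "TP53"), file.path(tmp_dir, "geneListB.txt"))

# Read every .txt file and pool the lines (the equivalent of `cat`)
files <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)
genes <- unlist(lapply(files, readLines))

# Count each gene and sort by descending count (awk + sort)
gene_counts <- sort(table(genes), decreasing = TRUE)
sorted_counts <- data.frame(
  gene  = names(gene_counts),
  count = as.vector(gene_counts)
)

# Write a two-column, tab-separated file without a header
out_file <- file.path(tmp_dir, "sortedCounts.tsv")
write.table(sorted_counts, out_file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
```

Like the `awk` version, this counts every line, so a gene repeated within one file is counted each time it occurs.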

Thanks, let me try this and report back

I just tried downloading gawk for Windows plus the source files, ran Gnuwin32/bin/awk from MS-DOS, and placed the gene lists in the directory, but I am completely lost. Maybe staying in R is a better option.

edit: or if you could provide some simpler steps

cheers

From MS-DOS, I realized that `type` is the equivalent of `cat`,

so I did `cat genelist1.txt genelist2.txt genelist3.txt` and all the lists were printed out in cmd.

Then,

`gawk { \geneCounts[$0]++; \}` gives me an "invalid character" error.

Don't do bioinformatics on Windows. Sorry to be a snob about it, but you'll otherwise have to jump through numerous hoops to do common command-line tasks like these. Either swap out your OS or run your analyses within a Linux VM in VirtualBox or similar.

5.6 years ago
alolex910
United States
alolex wrote:

Megatron,

I had to do something similar in R a while ago.  You'll need to create a list of lists, a list of unique gene IDs and then a matrix of counts.  Here is my code:

```
my.lists <- list(list1=c(123,234,345), list2=c(45,23,12,78,43,87,123), list3=c(123,432,234,45,23))
unique_genes <- unique(unlist(my.lists))
# set up an empty matrix: one row per list, one column per unique gene
mtx <- matrix(0, nrow=length(names(my.lists)), ncol=length(unique_genes))
rownames(mtx) <- names(my.lists)
colnames(mtx) <- unique_genes
# populate the matrix: 1 where the list contains the gene
for(i in rownames(mtx)){
  mtx[i,(colnames(mtx) %in% my.lists[[i]])] <- 1
}
# number of lists containing each gene, most frequent first
freqSorted <- sort(colSums(mtx), decreasing=TRUE)
```
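One advantage of keeping a membership matrix is that it can answer follow-up questions beyond the frequency ranking, such as which genes are shared by at least two lists or which lists contain a particular gene. The sketch below rebuilds the same kind of matrix with `sapply()` rather than a loop; the names `membership`, `shared`, and `lists_with_123` are illustrative only.

```r
my.lists <- list(
  list1 = c(123, 234, 345),
  list2 = c(45, 23, 12, 78, 43, 87, 123),
  list3 = c(123, 432, 234, 45, 23)
)
unique_genes <- unique(unlist(my.lists))

# One row per list, one column per gene; TRUE where the list contains the gene
membership <- t(sapply(my.lists, function(genes) unique_genes %in% genes))
colnames(membership) <- unique_genes

# Number of lists containing each gene, most frequent first
freq <- sort(colSums(membership), decreasing = TRUE)

# Genes that appear in at least two lists
shared <- names(freq)[freq >= 2]

# Lists that contain gene 123
lists_with_123 <- rownames(membership)[membership[, "123"]]
```

The threshold of two lists is arbitrary here; it can be raised or lowered depending on how strict the overlap should be.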