Question: Identifying repeated genes from multiple lists
5.6 years ago by
Megatron10
Canada
Megatron10 wrote:

Hello,

 

I have 25 lists of genes.

Each list contains anywhere from 1 to 50 genes.

I want to process these lists to find, between these 25 lists, which genes show up most frequently. 

Can anyone help?

What I have tried in R:

Loading all 25 lists, and then running

Reduce(intersect, list(a,b,c))

However, when I input all 25 lists, it usually gives me a null result because no single gene appears on all 25 lists.

My aim is a list of genes ranked by how frequently they appear across these 25 lists.

Thanks

R gene • 1.3k views
cat list.* | sort | uniq -c | sort -n | tail 

?
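A minimal sketch of what this pipeline does, assuming each list has been saved as a plain-text file with one gene ID per line. The file names and the three small lists below are hypothetical demo data, not the lists from this thread:

```shell
# Hypothetical demo data: three small lists, one gene ID per line.
workdir=$(mktemp -d)
printf '%s\n' 65 84 209 209 > "$workdir/list.1"
printf '%s\n' 84 130 247    > "$workdir/list.2"
printf '%s\n' 84 247 356    > "$workdir/list.3"

# cat      : concatenate all the lists into one stream
# sort     : put identical IDs next to each other
# uniq -c  : collapse each run of identical lines and prefix it with its count
# sort -n  : order the result by that count, lowest first
# tail     : keep only the last ten lines, i.e. the most frequent IDs
counts=$(cat "$workdir"/list.* | sort | uniq -c | sort -n | tail)
echo "$counts"
```

The "-c" option of uniq is what produces the count in front of each ID. Here the last line of the output is the ID 84 with a count of 3, because it occurs in all three demo lists.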

written 5.6 years ago by Pierre Lindenbaum

Can you be a little more specific? I'm new at R.

I loaded the lists into my Global Environment, so I'm not sure whether I have to cat the lists or sort them.

When I ran unique(list1, list2, list3, ..., list25), it said the hash table is full. I'm also not sure what "-c" means.

Thanks for your patience; this is probably very easy for you.

If it helps, here is my data specifically that I enter into the R Console:

Load lists:

list1 <- c(65,84,137,159,164,209,209,221,330)

list10 <- c(3,7,25,28,44,44,46,54,58,66,69,85,88,109,129,155,155,168,187,187,187,190,191,196,204,208,233,247,262,275,288,316,333,347,350,356)

list11 <- c(12,33,52,61,63,67,75,79,81,82,87,95,99,101,108,114,121,130,132,138,144,147,147,148,165,171,173,178,182,189,197,202,220,229,234,236,238,240,246,247,259,262,263,274,276,280,280,290,298,308,312,326,329,331,335,337,339,341)

list13 <- c(17,36,71,73,74,91,96,123,150,205,211,213,255,277,307,318,339,342,358)

list14 <- c(4,4,5,5,7,15,20,29,31,62,78,80,104,109,117,127,130,132,161,179,184,188,192,194,195,200,202,206,218,230,232,235,242,245,257,257,259,261,281,292,293,302,304,306,310,311,324,327,336,345,354)

list15 <- c(50,103,121,136,156,174,187,247,251,253,258,310,319,336,343)

list16 <- c(11,109,128,140,172,181,188,201,207,247,247,265,279,344,356,358)

list17 <- c(10,21,59,199,299)

list18 <- c(53,57,63,90,165,176,198,243,315,338,351)

list19 <- c(6,9,23,35,53,94,106,107,113,118,124,126,146,146,203,216,237,244,248,266,268,285,286,289,296,298,300,300,314,340)

list2 <- c(20,35,39,49,79,105,111,116,119,130,141,143,147,147,151,159,160,167,174,180,212,214,239,250,252,256,267,271,291,301,305,307,318,322,351)

list20 <- c(320)

list21 <- c(346)

list3 <- c(2,13,38,55,70,81,88,98,115,133,133,153,154,162,169,183,212,274,340,348,349,355)

list4 <- c(270,278,354)

list5 <- c(32,135,196,297)

list6 <- c(290,316,317)

list7 <- c(14,26,34,41,42,76,132,163,186,222,225,231,232,239,269,272,303,313,334,352,353,356,357)

list8 <- c(4,8,16,30,40,43,47,56,97,98,110,122,130,149,185,217,236,236,282,321)

list9 <- c(1,11,16,18,19,20,22,24,27,32,37,45,48,51,60,63,64,68,69,70,72,77,83,86,89,91,92,93,100,102,104,112,116,120,122,123,125,131,134,139,142,145,152,157,158,162,164,166,170,170,171,175,177,183,193,210,215,219,223,224,226,227,228,232,241,247,249,250,254,260,264,272,273,280,280,280,283,284,287,294,295,298,301,309,310,313,320,323,324,325,328,332,356)

modified 5.6 years ago • written 5.6 years ago by Megatron
5.6 years ago by
ethan.kaufman380
Canada
ethan.kaufman380 wrote:

sort(table(c(list1, list2, ..., list25)), decreasing=TRUE)

5.6 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

What is the format of your gene list?

If it is just a text file split by newlines, then you could easily do this on the command line with awk:

$ cat geneListA.txt geneListB.txt ... geneListN.txt \
    | awk '
        {
            geneCounts[$0]++;
        }
        END {
            for (geneName in geneCounts) {
                print geneName"\t"geneCounts[geneName];
            }
        }' - \
    > unsortedCounts.txt

The file unsortedCounts.txt is an unsorted two-column file containing the gene name and its count across files geneListA.txt through geneListN.txt.

To sort this by counts, just pipe the output of the awk statement to GNU sort and do a (descending) numeric sort on the second column:

$ cat geneListA.txt geneListB.txt ... geneListN.txt \
    | awk '
        {
            geneCounts[$0]++;
        }
        END {
            for (geneName in geneCounts) {
                print geneName"\t"geneCounts[geneName];
            }
        }' - \
    | sort -k2,2nr - \
    > sortedCounts.txt
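
Several of the lists posted in this thread contain the same ID more than once (list1 has 209 twice, for example), so counting lines across the concatenated files gives the number of occurrences rather than the number of lists an ID appears in. If the latter is what is wanted, a sketch along these lines might work. It assumes one gene ID per line; the file names and demo data are hypothetical, and the arrays seen and listCounts are names introduced here for illustration:

```shell
# Hypothetical demo data: 209 is repeated within the first list.
workdir=$(mktemp -d)
printf '%s\n' 65 84 209 209 > "$workdir/geneListA.txt"
printf '%s\n' 84 130 247    > "$workdir/geneListB.txt"
printf '%s\n' 84 247 356    > "$workdir/geneListC.txt"

# seen[FILENAME, $0] is zero the first time an ID is met in a given file,
# so each ID is counted at most once per list.
perList=$(awk '
    !seen[FILENAME, $0]++ { listCounts[$0]++ }
    END {
        for (geneName in listCounts) {
            print geneName "\t" listCounts[geneName]
        }
    }' "$workdir"/geneList*.txt | sort -k2,2nr -k1,1n)
echo "$perList"
```

With the demo data, 84 comes first with a count of 3 and 209 is reported with a count of 1, even though it occurs twice in geneListA.txt. The second sort key only breaks ties by ID so that the output order is reproducible.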

Thanks, let me try this and report back

written 5.6 years ago by Megatron

I just tried downloading gawk for Windows plus the source files, accessed Gnuwin32/bin/awk etc. from MS-DOS, and placed the gene lists in the directory, but I am completely lost. Maybe staying with R is a better option.

edit: or if you could provide some simpler steps

 

cheers

modified 5.6 years ago • written 5.6 years ago by Megatron

From MS-DOS, I realized that type is the equivalent of cat, so I ran cat genelist1.txt genelist2.txt genelist3.txt and all the lists were printed out in the cmd window.

Then,

gawk { \geneCounts[$0]++; \} gives me an "invalid character" error.

 

written 5.6 years ago by Megatron

Don't do bioinformatics on Windows. Sorry to be a snob about it, but you'll otherwise have to jump through numerous hoops to do common command-line tasks like these. Either swap out your OS or run your analyses within a Linux VM in VirtualBox or similar.

written 5.6 years ago by Alex Reynolds
5.6 years ago by
alolex910
United States
alolex910 wrote:

Megatron,

I had to do something similar in R a while ago.  You'll need to create a list of lists, a list of unique gene IDs and then a matrix of counts.  Here is my code:

# Put the lists into one named list
my.lists <- list(list1=c(123,234,345), list2=c(45,23,12,78,43,87,123), list3=c(123,432,234,45,23))
unique_genes <- unique(unlist(my.lists))
# Set up an empty matrix: one row per list, one column per unique gene
mtx <- matrix(0, nrow=length(my.lists), ncol=length(unique_genes))
rownames(mtx) <- names(my.lists)
colnames(mtx) <- unique_genes
# Mark each gene that is present in a list with a 1
for(i in rownames(mtx)){
    mtx[i, colnames(mtx) %in% my.lists[[i]]] <- 1
}
# Number of lists each gene appears in, most frequent first
freqSorted <- sort(colSums(mtx), decreasing=TRUE)

 
