Question

How to remove duplicate rows by averaging their values having 1 million rows X 11 columns in Excel.

0

Entering edit mode

16 months ago

alvea.tasneem • 0

Hello,

I have a dataset having more than 1 million rows and 11 columns and I tried using consolidate average function to merge the duplicate rows by averaging their respective values. However, since my dataset is very large, excel is not giving me the result.

Can anyone guide how I can perform consolidate function with 1 million rows in excel? is there any command I can use in excel similar to consolidate function?

Thank you.

consolidate large dataset duplicates average excel • 1.4k views

ADD COMMENT • link updated 16 months ago by Gordon Smyth ★ 7.0k • written 16 months ago by alvea.tasneem • 0

0

Entering edit mode

Excel is not really suitable for crunching larger datasets or bioinformatics in general (is it bioinformatics?). Exporting your dataset into a CSV file and doing the calculation in R is my best bet. How do you specify duplicated rows if the values are different, possibly based on an identifier?

ADD REPLY • link 16 months ago by Michael 54k

0

Entering edit mode

yeah it is bioinformatics. and in R if I have to do it then what code should I use? my identifier is gene symbols and I need to merge duplicates by averaging the sample values of respective gene symbols.

ADD REPLY • link 16 months ago by alvea.tasneem • 0

0

Entering edit mode

Generally speaking, you should invest the time to learn R or Python properly if you plan to perform these kinds of tasks more frequently. But if it is a one of thing, we might be able to help you if you can post a few lines from your file and explain what exactly you are trying to achieve (awk might work as well).

You can also give Galaxy a spin, which provides a graphical user interface, but still gives you access to powerful tools in the background. Depending on the complexity of your tasks, the grouping tools might already work for you.

ADD REPLY • link 16 months ago by Matthias Zepper 4.5k

score 0 · Answer 1 · 2022-12-05

0

Entering edit mode

16 months ago

Gordon Smyth ★ 7.0k

The avereps() function of the limma package does exactly that. In R of course rather than Excel.

But see https://support.bioconductor.org/p/90980/ for why averaging might not be the best approach.

ADD COMMENT • link 16 months ago by Gordon Smyth ★ 7.0k

0

Entering edit mode

Thank you Gordon Smyth for the advice.

I am actually new to processing microarray data and this is my first one.

I have a code that I am running that removes duplicate gene symbols by averaging their sample values. However, when I ran that code, it is not averaging the duplicate genes correctly.

mapped_data.gse43754 <- read.delim("GENEIDseries.txt", header=TRUE, sep = "\t") AGG <- aggregate(mapped_data.gse43754[,-c(1)], by=list(Gene = mapped_data.gse43754$GENE_SYMBOL), FUN=mean, na.rm =TRUE) rownames(AGG) <- AGG[,1] AGG[,1] <- NULL

This is the file it is generating,

GENE_SYMBOLS GSM1070367 GSM1070368 GSM1070369 GSM1070370 GSM1070371 GSM1070372 GSM1070373 GSM1070374 GSM1070375 GSM1070376 A1CF 1.397 1.353 1.531 1.270 1.475 1.896 1.409 2.128 1.873 1.365 A1CF 4.157 4.103 3.925 4.028 4.135 3.971 4.154 4.129 3.953 3.844 A4GNT 4.883 5.128 4.871 4.976 5.040 5.437 5.326 4.959 5.143 5.379 AAA1 3.900 3.664 3.591 3.646 3.556 3.533 3.666 3.726 3.689 3.696 AAA1 4.832 4.377 4.881 4.487 4.070 4.842 4.632 4.765 4.782 4.184 AADAC 2.302 2.395 2.282 2.695 2.477 2.473 2.278 2.545 2.305 2.289

Here, gene id A1CF and AAA1 occur twice hence, the code above is not removing the duplicates.

Please, help me in resolving this matter...

Thank you.

ADD REPLY • link 16 months ago by alvea.tasneem • 0

0

Entering edit mode

I already pointed you to an efficient, easy solution that you have chosen to ignore.

ADD REPLY • link 16 months ago by Gordon Smyth ★ 7.0k