Question: common data extraction
gravatar for Azhar
3.3 years ago by
Azhar40 wrote:

I have excel file with 4 spread sheets and each sheet has one column of gene name and corresponding fold change and expression and other values, each column in row number is diffrenet so i want to compare these column and get common gene names in two or three or all four sheets with there respective expression and foldchange values ?


A    foldchange  B foldchange  C foldchange D foldchange 
a       5                 a     5                    k       9         d          5  
c          7               g       10                g       10      g       10  
d          5                h         11            h         11      a     5         
                              c          7                                   h         11
d          5        
g          2
h            8

and output i need like

gene name        A    FC          B    FC      C   FC         D FC \
                        a   5             a      5       NA    0       NA 0
rna-seq job • 1.3k views
ADD COMMENTlink modified 2.8 years ago by aarti0 • written 3.3 years ago by Azhar40

Hi m.azhar, do you consider exporting the data into text files and use R or unix tools for this task? I wouldn´t expect much help with doing this in Excel from Biostars, because Excel is not considered a bioinformatics tool by most.

ADD REPLYlink written 3.3 years ago by Michael Dondrup46k

yes thanks for advice can you sugesst some method tool or script how to do this

ADD REPLYlink written 3.3 years ago by Azhar40

Please google on how to save an Excel file in CSV format. If you have data in multiple sheets in the workbook, you may have to export each sheet to a separate file. Once this is done, you can read the CSV using any plain text processing tool.

ADD REPLYlink written 3.3 years ago by RamRS24k

This is not necessarily a good idea, depending on the actual format of the original data. It is very likely that the OP's source data is not actually in "Excel" format, but is already in some flat text format such as TSV, TXT (tab-delimited), CSV (comma-delimited) or XLS (as opposed to XLSX). If this is the case, then opening the files in Excel and then saving them in another format is likely to introduce formatting errors. Excel's built-in auto-correct feature is very prone to mangling gene names and NA values and saving them in formats that make it more difficult to handle later. The OP needs to figure out exactly what format the source data is in by looking at the file extension and attempting to open the files in a raw text viewer such as TextEdit (Mac) or Notepad (Windows). Once this information is obtained, then corresponding data conversion steps can be considered.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by steve2.4k

True. I was assuming OP had checked for such file association related goof-ups. I think we should, as one of the first steps of setting up a bioinformatics machine, decouple CSV and TSV files from spreadsheet programs and have them open in an advanced text editor, such as TextWrangler or Notepad++.

(I avoid TextEdit and notepad because I cannot trust them to detect EOL conversions, vertical tabs, etc.)

ADD REPLYlink written 3.3 years ago by RamRS24k
gravatar for aarti
2.8 years ago by
aarti0 wrote:

You can compare four data sheets in Excel also. All you need is minimum one common column across sheets (in your case gene identifier). You can apply VLOOKUP

ADD COMMENTlink written 2.8 years ago by aarti0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1745 users visited in the last hour