Question

Why am I getting only one common value when comparing two excel files?

0

Entering edit mode

20 months ago

namck • 0

I have two excel files which have quite a a lot common gene names. What I would like to do is finding the common gene names and adding the corresponding columns with their values to a new csv file.

The below code gives an output containing a single row, which should not be the case as I have sufficient common gene names. Therefore, can someone please tell me what I'm doing wrong here?

Thanks!

First excel file,

Second excel file,

Ouput what I got, (the formatting is correct, but why am I getting only one?) My desired output should look like this

Snippet from the code, snippet from the code I've tried

python pandas excel • 1.0k views

ADD COMMENT • link updated 20 months ago by seidel 11k • written 20 months ago by namck • 0

0

Entering edit mode

Are you certain there are absolute matches between the two? The example shows df2 with many gene names capitalized. These won't match if they are not also capitalized in df1. Can you sort each and show a top set with names in common? Could there be a trailing space, or other character in one of the gene name columns? (unlikely, but I've seen this happen).

ADD REPLY • link 20 months ago by seidel 11k

0

Entering edit mode

Thanks! I have made both the files lowercase and run the same file. This time it returned a list of common genes, but subsequent column are missing now. I am getting output like this, new_output

ADD REPLY • link 20 months ago by namck • 0

0

Entering edit mode

According to your code, you asked for a set of common values, and that's what it returned. Are you trying to merge the two data frames with these values? (I gave an answer for this below). Or are you trying to subset one data frame by the values in common? What code are you running for the image above? What were you trying to achieve?

ADD REPLY • link 20 months ago by seidel 11k

score 0 · Answer 1 · 2023-02-02

0

Entering edit mode

20 months ago

bkleiboeker ▴ 370

It looks like the first dataframe has gene names with the first letter capitalized, while the second dataframe has the gene names as all lowercase.

In any case, I'd personally suggest trying to go upstream and use Ensembl or Entrez IDs instead of gene symbols if possible.

ADD COMMENT • link 20 months ago by bkleiboeker ▴ 370

score 0 · Answer 2 · 2023-02-02

You mentioned fixing the case of the values shared by each table, so that they will match. If you'd like to join the two tables together, you could do so using the merge command. For example:

import pandas as pd

# create two data frames
df1 = pd.DataFrame({'gene': ['act1', 'gapdh', 'foomoo'], 'expression': [3,1,-12], 'chr': ['chr1', 'chr12', 'chr39']})
df2 = pd.DataFrame({'ID': ['oct4', 'gapdh', 'act1'], 'tiktok_rank': [23,145,3], 'fav_cell_type': ['skin', 'lips', 'banana']})

# see what they have in common
set(df1['gene']).intersection(set(df2['ID']))

{'act1', 'gapdh'}

# merge them to one
pd.merge(df1, df2, left_on='gene', right_on='ID')

merged