Question

Merging two files based on a particular column

1

Entering edit mode

4.6 years ago

mohammedtoufiq91 ▴ 250

Hi,

I have managed to merge two different files using a common column which has the same information using the below function:

Combine_Files <- merge(File1, File2, by="Symbols")

However, I would like to know if its possible to merge two different *.csv files i.e. one file (file 1) with the gene symbols only (one gene per row), and the other file (file 2) with the gene symbols and other annotation columns (many related genes per row separated by comma but one of the gene is common as (file 1). I would like to merge both the files based on the mapping of the gene symbols from file1 and extract other annotations columns from (file 2) in the combined file for further data analysis. I would like to know how this could be done.

For example: File_1

    Gene_Symbols
 1. GeneA1
 2. GeneA2
 3. GeneA3

For example: File_2

    Gene_Symbols
 1. GeneA1, GeneX1, GeneX2, GeneD
 2. GeneA2, GeneL1, GeneP2, GeneNA
 3. GeneB3, GeneA3, GeneLP1, GeneNA1

Other columns in File_2

Phenotype, GO Ontology, Pathways

Expected Output

Gene_Symbols, Phenotype, GO Ontology, Pathways

Thank you,
Toufiq

r • 1.5k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 4.6 years ago by mohammedtoufiq91 ▴ 250

1

Entering edit mode

"Other columns" are they part of File2? If yes, could you update your example, and provide expected output.

ADD REPLY • link 4.6 years ago by zx8754 11k

0

Entering edit mode

Yes, other columns are part of file 2. File 1 has one column and file 2 has four columns. The expected out should have 4 columns.

Does this example help?

Gene_Symbols_(file_1)

GeneA1
GeneA2
GeneA3

Gene_Symbols_(file_2) Phenotype GO Ontology Pathways

GeneA1,GeneX1,GeneX2,GeneD WO cell processes Pathway1
GeneA2,GeneL1,GeneP2,GeneNA KT inhibition Pathway2
GeneB3,GeneA3,GeneLP1,GeneNA1 WO1 activation Pathway3

Gene_Symbols_(Combined_file) Phenotype GO Ontology Pathways

GeneA1 WO cell processes Pathway1
GeneA2 KT inhibition Pathway2
GeneA3 WO1 activation Pathway3

ADD REPLY • link 4.6 years ago by mohammedtoufiq91 ▴ 250

0

Entering edit mode

Maybe you should use the grep function of R to simply grep the gene symbol of file 1 to that in file 2 and if the grep is not null you paste the colums to that row. You will need a loop, probably (or maybe a clever use of apply).

ADD REPLY • link 4.6 years ago by Fabio Marroni ★ 3.0k

score 4 · Accepted Answer · 2019-09-24

4

Entering edit mode

4.6 years ago

i.sudbery 19k

Use the tidyr package. The function separate_rows will allow you to take you file_2 and split each row so that each row has only one gene symbol, and the information in the other columns is copied. Thus

GeneA1,GeneX1,GeneX2    WO cell processes Pathway 1

becomes

GeneA1    WO cell processes Pathway 1
GeneX1    WO cell processes Pathway 1
GeneX2    WO cell processes Pathway 1

You can then do your merge as before.

ADD COMMENT • link 4.6 years ago by i.sudbery 19k

0

Entering edit mode

Thank you. But I would like the output as below:

GeneA1 WO cell processes Pathway1

I would like to ignore unmatching genes (just retain the matching one's) or put them in other columns.

ADD REPLY • link 4.6 years ago by mohammedtoufiq91 ▴ 250

1

Entering edit mode

You merge your file 1 with the file 2 obtained as i.sudbery suggested and you get what you want (merge by default discard non-matching rows)

ADD REPLY • link 4.6 years ago by Fabio Marroni ★ 3.0k

2

Entering edit mode

Exactly.

If you merge

Symbol    Phenotype   Go Ontology   Pathway
GeneA1    WO cell processes Pathway 1
GeneX1    WO cell processes Pathway 1
GeneX2    WO cell processes Pathway 1

With

Symbol
GeneA1

You'll get

Symbol    Phenotype   Go Ontology   Pathway
GeneA1    WO cell processes Pathway 1

ADD REPLY • link 4.6 years ago by i.sudbery 19k

0

Entering edit mode

Thank you so much. This was helpful. I obtained the expected out.

ADD REPLY • link 4.6 years ago by mohammedtoufiq91 ▴ 250

1

Entering edit mode

You can accept the answer if it did what you needed.

ADD REPLY • link 4.6 years ago by Benn 8.3k