Question: Merging two files based on a particular column
1
gravatar for mohammedtoufiq91
29 days ago by
mohammedtoufiq9170 wrote:

Hi,

I have managed to merge two different files using a common column which has the same information using the below function:

Combine_Files <- merge(File1, File2, by="Symbols")

However, I would like to know if its possible to merge two different *.csv files i.e. one file (file 1) with the gene symbols only (one gene per row), and the other file (file 2) with the gene symbols and other annotation columns (many related genes per row separated by comma but one of the gene is common as (file 1). I would like to merge both the files based on the mapping of the gene symbols from file1 and extract other annotations columns from (file 2) in the combined file for further data analysis. I would like to know how this could be done.

For example: File_1

    Gene_Symbols

 1. GeneA1
 2. GeneA2
 3. GeneA3

For example: File_2

    Gene_Symbols

 1. GeneA1, GeneX1, GeneX2, GeneD
 2. GeneA2, GeneL1, GeneP2, GeneNA
 3. GeneB3, GeneA3, GeneLP1, GeneNA1

Other columns in File_2

Phenotype, GO Ontology, Pathways

Expected Output

Gene_Symbols, Phenotype, GO Ontology, Pathways

Thank you,

Toufiq

package function merge R files • 144 views
ADD COMMENTlink modified 29 days ago by i.sudbery6.0k • written 29 days ago by mohammedtoufiq9170
1

"Other columns" are they part of File2? If yes, could you update your example, and provide expected output.

ADD REPLYlink written 29 days ago by zx87548.3k

Yes, other columns are part of file 2. File 1 has one column and file 2 has four columns. The expected out should have 4 columns.

Does this example help?

Gene_Symbols_(file_1)

  1. GeneA1
  2. GeneA2
  3. GeneA3

Gene_Symbols_(file_2) Phenotype GO Ontology Pathways

  1. GeneA1,GeneX1,GeneX2,GeneD WO cell processes Pathway1

  2. GeneA2,GeneL1,GeneP2,GeneNA KT inhibition Pathway2

  3. GeneB3,GeneA3,GeneLP1,GeneNA1 WO1 activation Pathway3

Gene_Symbols_(Combined_file) Phenotype GO Ontology Pathways

  1. GeneA1 WO cell processes Pathway1

  2. GeneA2 KT inhibition Pathway2

  3. GeneA3 WO1 activation Pathway3

ADD REPLYlink modified 29 days ago • written 29 days ago by mohammedtoufiq9170

Maybe you should use the grep function of R to simply grep the gene symbol of file 1 to that in file 2 and if the grep is not null you paste the colums to that row. You will need a loop, probably (or maybe a clever use of apply).

ADD REPLYlink written 29 days ago by Fabio Marroni2.4k
4
gravatar for i.sudbery
29 days ago by
i.sudbery6.0k
Sheffield, UK
i.sudbery6.0k wrote:

Use the tidyr package. The function separate_rows will allow you to take you file_2 and split each row so that each row has only one gene symbol, and the information in the other columns is copied. Thus

GeneA1,GeneX1,GeneX2    WO cell processes Pathway 1

becomes

GeneA1    WO cell processes Pathway 1
GeneX1    WO cell processes Pathway 1
GeneX2    WO cell processes Pathway 1

You can then do your merge as before.

ADD COMMENTlink written 29 days ago by i.sudbery6.0k

Thank you. But I would like the output as below:

GeneA1 WO cell processes Pathway1

I would like to ignore unmatching genes (just retain the matching one's) or put them in other columns.

ADD REPLYlink written 29 days ago by mohammedtoufiq9170
1

You merge your file 1 with the file 2 obtained as i.sudbery suggested and you get what you want (merge by default discard non-matching rows)

ADD REPLYlink written 29 days ago by Fabio Marroni2.4k
2

Exactly.

If you merge

Symbol    Phenotype   Go Ontology   Pathway
GeneA1    WO cell processes Pathway 1
GeneX1    WO cell processes Pathway 1
GeneX2    WO cell processes Pathway 1

With

Symbol
GeneA1

You'll get

Symbol    Phenotype   Go Ontology   Pathway
GeneA1    WO cell processes Pathway 1
ADD REPLYlink written 29 days ago by i.sudbery6.0k

Thank you so much. This was helpful. I obtained the expected out.

ADD REPLYlink written 29 days ago by mohammedtoufiq9170
1

You can accept the answer if it did what you needed.

ADD REPLYlink written 29 days ago by Benn7.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2426 users visited in the last hour