Merging two files based on a particular column
1
1
Entering edit mode
5.1 years ago

Hi,

I have managed to merge two different files using a common column which has the same information using the below function:

Combine_Files <- merge(File1, File2, by="Symbols")

However, I would like to know if its possible to merge two different *.csv files i.e. one file (file 1) with the gene symbols only (one gene per row), and the other file (file 2) with the gene symbols and other annotation columns (many related genes per row separated by comma but one of the gene is common as (file 1). I would like to merge both the files based on the mapping of the gene symbols from file1 and extract other annotations columns from (file 2) in the combined file for further data analysis. I would like to know how this could be done.

For example: File_1

    Gene_Symbols
 1. GeneA1
 2. GeneA2
 3. GeneA3

For example: File_2

    Gene_Symbols
 1. GeneA1, GeneX1, GeneX2, GeneD
 2. GeneA2, GeneL1, GeneP2, GeneNA
 3. GeneB3, GeneA3, GeneLP1, GeneNA1

Other columns in File_2

Phenotype, GO Ontology, Pathways

Expected Output

Gene_Symbols, Phenotype, GO Ontology, Pathways

Thank you,
Toufiq

r • 1.8k views
ADD COMMENT
1
Entering edit mode

"Other columns" are they part of File2? If yes, could you update your example, and provide expected output.

ADD REPLY
0
Entering edit mode

Yes, other columns are part of file 2. File 1 has one column and file 2 has four columns. The expected out should have 4 columns.

Does this example help?

Gene_Symbols_(file_1)

  1. GeneA1
  2. GeneA2
  3. GeneA3

Gene_Symbols_(file_2) Phenotype GO Ontology Pathways

  1. GeneA1,GeneX1,GeneX2,GeneD WO cell processes Pathway1

  2. GeneA2,GeneL1,GeneP2,GeneNA KT inhibition Pathway2

  3. GeneB3,GeneA3,GeneLP1,GeneNA1 WO1 activation Pathway3

Gene_Symbols_(Combined_file) Phenotype GO Ontology Pathways

  1. GeneA1 WO cell processes Pathway1

  2. GeneA2 KT inhibition Pathway2

  3. GeneA3 WO1 activation Pathway3

ADD REPLY
0
Entering edit mode

Maybe you should use the grep function of R to simply grep the gene symbol of file 1 to that in file 2 and if the grep is not null you paste the colums to that row. You will need a loop, probably (or maybe a clever use of apply).

ADD REPLY
4
Entering edit mode
5.1 years ago

Use the tidyr package. The function separate_rows will allow you to take you file_2 and split each row so that each row has only one gene symbol, and the information in the other columns is copied. Thus

GeneA1,GeneX1,GeneX2    WO cell processes Pathway 1

becomes

GeneA1    WO cell processes Pathway 1
GeneX1    WO cell processes Pathway 1
GeneX2    WO cell processes Pathway 1

You can then do your merge as before.

ADD COMMENT
0
Entering edit mode

Thank you. But I would like the output as below:

GeneA1 WO cell processes Pathway1

I would like to ignore unmatching genes (just retain the matching one's) or put them in other columns.

ADD REPLY
1
Entering edit mode

You merge your file 1 with the file 2 obtained as i.sudbery suggested and you get what you want (merge by default discard non-matching rows)

ADD REPLY
2
Entering edit mode

Exactly.

If you merge

Symbol    Phenotype   Go Ontology   Pathway
GeneA1    WO cell processes Pathway 1
GeneX1    WO cell processes Pathway 1
GeneX2    WO cell processes Pathway 1

With

Symbol
GeneA1

You'll get

Symbol    Phenotype   Go Ontology   Pathway
GeneA1    WO cell processes Pathway 1
ADD REPLY
0
Entering edit mode

Thank you so much. This was helpful. I obtained the expected out.

ADD REPLY
1
Entering edit mode

You can accept the answer if it did what you needed.

ADD REPLY

Login before adding your answer.

Traffic: 1963 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6