Question

How to assign matching unique IDs to two datasets so that I can merge them correctly?

0

6.0 years ago

ishackm ▴ 110

Hi all,

I have the following two datasets:

ECM Proteomics Dataset

ECM Proteomics Data

ECM Isoform Dataset

I would like to merge the TGE amino acid sequence and the peptide sequence, but these two datasets do not share a unique identifier for each row.

How can I create a unique identifier for each row in each dataset, so that I can merge each amino acid sequence with the correct peptide sequence?

For the ECM Proteomics Dataset, I did the following in pandas. This created a unique ID based on all the columns:

ECM['id'] = ECM.groupby(['Gene.Symbol','Division','Category','PI','Protein.Name..name.of.reference.protein.',
                           'Protein.description','Sequence..TGE.amino.acid.seq.']).ngroup()

How can I assign the same exact unique IDs to the same exact rows to the other dataset please?
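One possible way to get matching IDs, sketched here under the assumption that both tables carry a gene-symbol column (the toy frames and the column names `Gene.Symbol`, `gene_symbol`, `Sequence` and `peptide` are hypothetical stand-ins, not the real files): number the keys of both tables together, so that equal keys receive equal codes. `ngroup()` on one table alone cannot do this, because its numbering depends only on the values present in that table.

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets; the column names are assumptions.
ecm = pd.DataFrame({
    "Gene.Symbol": ["A2M", "A2M", "STIP1"],
    "Sequence": ["MNGL", "MNGL", "OTL"],
})
peptides = pd.DataFrame({
    "gene_symbol": ["RL4", "A2M"],
    "peptide": ["AAAA", "BBBB"],
})

# Stack the key columns of both tables and factorize them in one pass,
# so that the same gene symbol maps to the same integer code everywhere.
all_keys = pd.concat(
    [ecm["Gene.Symbol"], peptides["gene_symbol"]], ignore_index=True
)
codes, uniques = pd.factorize(all_keys)

# Split the codes back onto the two tables in their original row order.
ecm["id"] = codes[: len(ecm)]
peptides["id"] = codes[len(ecm):]

print(ecm)
print(peptides)
```

Note that such an ID identifies a gene, not a row: it is only as unique as the shared column it is built from.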

If you could give me examples in R or Python, that would be greatly appreciated.

Kind Regards,

Ishack

r python • 3.5k views

ADD COMMENT • link 6.0 years ago by ishackm ▴ 110

0

Please show us what you've tried using R/python and we can help you get over any obstacles. We'll be unable to do your work for you, though.

ADD REPLY • link 6.0 years ago by Ram 45k

0

Hi Ram,

Sorry about the incomplete post.

For the ECM Proteomics Dataset, I did the following in pandas. This created a unique ID based on all the columns:

ECM['id'] = ECM.groupby(['Gene.Symbol','Division','Category','PI','Protein.Name..name.of.reference.protein.',
                           'Protein.description','Sequence..TGE.amino.acid.seq.']).ngroup()

How can I assign the same exact unique IDs to the same exact rows to the other dataset please?

ADD REPLY • link 6.0 years ago by ishackm ▴ 110

0

I'm sorry, I don't know pandas. Maybe someone else that knows pandas can help you out. In the meanwhile, I'd recommend editing your post and adding the content from your comment in there.

ADD REPLY • link 6.0 years ago by Ram 45k

0

OK, can you please show me how to do it in R?

ADD REPLY • link 6.0 years ago by ishackm ▴ 110

0

Can anyone please help?

ADD REPLY • link 6.0 years ago by ishackm ▴ 110

1

ishackm: you will gain much respect by going away for a few days and trying this on your own. Asking questions like "Can anyone please help?" seems somewhat desperate (?) Fair is fair - we have all been where you are right now.

Edit: Although I say 'on your own', there is more than enough material on the World Wide Web for you to search and, in that process, teach yourself.

ADD REPLY • link 6.0 years ago by Kevin Blighe 89k

0

I do apologise, Kevin. It's just that I have been trying to solve this problem for 3 days now and have got nowhere, which is why I asked this question.

ADD REPLY • link 6.0 years ago by ishackm ▴ 110

0

All is fine. Do not worry.

ADD REPLY • link 6.0 years ago by Kevin Blighe 89k

0

So is there any library in R that can help me do this?

I want to merge every amino acid sequence with all the possible peptides related to that particular gene

ADD REPLY • link 6.0 years ago by ishackm ▴ 110

1

Hey dude / dudette. I was working. If I had to do this in R, I would, first, find the reference data that I need outside R, input this to R, and then do the processing there.

These also look promising:

In fact, there seems to be 'Pep' this and 'pep' that... lots of programs. That's Pep-tastic!

ADD REPLY • link 6.0 years ago by Kevin Blighe 89k

0

I'm not sure I get it. Can you give an example of two lines (one from each file) that you would merge and tell us based on which information you would want to merge it? I am guessing, the protein ID will be of significance...?

ADD REPLY • link 6.0 years ago by Friederike 9.0k

0

For example, this is one isoform (ECM TGE)

I would like to merge same isoform with all the A2M peptides:

So result is like this please:

enter image description here

I would like something like this for every gene please.

same amino acid sequence compared with each peptide

How can I do that please?

ADD REPLY • link 6.0 years ago by ishackm ▴ 110

0

Both datasets have Gene.Symbol. Can't you use that to pull the matching column from the other dataset? My guess is that the two datasets have a different number of rows per Gene.Symbol, so you may not be able to combine them row for row, but at least you will be able to separate the information per gene.
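A minimal pandas sketch of that idea, assuming both tables expose the gene symbol under the same column name `Gene.Symbol` (the toy frames below are hypothetical, not the real data): count the rows per gene in each table to see where the numbers differ, then pull out the rows for a single gene.

```python
import pandas as pd

# Hypothetical toy tables; only the shared Gene.Symbol column matters here.
ecm = pd.DataFrame({"Gene.Symbol": ["A2M", "A2M", "A2M", "STIP1"]})
peptides = pd.DataFrame({"Gene.Symbol": ["RL4", "A2M"]})

# Rows per gene in each table, side by side; genes missing from one table get 0.
counts = (
    pd.concat(
        {
            "ecm_rows": ecm["Gene.Symbol"].value_counts(),
            "peptide_rows": peptides["Gene.Symbol"].value_counts(),
        },
        axis=1,
    )
    .fillna(0)
    .astype(int)
)
print(counts)

# All rows belonging to one gene, taken separately from each table.
a2m_ecm = ecm[ecm["Gene.Symbol"] == "A2M"]
a2m_peptides = peptides[peptides["Gene.Symbol"] == "A2M"]
```

If the counts differ for a gene, a merge on that gene will repeat rows rather than pair them one to one.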

ADD REPLY • link 6.0 years ago by prabin.dm ▴ 260

score 2 · Answer 1 · 2019-06-26

Maybe something like this:

## making dummy data -- you would have to read in those tab files (SEE MY NOTE!)
> df1 <- data.frame(acc_no = c("RL4_HUMAN", "A2M_HUMAN"), peptide = c("AAAA","BBBB"), uniprot = c("bla","bli"), gene_symbol = c("RL4;","A2M;"))
> df1
     acc_no peptide uniprot gene_symbol
1 RL4_HUMAN    AAAA     bla        RL4;
2 A2M_HUMAN    BBBB     bli        A2M;

> df2 <- data.frame(gene.symbol = c("A2M", "A2M","A2M","STIP1"), Sequence..TGE.amino.acid.seq = c("MNGL","MNGL","MNGL","OTL"))

## need to make sure that the semicolons aren't going to mess with our merge
> df1$gene_symbol <- gsub(";$","", df1$gene_symbol)
> df1
     acc_no peptide uniprot gene_symbol
1 RL4_HUMAN    AAAA     bla         RL4
2 A2M_HUMAN    BBBB     bli         A2M

## now we can merge specifying the respective columns that contain the gene symbols
## there are different ways of doing that, e.g.:
> merge(df2, df1, by.x = "gene.symbol", by.y = "gene_symbol")
  gene.symbol Sequence..TGE.amino.acid.seq    acc_no peptide uniprot
1         A2M                         MNGL A2M_HUMAN    BBBB     bli
2         A2M                         MNGL A2M_HUMAN    BBBB     bli
3         A2M                         MNGL A2M_HUMAN    BBBB     bli

## ...or:
> merge(df2, df1, by.x = "gene.symbol", by.y = "gene_symbol", all = TRUE)
  gene.symbol Sequence..TGE.amino.acid.seq    acc_no peptide uniprot
1         A2M                         MNGL A2M_HUMAN    BBBB     bli
2         A2M                         MNGL A2M_HUMAN    BBBB     bli
3         A2M                         MNGL A2M_HUMAN    BBBB     bli
4       STIP1                          OTL      <NA>    <NA>    <NA>
5         RL4                         <NA> RL4_HUMAN    AAAA     bla
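Since the question also asked for Python, here is a rough pandas counterpart of the two R merges above. It reuses the same dummy data; `how="inner"` plays the role of the default `merge()` and `how="outer"` that of `all = TRUE`. The `indicator=True` column is an extra that the R code does not have, and unlike R, pandas keeps both key columns when their names differ.

```python
import pandas as pd

# Same dummy data as in the R example above.
df1 = pd.DataFrame({
    "acc_no": ["RL4_HUMAN", "A2M_HUMAN"],
    "peptide": ["AAAA", "BBBB"],
    "uniprot": ["bla", "bli"],
    "gene_symbol": ["RL4;", "A2M;"],
})
df2 = pd.DataFrame({
    "gene.symbol": ["A2M", "A2M", "A2M", "STIP1"],
    "Sequence..TGE.amino.acid.seq": ["MNGL", "MNGL", "MNGL", "OTL"],
})

# Strip the trailing semicolon so the keys can match.
df1["gene_symbol"] = df1["gene_symbol"].str.replace(r";$", "", regex=True)

# Keep only genes present in both tables.
inner = df2.merge(df1, left_on="gene.symbol", right_on="gene_symbol", how="inner")

# Keep every row from both tables; _merge records where each row came from.
outer = df2.merge(
    df1, left_on="gene.symbol", right_on="gene_symbol", how="outer", indicator=True
)

print(inner)
print(outer)
```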

NOTES:

I saw that you opened the files in Excel -- do you see how the gene symbol is all garbled up in line 21? That's Excel's doing and I highly recommend you download the original files again. See this post for more details.
You can use the read.table function to read the data into R.
Read the help of the merge function to educate yourself about all the options it has (?merge).
I have no idea whether my solution is what you want because the example you gave above was not detailed enough.
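For the Python route, a hedged sketch of the first two notes: read the tab-delimited file as plain text and flag gene symbols that look like dates. This assumes the garbling is the well-known Excel conversion of symbols into dates; the inline `raw` string, the `Gene.Symbol` and `peptide` columns and the `2-Sep` value are made-up stand-ins for a real file.

```python
import io

import pandas as pd

# Hypothetical tab-delimited content standing in for one of the downloaded files.
raw = "Gene.Symbol\tpeptide\nA2M\tBBBB\n2-Sep\tCCCC\nRL4\tAAAA\n"

# dtype=str keeps every column as text, so nothing is reinterpreted on load.
table = pd.read_csv(io.StringIO(raw), sep="\t", dtype=str)

# Pattern for values such as "2-Sep" or "1-Mar" that look like dates, not gene symbols.
date_like = r"^\d{1,2}-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$"
suspect = table[table["Gene.Symbol"].str.match(date_like)]

print(suspect)
```

With a real file, replace `io.StringIO(raw)` with the file path. Any rows that show up in `suspect` are a sign that the file passed through Excel and should be downloaded again.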