Question: How to assign matching unique IDs to two datasets so that I can merge them correctly?
26 days ago by
ishackm70
ishackm70 wrote:

Hi all,

I have the following two datasets:

ECM Proteomics Dataset

ECM Proteomics Data

ECM Isoform Dataset

I would like to merge the TGE amino acid sequences with the peptide sequences, but the two datasets do not share a unique identifier for each row.

How can I create a unique identifier for each row in each dataset, so that I can merge each amino acid sequence with the correct peptide sequence?


For the ECM Proteomics Dataset, I did the following in pandas. This created a unique ID based on all the columns:

ECM['id'] = ECM.groupby(['Gene.Symbol','Division','Category','PI','Protein.Name..name.of.reference.protein.',
                           'Protein.description','Sequence..TGE.amino.acid.seq.']).ngroup()

How can I assign exactly the same unique IDs to the matching rows in the other dataset?
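A minimal sketch of one way to get matching IDs, under a loud assumption: both tables carry a gene-symbol column, and that column is the only information they share. `groupby(...).ngroup()` numbers groups within a single table, so the same columns would have to exist in the other table to reproduce its numbering there. The toy values and the names `isoforms`, `peptides` and `symbol_to_id` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the two tables; names and values are assumptions.
isoforms = pd.DataFrame({
    "Gene.Symbol": ["A2M", "A2M", "STIP1"],
    "Sequence..TGE.amino.acid.seq.": ["MNGL", "MNGK", "OTL"],
})
peptides = pd.DataFrame({
    "gene_symbol": ["RL4", "A2M", "A2M"],
    "peptide": ["AAAA", "BBBB", "CCCC"],
})

# Build ONE lookup from every symbol seen in either table, so that a given
# symbol maps to the same integer in both tables.
all_symbols = pd.concat([isoforms["Gene.Symbol"], peptides["gene_symbol"]])
_, uniques = pd.factorize(all_symbols, sort=True)
symbol_to_id = {symbol: i for i, symbol in enumerate(uniques)}

isoforms["id"] = isoforms["Gene.Symbol"].map(symbol_to_id)
peptides["id"] = peptides["gene_symbol"].map(symbol_to_id)
```

An ID built this way identifies a gene, not a row, so it adds nothing that merging directly on the gene symbol would not already give.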

Examples in R or Python would be greatly appreciated.

Kind Regards,

Ishack

python R • 225 views
ADD COMMENTlink modified 26 days ago • written 26 days ago by ishackm70


Please show us what you've tried using R/python and we can help you get over any obstacles. We'll be unable to do your work for you, though.

ADD REPLYlink modified 26 days ago • written 26 days ago by RamRS22k

Hi Ram,

Sorry about the incomplete post.

For the ECM Proteomics Dataset, I did the following in Pandas. This created a unique ID based on all the columns

ECM['id'] = ECM.groupby(['Gene.Symbol','Division','Category','PI','Protein.Name..name.of.reference.protein.',
                           'Protein.description','Sequence..TGE.amino.acid.seq.']).ngroup()

How can I assign exactly the same unique IDs to the matching rows in the other dataset?

ADD REPLYlink modified 26 days ago • written 26 days ago by ishackm70

I'm sorry, I don't know pandas. Maybe someone else who knows pandas can help you out. In the meantime, I'd recommend editing your post and adding the content from your comment there.

ADD REPLYlink written 26 days ago by RamRS22k

OK, can you please show me how to do it in R?

ADD REPLYlink written 26 days ago by ishackm70

Can anyone please help?

ADD REPLYlink written 26 days ago by ishackm70

ishackm: you will gain much respect by going away for a few days and trying this on your own. Asking questions like "Can anyone please help?" seems somewhat desperate (?) Fair is fair - we have all been where you are right now.

Edit: Although I say 'on your own', there is more than enough material on the World Wide Web for you to search and, in that process, teach yourself.

ADD REPLYlink modified 26 days ago • written 26 days ago by Kevin Blighe45k

I do apologise, Kevin. It's just that I have been trying to solve this problem for 3 days now and have got nowhere, which is why I asked this question.

ADD REPLYlink written 26 days ago by ishackm70

All is fine. Do not worry.

ADD REPLYlink written 26 days ago by Kevin Blighe45k

So is there any library in R that can help me do this?

I want to merge every amino acid sequence with all the possible peptides related to that particular gene

ADD REPLYlink written 26 days ago by ishackm70

Hey dude / dudette. I was working. If I had to do this in R, I would, first, find the reference data that I need outside R, input this to R, and then do the processing there.

These also look promising:

In fact, there seems to be 'Pep' this and 'pep' that... lots of programs. That's Pep-tastic!

ADD REPLYlink written 26 days ago by Kevin Blighe45k

I'm not sure I get it. Can you give an example of two lines (one from each file) that you would merge and tell us based on which information you would want to merge it? I am guessing, the protein ID will be of significance...?

ADD REPLYlink written 26 days ago by Friederike4.5k

For example, this is one isoform (ECM TGE)

I would like to merge same isoform with all the A2M peptides:

So result is like this please:

enter image description here

I would like something like this for every gene, please: the same amino acid sequence paired with each peptide for that gene.

How can I do that, please?

ADD REPLYlink modified 26 days ago • written 26 days ago by ishackm70

Both datasets have Gene.Symbol. Can't you use that to pull the columns you need from the other dataset? My guess is that the two datasets have different numbers of rows per Gene.Symbol, so you may not be able to combine them row for row, but you will at least be able to separate the information per gene.
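To check that guess before merging, the rows per gene symbol can be counted in each table. The sketch below uses made-up data and hypothetical names (`isoforms`, `peptides`, `counts`), and assumes both tables call the column `Gene.Symbol`. For a plain inner merge on that column, each gene contributes (isoform rows) x (peptide rows) rows to the result.

```python
import pandas as pd

# Toy stand-ins for the two tables; the values are assumptions.
isoforms = pd.DataFrame({"Gene.Symbol": ["A2M", "A2M", "A2M", "STIP1"]})
peptides = pd.DataFrame({"Gene.Symbol": ["RL4", "A2M", "A2M"]})

# Rows per gene symbol in each table, side by side (0 where a gene is absent).
counts = (
    pd.concat(
        [
            isoforms["Gene.Symbol"].value_counts().rename("n_isoform_rows"),
            peptides["Gene.Symbol"].value_counts().rename("n_peptide_rows"),
        ],
        axis=1,
    )
    .fillna(0)
    .astype(int)
    .sort_index()
)

# An inner merge pairs every isoform row with every peptide row of the same gene.
counts["n_merged_rows"] = counts["n_isoform_rows"] * counts["n_peptide_rows"]

merged = isoforms.merge(peptides, on="Gene.Symbol")
```

Genes with a zero in either count column drop out of an inner merge, which is a quick way to spot symbols present in only one file.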

ADD REPLYlink written 25 days ago by prabin.dm160
25 days ago by
Friederike4.5k
United States
Friederike4.5k wrote:

Maybe something like this:

## making dummy data -- you would have to read in those tab files (SEE MY NOTE!)
> df1 <- data.frame(acc_no = c("RL4_HUMAN", "A2M_HUMAN"), peptide = c("AAAA","BBBB"), uniprot = c("bla","bli"), gene_symbol = c("RL4;","A2M;"))
> df1
     acc_no peptide uniprot gene_symbol
1 RL4_HUMAN    AAAA     bla        RL4;
2 A2M_HUMAN    BBBB     bli        A2M;

> df2 <- data.frame(gene.symbol = c("A2M", "A2M","A2M","STIP1"), Sequence..TGE.amino.acid.seq = c("MNGL","MNGL","MNGL","OTL"))

## need to make sure that the semicolons aren't going to mess with our merge
> df1$gene_symbol <- gsub(";$","", df1$gene_symbol)
> df1
     acc_no peptide uniprot gene_symbol
1 RL4_HUMAN    AAAA     bla         RL4
2 A2M_HUMAN    BBBB     bli         A2M

## now we can merge specifying the respective columns that contain the gene symbols
## there are different ways of doing that, e.g.:
> merge(df2, df1, by.x = "gene.symbol", by.y = "gene_symbol")
  gene.symbol Sequence..TGE.amino.acid.seq    acc_no peptide uniprot
1         A2M                         MNGL A2M_HUMAN    BBBB     bli
2         A2M                         MNGL A2M_HUMAN    BBBB     bli
3         A2M                         MNGL A2M_HUMAN    BBBB     bli

## ...or:
> merge(df2, df1, by.x = "gene.symbol", by.y = "gene_symbol", all = TRUE)
  gene.symbol Sequence..TGE.amino.acid.seq    acc_no peptide uniprot
1         A2M                         MNGL A2M_HUMAN    BBBB     bli
2         A2M                         MNGL A2M_HUMAN    BBBB     bli
3         A2M                         MNGL A2M_HUMAN    BBBB     bli
4       STIP1                          OTL      <NA>    <NA>    <NA>
5         RL4                         <NA> RL4_HUMAN    AAAA     bla

NOTES:

  1. I saw that you opened the files in Excel -- do you see how the gene symbol is all garbled up in line 21? That's Excel's doing and I highly recommend you download the original files again. See this post for more details.
  2. You can use the read.table function to read the data into R.
  3. Read the help of the merge function to educate yourself about all the options it has (?merge).
  4. I have no idea whether my solution is what you want because the example you gave above was not detailed enough.
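For readers working in pandas, as in the question, a rough equivalent of the R steps above might look like the sketch below. It is only a sketch: the in-memory tab-separated strings stand in for the real files, the variable names are hypothetical, and `dtype=str` is a choice made here to keep every column as text.

```python
import io

import pandas as pd

# Hypothetical tab-separated contents standing in for the two downloaded files.
peptide_tsv = "\n".join([
    "acc_no\tpeptide\tuniprot\tgene_symbol",
    "RL4_HUMAN\tAAAA\tbla\tRL4;",
    "A2M_HUMAN\tBBBB\tbli\tA2M;",
])
isoform_tsv = "\n".join([
    "gene.symbol\tSequence..TGE.amino.acid.seq",
    "A2M\tMNGL",
    "A2M\tMNGL",
    "A2M\tMNGL",
    "STIP1\tOTL",
])

# Counterpart of read.table: read tab-separated text, keeping all columns as strings.
df1 = pd.read_csv(io.StringIO(peptide_tsv), sep="\t", dtype=str)
df2 = pd.read_csv(io.StringIO(isoform_tsv), sep="\t", dtype=str)

# Strip the trailing semicolon so the gene symbols match across tables.
df1["gene_symbol"] = df1["gene_symbol"].str.replace(r";$", "", regex=True)

# Counterpart of merge(df2, df1, by.x = ..., by.y = ...): matching genes only.
inner = df2.merge(df1, left_on="gene.symbol", right_on="gene_symbol")

# Counterpart of all = TRUE: keep unmatched rows from both sides;
# the indicator column records which table each row came from.
outer = df2.merge(
    df1, left_on="gene.symbol", right_on="gene_symbol",
    how="outer", indicator=True,
)
```

Unlike R's `merge` with `by.x`/`by.y`, pandas keeps both key columns (`gene.symbol` and `gene_symbol`) in the result when they have different names, so one of them can be dropped afterwards.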
ADD COMMENTlink written 25 days ago by Friederike4.5k

Thanks, Friederike, after many days it finally works! Thanks to all as well.

ADD REPLYlink written 20 days ago by ishackm70