Question: Clump gene entries for metagenomic data / HUMAnN2
xinyuanzhang22 wrote (7 months ago):

Hi, I was running association tests on gene abundance data from HUMAnN2 output, and the top results contain hundreds of gene entries. By mapping them back to UniProt, I can see which species each entry comes from and which protein it encodes. Is there a way to clump together all gene entries that encode the same protein? That would make the results easier to annotate.

Here is an example of 4 gene entries from different organisms that all encode an HTH cro/C1-type domain-containing protein. Thanks!

  • R5DNF1 Parabacteroides johnsonii CAG:246
  • R6K7I4 Eubacterium sp. CAG:252
  • H1CLY0 Lachnospiraceae bacterium 7_1_58FAA
  • D4C9T9 Clostridium sp. M62/1


zorbax (Mexico) wrote (7 months ago):

If you have the 'Protein names' and 'Organism' columns in one table, you can group the rows by protein name with pandas:

df.groupby('Protein names')['Organism'].agg(lambda col: ','.join(col)).reset_index()

For this table (the same call, with 'class' in place of 'Protein names' and 'order' in place of 'Organism'):

class   order
bird    Falconiformes
bird    Psittaciformes
mammal  Carnivora
mammal  Primates
mammal  Carnivora

You will get something like this:

class   order
bird    Falconiformes,Psittaciformes
mammal  Carnivora,Primates,Carnivora
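
For anyone who wants to run the toy example, here is a minimal self-contained sketch. The DataFrame is just the table above typed in by hand. The second call is a variant that is not part of the answer: it drops repeated values within a group (note that Carnivora appears twice for mammal), which may be what you want when the same organism shows up more than once.

```python
import pandas as pd

# Toy table from the answer above; 'class' plays the role of 'Protein names'
# and 'order' plays the role of 'Organism'.
df = pd.DataFrame({
    "class": ["bird", "bird", "mammal", "mammal", "mammal"],
    "order": ["Falconiformes", "Psittaciformes", "Carnivora", "Primates", "Carnivora"],
})

# Same call as in the answer: one row per group, values joined with commas.
joined = df.groupby("class")["order"].agg(lambda col: ",".join(col)).reset_index()

# Variant (an addition, not from the answer): drop repeated values within
# each group while keeping the order in which they first appear.
unique_joined = (
    df.groupby("class")["order"]
    .agg(lambda col: ",".join(dict.fromkeys(col)))
    .reset_index()
)

print(joined)
print(unique_joined)
```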

xinyuanzhang22 replied (7 months ago):

Hi, thanks! The problem is that the protein names are usually not exactly the same, so it is hard to do this for all proteins.
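
One possible way around names that differ only slightly is to reduce each name to a rough key before grouping. The sketch below is only a heuristic under stated assumptions: the entry IDs and name variants are made up for illustration (they are not real UniProt records), and the `normalize_name` helper and the `QUALIFIERS` list are hypothetical and would need tuning for real data.

```python
import re
from collections import defaultdict

# Hypothetical qualifier words to strip; extend this for your own data.
QUALIFIERS = ("putative", "probable", "predicted")


def normalize_name(name: str) -> str:
    """Reduce a protein name to a rough, lowercase grouping key."""
    key = name.lower()
    key = re.sub(r"\([^)]*\)", " ", key)   # drop parenthesised tags such as "(Fragment)"
    key = key.replace("-containing", "")    # "domain-containing protein" -> "domain protein"
    words = [w for w in key.split() if w not in QUALIFIERS]
    return " ".join(words)


# Made-up entries: IDs and name variants are illustrative only.
entries = [
    ("entry_1", "HTH cro/C1-type domain-containing protein"),
    ("entry_2", "Putative HTH cro/C1-type domain protein"),
    ("entry_3", "HTH cro/C1-type domain-containing protein (Fragment)"),
    ("entry_4", "ABC transporter permease"),
]

# Collect entry IDs under their normalized name.
clumps = defaultdict(list)
for entry_id, name in entries:
    clumps[normalize_name(name)].append(entry_id)

for key, ids in clumps.items():
    print(key, ids)
```

The same key could be stored as an extra column in the table and passed to groupby in place of 'Protein names'. Name-based keys can merge proteins that are not really the same or miss ones that are, so the resulting clumps are worth checking by eye.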