Question: Clump gene entries for metagenomic data/ humann2
7 months ago, xinyuanzhang22 wrote:

Hi, I was running association tests on gene abundance data from humann2 output, and my top results contain hundreds of gene entries. By mapping them back to UniProt, I can see which species they come from and which proteins they encode. Is there a way to clump together all gene entries that encode the same protein, so that the results are easier to annotate?

Here is an example of 4 gene entries from different organisms that all encode an HTH cro/C1-type domain-containing protein. Thanks!

  • R5DNF1 Parabacteroides johnsonii CAG:246
  • R6K7I4 Eubacterium sp. CAG:252
  • H1CLY0 Lachnospiraceae bacterium 7_1_58FAA
  • D4C9T9 Clostridium sp. M62/1


7 months ago, zorbax wrote:

If you have the 'Protein names' and 'Organism' columns in one table, you can group the rows by protein name with pandas:

df.groupby('Protein names')['Organism'].agg(lambda col: ','.join(col)).reset_index()

For example, take this table, grouped by 'class' with the 'order' values joined:

class   order
bird    Falconiformes
bird    Psittaciformes
mammal  Carnivora
mammal  Primates
mammal  Carnivora

You will get something like this:

class   order
bird    Falconiformes,Psittaciformes
mammal  Carnivora,Primates,Carnivora
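
A self-contained version of the toy example above, as a minimal sketch. It assumes pandas is installed; the column names 'class' and 'order' come from the toy table, and for a UniProt export you would swap in 'Protein names' and 'Organism'.

```python
import pandas as pd

# Toy table from the example above.
df = pd.DataFrame({
    "class": ["bird", "bird", "mammal", "mammal", "mammal"],
    "order": ["Falconiformes", "Psittaciformes", "Carnivora", "Primates", "Carnivora"],
})

# Group rows by 'class' and join the 'order' values of each group with commas.
grouped = (
    df.groupby("class")["order"]
      .agg(lambda col: ",".join(col))
      .reset_index()
)

print(grouped.to_string(index=False))
```

Duplicates are kept (Carnivora appears twice for mammal); joining `sorted(set(col))` instead of `col` would collapse them.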

Hi, thanks! The problem is that the protein names are usually not exactly the same, so it's a bit hard to do this for all proteins...

Reply written 7 months ago by xinyuanzhang22
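
One possible way to handle names that differ only in case, spacing, or punctuation is to group on a normalized key rather than the raw name. The sketch below is only an illustration: `normalize_name` is a hypothetical helper, and the spelling variants in the 'Protein names' column are invented for the demo (the four entries and organisms are the ones from the question). It will not merge names that are genuinely worded differently.

```python
import re
import pandas as pd

def normalize_name(name: str) -> str:
    """Hypothetical helper: lowercase and collapse every run of
    non-alphanumeric characters into a single space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

# Entries and organisms are from the question; the name variants are
# made up to show the kind of mismatch that normalization can absorb.
df = pd.DataFrame({
    "Entry": ["R5DNF1", "R6K7I4", "H1CLY0", "D4C9T9"],
    "Protein names": [
        "HTH cro/C1-type domain-containing protein",
        "HTH cro/C1-type domain-containing protein ",
        "HTH Cro/C1-type domain-containing protein",
        "HTH cro/C1-type domain containing protein",
    ],
    "Organism": [
        "Parabacteroides johnsonii CAG:246",
        "Eubacterium sp. CAG:252",
        "Lachnospiraceae bacterium 7_1_58FAA",
        "Clostridium sp. M62/1",
    ],
})

# Group on the normalized key instead of the raw protein name.
df["name_key"] = df["Protein names"].map(normalize_name)
clumped = (
    df.groupby("name_key")["Entry"]
      .agg(lambda col: ",".join(col))
      .reset_index()
)

print(clumped.to_string(index=False))
```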