Question: Clump gene entries for metagenomic data/ humann2
5 weeks ago by
xinyuanzhang220 wrote:

Hi, I was running some association analyses using gene abundance data from HUMAnN2 output, and the top results include hundreds of gene entries. By mapping them back to UniProt, I get information on which species they come from and which proteins they encode. Is there a way to clump together all gene entries that encode the same protein, for better annotation of the results?

An example: 4 gene entries from different organisms that all encode an HTH cro/C1-type domain-containing protein. Thanks!

  • R5DNF1 Parabacteroides johnsonii CAG:246
  • R6K7I4 Eubacterium sp. CAG:252
  • H1CLY0 Lachnospiraceae bacterium 7_1_58FAA
  • D4C9T9 Clostridium sp. M62/1

4 weeks ago by
zorbax40
Mexico
zorbax40 wrote:

If you have the 'Protein names' and 'Organism' columns in one table you can use Pandas.

df.groupby('Protein names')['Organism'].agg(lambda col: ','.join(col)).reset_index()

For this table:

class   order
bird    Falconiformes
bird    Psittaciformes
mammal  Carnivora
mammal  Primates
mammal  Carnivora

You will get something like this:

class   order
bird    Falconiformes,Psittaciformes
mammal  Carnivora,Primates,Carnivora
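
As a minimal runnable sketch of the toy example above: here the `class` and `order` columns stand in for 'Protein names' and 'Organism', and the DataFrame is built by hand only for illustration (in practice it would presumably come from the UniProt mapping table).

```python
import pandas as pd

# Toy table from the example above; 'class' plays the role of
# 'Protein names' and 'order' the role of 'Organism'.
df = pd.DataFrame({
    "class": ["bird", "bird", "mammal", "mammal", "mammal"],
    "order": ["Falconiformes", "Psittaciformes", "Carnivora",
              "Primates", "Carnivora"],
})

# Collapse all 'order' values sharing a 'class' into one comma-separated string.
clumped = df.groupby("class")["order"].agg(",".join).reset_index()

print(clumped.to_string(index=False))
```

Duplicates such as the repeated Carnivora are kept; if unique values are wanted, one option is to call `drop_duplicates()` on the two columns before grouping.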

Hi, thanks! The problem is that the protein names are usually not exactly the same, so it's a bit hard to do this for all proteins...

written 4 weeks ago by xinyuanzhang22
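
One possible way to handle names that differ only in case, punctuation, or spacing is to normalize them before grouping. The sketch below is not from the thread: `normalize_name` is a hypothetical helper, and the entry IDs and name variants are invented for illustration. It would not catch names that differ in actual wording, which would need fuzzy matching or a shared cluster identifier instead.

```python
import re
from collections import defaultdict


def normalize_name(name):
    """Hypothetical helper: lowercase, replace punctuation runs with a space, trim."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


# Invented (entry, protein name) pairs; the name variants are illustrative only.
records = [
    ("entry_1", "HTH cro/C1-type domain-containing protein"),
    ("entry_2", "HTH cro/C1-type domain containing protein"),
    ("entry_3", "hth Cro/C1-type  domain-containing protein "),
    ("entry_4", "Some other protein"),
]

# Group entry IDs by the normalized form of their protein name.
groups = defaultdict(list)
for entry, name in records:
    groups[normalize_name(name)].append(entry)

for key, entries in groups.items():
    print(key, entries)
```

With a pandas table, the same idea would be to store the normalized name in a new column and group on that column rather than on 'Protein names'.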