Clump gene entries for metagenomic data/ humann2
4.2 years ago

Hi, I was running association analyses on gene abundance data from humann2 output, and the top results contain hundreds of gene entries. By mapping them back to UniProt, I can see which species each entry comes from and which protein it encodes. Is there a way to clump together all gene entries that encode the same protein, for better annotation of the results?

An example: 4 gene entries from different organisms that all encode an HTH cro/C1-type domain-containing protein. Thanks!

  • R5DNF1 Parabacteroides johnsonii CAG:246
  • R6K7I4 Eubacterium sp. CAG:252
  • H1CLY0 Lachnospiraceae bacterium 7_1_58FAA
  • D4C9T9 Clostridium sp. M62/1


uniprot metagenomics humann2 • 840 views
4.1 years ago
zorbax ▴ 610

If you have the 'Protein names' and 'Organism' columns in one table, you can use pandas to group the rows by protein name and join the organisms:

df.groupby('Protein names')['Organism'].agg(lambda col: ','.join(col)).reset_index()

For this table:

class   order
bird    Falconiformes
bird    Psittaciformes
mammal  Carnivora
mammal  Primates
mammal  Carnivora

You will get something like this:

class   order
bird    Falconiformes,Psittaciformes
mammal  Carnivora,Primates,Carnivora
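
For anyone who wants to try this directly, here is a minimal, self-contained sketch of the toy example above. The DataFrame is built by hand purely for illustration; with real data you would load your own table and swap in the 'Protein names' and 'Organism' columns.

```python
import pandas as pd

# Toy table from the example above, built by hand for illustration.
df = pd.DataFrame({
    "class": ["bird", "bird", "mammal", "mammal", "mammal"],
    "order": ["Falconiformes", "Psittaciformes", "Carnivora", "Primates", "Carnivora"],
})

# One row per class, with all of its orders joined into a single string.
grouped = (
    df.groupby("class")["order"]
      .agg(lambda col: ",".join(col))
      .reset_index()
)

print(grouped)
```

Duplicates such as the second Carnivora are kept as-is; if that is not wanted, joining `col.unique()` instead of `col` should drop them.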

Hi, thanks! The problem is that the protein names are usually not exactly the same, so it's a bit hard to do this for all proteins...
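
One possible workaround, sketched here under assumptions, is to group on a normalized version of the name instead of the raw string. The `normalize_name` helper, the list of qualifier words, and the example entries below are all hypothetical and made up for illustration; this only catches differences in case, punctuation, whitespace, and a few leading qualifiers, not real synonyms.

```python
import re
from collections import defaultdict

# Hypothetical qualifier words to ignore when comparing names; adjust for your data.
QUALIFIERS = {"putative", "probable", "predicted", "possible"}

def normalize_name(name: str) -> str:
    """Return a crude grouping key for a protein name."""
    # Lowercase, then turn every run of non-alphanumeric characters into a space.
    tokens = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    # Drop qualifier words so "Putative X" and "X" share a key.
    return " ".join(t for t in tokens if t not in QUALIFIERS)

# Made-up (entry, protein name) pairs showing typical small variations.
rows = [
    ("entry_1", "HTH cro/C1-type domain-containing protein"),
    ("entry_2", "Putative HTH cro/C1-type domain-containing protein"),
    ("entry_3", "HTH cro/C1-type domain-containing protein "),
    ("entry_4", "hth Cro/C1-type domain-containing protein"),
    ("entry_5", "ABC transporter permease"),
]

# Collect entries under their normalized name.
groups = defaultdict(list)
for entry, name in rows:
    groups[normalize_name(name)].append(entry)

for key, entries in groups.items():
    print(key, "->", ",".join(entries))
```

The same key could be added as an extra column to the pandas table and used in the `groupby` instead of 'Protein names'.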
