Question

Why are some HUMAnN features in [name]_pathabundance.tsv missing taxonomy?

0

Entering edit mode

6 months ago

O.rka ▴ 710

I'm using HUMAnN 3.8

I made a custom HUMAnN protein database using UniRef50 mappings. My command was the following:

INPUT=veba_output/preprocess/S4/output/joined.fastq.gz
OUTPUT=test_output
DMND_DB_DIR=veba_output/misc/diamond_database/
NUM_THREADS=1

humann --input ${INPUT} --output ${OUTPUT} --protein-database ${DMND_DB_DIR} --threads ${NUM_THREADS} --bypass-nucleotide-search --input-format fastq.gz --translated-identity-threshold 50 --translated-query-coverage-threshold 80 --search-mode uniref50 --id-mapping  veba_output/misc/humann_uniref_annotations.tsv

Everything ran as expected however, I'm not sure why the output is missing taxonomy when each protein contains the taxonomy info.

(base) [jespinoz@login02 TestVEBA]$ head veba_output/misc/humann_uniref_annotations.tsv
S1__NODE_166_length_33212_cov_10.244534_32964:33209(+)  UniRef50_Q41093 82  c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_229_length_28347_cov_10.467447_5060:5751(-)    UniRef50_Q41093 196 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_298_length_24477_cov_10.374539_11070:11663(+)  UniRef50_Q41093 198 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_56_length_52126_cov_10.433043_49957:50562(+)   UniRef50_Q41093 202 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_56_length_52126_cov_10.433043_50725:52123(-)   UniRef50_Q41093 338 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_743_length_9370_cov_9.617391_5:550(+)  UniRef50_Q41093 182 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_14_length_79184_cov_10.324875_17207:17686(+)   UniRef50_A0A1Z5JXG7 160 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_646_length_11435_cov_10.486731_5428:5952(-)    UniRef50_A0A1Z5JXG7 175 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_674_length_10805_cov_9.362977_3661:4200(-) UniRef50_A0A1Z5JXG7 180 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_8_length_86326_cov_10.138192_57293:57772(+)    UniRef50_A0A1Z5JXG7 160 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1

Check it out:

(base) [jespinoz@login02 TestVEBA]$ head -n 20 test_output/joined_pathabundance.tsv
# Pathway   joined_Abundance
UNMAPPED    280562.1981557097
UNINTEGRATED    503525.8301515482
UNINTEGRATED|c__Bacillariophyceae;o__Bacillariales;f__Bacillariaceae;g__;s__;t__S4__METABAT2__E.1__bin.2    557698.9739954222
UNINTEGRATED|c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1    22571.9607414017
UNINTEGRATED|c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S2__METABAT2__E.1__bin.2    22562.8012692604
UNINTEGRATED|d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae;g__Aliishimia;s__;t__S3__METABAT2__P.1__bin.4  9215.3696142461
UNINTEGRATED|d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae;g__JALEFY01;s__;t__S3__MAXBIN2-107__P.1__bin.005   820.2588789283
LIPASYN-PWY: phospholipases 240.5774367950
LIPASYN-PWY: phospholipases|c__Bacillariophyceae;o__Bacillariales;f__Bacillariaceae;g__;s__;t__S4__METABAT2__E.1__bin.2 196.3624498706
LIPASYN-PWY: phospholipases|c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1 22.1074934622
LIPASYN-PWY: phospholipases|c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S2__METABAT2__E.1__bin.2 22.1074934622
PANTO-PWY: phosphopantothenate biosynthesis I   137.1807208031
PWY-7208: superpathway of pyrimidine nucleobases salvage    133.8841119258
PWY-5667: CDP-diacylglycerol biosynthesis I 111.6144003857
PWY0-1319: CDP-diacylglycerol biosynthesis II   111.6144003857
P41-PWY: pyruvate fermentation to acetate and (S)-lactate I 102.5586344296
PWY-5100: pyruvate fermentation to acetate and lactate II   102.5586344296
PWY-5100: pyruvate fermentation to acetate and lactate II|c__Bacillariophyceae;o__Bacillariales;f__Bacillariaceae;g__;s__;t__S4__METABAT2__E.1__bin.2   83.8098433960
SER-GLYSYN-PWY: superpathway of L-serine and glycine biosynthesis I 97.7258146256

LIPASYN-PWY: phospholipases makes sense because it's literally just the sum of the other LIPASYN-PWY: phospholipases that have taxonomy.

However, this one doesn't have any taxonomy anywhere in the file:

PANTO-PWY: phosphopantothenate biosynthesis I   137.1807208031

Then there is this one:

(base) [jespinoz@login02 TestVEBA]$ cat test_output/joined_pathabundance.tsv | grep "PWY-3781"
PWY-3781: aerobic respiration I (cytochrome c)  36.1384257683
PWY-3781: aerobic respiration I (cytochrome c)|d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae;g__Aliishimia;s__;t__S3__METABAT2__P.1__bin.4    17.5489594619
PWY-3781: aerobic respiration I (cytochrome c)|d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae;g__JALEFY01;s__;t__S3__MAXBIN2-107__P.1__bin.005 2.3236692546

Where the ones with taxonomy don't sum up to the one without taxonomy.

My main question is why some don't have taxonomy assigned to them when all of the protein identifiers are linked to a specific taxa in the --id-mapping table.

metagenomics metabolism humann biobakery metatranscriptomics • 545 views

ADD COMMENT • link 6 months ago by O.rka ▴ 710

score 0 · Answer 1 · 2023-10-12

0

Entering edit mode

6 months ago

yl759 ▴ 120

The answer to your question from HUMAnN GitHub page:

Unlike gene abundance, a pathway's community-level abundance is not necessarily the sum of its stratified abundance values. For example, continuing with the simple linear pathway example introduced above, if the abundances of RXNs 1-4 are [5, 5, 10, 10] in Species_A and [10, 10, 5, 5] in Species_B, HUMAnN would report that Species_A and Species_B each contribute 5 complete copies of the pathway. However, at the community level, the reaction totals are [15, 15, 15, 15], and thus HUMAnN would report 15 complete copies.

ADD COMMENT • link 6 months ago by yl759 ▴ 120

0

Entering edit mode

Thanks for looking into this. However, it still doesn't explain why there isn't any taxonomy coupled to some of the pathways. I understand why the values don't sum up but i'm not sure why taxonomy is missing when it's been provided.

ADD REPLY • link 6 months ago by O.rka ▴ 710