I'm using HUMAnN 3.8
I made a custom HUMAnN protein database using UniRef50 mappings. My command was the following:
INPUT=veba_output/preprocess/S4/output/joined.fastq.gz
OUTPUT=test_output
DMND_DB_DIR=veba_output/misc/diamond_database/
NUM_THREADS=1
humann --input ${INPUT} --output ${OUTPUT} --protein-database ${DMND_DB_DIR} --threads ${NUM_THREADS} --bypass-nucleotide-search --input-format fastq.gz --translated-identity-threshold 50 --translated-query-coverage-threshold 80 --search-mode uniref50 --id-mapping veba_output/misc/humann_uniref_annotations.tsv
Everything ran as expected however, I'm not sure why the output is missing taxonomy when each protein contains the taxonomy info.
(base) [jespinoz@login02 TestVEBA]$ head veba_output/misc/humann_uniref_annotations.tsv
S1__NODE_166_length_33212_cov_10.244534_32964:33209(+) UniRef50_Q41093 82 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_229_length_28347_cov_10.467447_5060:5751(-) UniRef50_Q41093 196 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_298_length_24477_cov_10.374539_11070:11663(+) UniRef50_Q41093 198 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_56_length_52126_cov_10.433043_49957:50562(+) UniRef50_Q41093 202 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_56_length_52126_cov_10.433043_50725:52123(-) UniRef50_Q41093 338 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_743_length_9370_cov_9.617391_5:550(+) UniRef50_Q41093 182 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_14_length_79184_cov_10.324875_17207:17686(+) UniRef50_A0A1Z5JXG7 160 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_646_length_11435_cov_10.486731_5428:5952(-) UniRef50_A0A1Z5JXG7 175 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_674_length_10805_cov_9.362977_3661:4200(-) UniRef50_A0A1Z5JXG7 180 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
S1__NODE_8_length_86326_cov_10.138192_57293:57772(+) UniRef50_A0A1Z5JXG7 160 c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1
Check it out:
(base) [jespinoz@login02 TestVEBA]$ head -n 20 test_output/joined_pathabundance.tsv
# Pathway joined_Abundance
UNMAPPED 280562.1981557097
UNINTEGRATED 503525.8301515482
UNINTEGRATED|c__Bacillariophyceae;o__Bacillariales;f__Bacillariaceae;g__;s__;t__S4__METABAT2__E.1__bin.2 557698.9739954222
UNINTEGRATED|c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1 22571.9607414017
UNINTEGRATED|c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S2__METABAT2__E.1__bin.2 22562.8012692604
UNINTEGRATED|d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae;g__Aliishimia;s__;t__S3__METABAT2__P.1__bin.4 9215.3696142461
UNINTEGRATED|d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae;g__JALEFY01;s__;t__S3__MAXBIN2-107__P.1__bin.005 820.2588789283
LIPASYN-PWY: phospholipases 240.5774367950
LIPASYN-PWY: phospholipases|c__Bacillariophyceae;o__Bacillariales;f__Bacillariaceae;g__;s__;t__S4__METABAT2__E.1__bin.2 196.3624498706
LIPASYN-PWY: phospholipases|c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S1__METABAT2__E.1__bin.1 22.1074934622
LIPASYN-PWY: phospholipases|c__Bacillariophyceae;o__Naviculales;f__Phaeodactylaceae;g__Phaeodactylum;s__Phaeodactylum tricornutum [CCAP 1055/1];t__S2__METABAT2__E.1__bin.2 22.1074934622
PANTO-PWY: phosphopantothenate biosynthesis I 137.1807208031
PWY-7208: superpathway of pyrimidine nucleobases salvage 133.8841119258
PWY-5667: CDP-diacylglycerol biosynthesis I 111.6144003857
PWY0-1319: CDP-diacylglycerol biosynthesis II 111.6144003857
P41-PWY: pyruvate fermentation to acetate and (S)-lactate I 102.5586344296
PWY-5100: pyruvate fermentation to acetate and lactate II 102.5586344296
PWY-5100: pyruvate fermentation to acetate and lactate II|c__Bacillariophyceae;o__Bacillariales;f__Bacillariaceae;g__;s__;t__S4__METABAT2__E.1__bin.2 83.8098433960
SER-GLYSYN-PWY: superpathway of L-serine and glycine biosynthesis I 97.7258146256
LIPASYN-PWY: phospholipases
makes sense because it's literally just the sum of the other LIPASYN-PWY: phospholipases
that have taxonomy.
However, this one doesn't have any taxonomy anywhere in the file:
PANTO-PWY: phosphopantothenate biosynthesis I 137.1807208031
Then there is this one:
(base) [jespinoz@login02 TestVEBA]$ cat test_output/joined_pathabundance.tsv | grep "PWY-3781"
PWY-3781: aerobic respiration I (cytochrome c) 36.1384257683
PWY-3781: aerobic respiration I (cytochrome c)|d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae;g__Aliishimia;s__;t__S3__METABAT2__P.1__bin.4 17.5489594619
PWY-3781: aerobic respiration I (cytochrome c)|d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae;g__JALEFY01;s__;t__S3__MAXBIN2-107__P.1__bin.005 2.3236692546
Where the ones with taxonomy don't sum up to the one without taxonomy.
My main question is why some don't have taxonomy assigned to them when all of the protein identifiers are linked to a specific taxa in the --id-mapping table.
Thanks for looking into this. However, it still doesn't explain why there isn't any taxonomy coupled to some of the pathways. I understand why the values don't sum up but i'm not sure why taxonomy is missing when it's been provided.