Question

Tools for raw reads that estimate the percentage of reads that hit marker genes using protein BLAST or more advanced approaches based on protein HMM profiles of marker PFAM domains

0

1 day ago

shevch2009 ▴ 20

Hello, all!

I am looking for tools that work on raw shotgun reads and estimate the percentage of reads that hit marker genes, using protein BLAST or more advanced approaches based on protein HMM profiles of marker PFAM domains. Are there any pipelines for this purpose?

I was advised to use HUMAnN, but it is not working. I have opened a topic about it: https://forum.biobakery.org/t/errors-with-humann-metaphlan/8447

However, it looks like there will be no responses, and the issue is not new.

Most of the tools I have checked work only on MAGs or contigs.

I would appreciate any suggestions,

Thanks, Best, Alla

data raw reads shotgun • 2.7k views

ADD COMMENT • link updated 3 hours ago by Mensur Dlakic ★ 29k • written 1 day ago by shevch2009 ▴ 20

0

If you just want fast translated searches, you could use DIAMOND (https://github.com/bbuchfink/diamond), as long as you have the necessary resources available.

ADD REPLY • link 1 day ago by GenoMax 153k

0

Thanks. But it looks like it takes a single input file rather than raw paired reads; I think it is meant more for contigs.

ADD REPLY • link 17 hours ago by shevch2009 ▴ 20

1

Only one of the files should be enough to give you an idea of what that fragment (read-pair) likely is from, which is what you seem to want to know.

ADD REPLY • link 15 hours ago by GenoMax 153k
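
The suggestion above can be sketched as a small Python helper that assembles a DIAMOND blastx call on the forward read file only. The file names, the e-value cutoff, the thread count and the helper name `build_diamond_cmd` are assumptions for illustration; the database is assumed to have been built beforehand with `diamond makedb` from the marker protein sequences.

```python
import shlex


def build_diamond_cmd(reads, db, out, evalue=1e-5, threads=8):
    """Assemble a DIAMOND blastx command for one read file (hypothetical helper)."""
    return [
        "diamond", "blastx",
        "--query", reads,            # a single read file, e.g. only the R1 reads
        "--db", db,                  # database built with `diamond makedb`
        "--out", out,
        "--outfmt", "6",             # BLAST-style tabular output
        "--max-target-seqs", "1",    # keep only the best hit per read
        "--evalue", str(evalue),
        "--threads", str(threads),
    ]


# Hypothetical file names, for illustration only.
cmd = build_diamond_cmd("sample_R1.fastq", "markers.dmnd", "sample_R1.tsv")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Keeping only the best hit per read is one possible choice; it makes a later "reads with a hit" count straightforward, at the cost of ignoring secondary hits.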

0

I considered mapping the reads to a database, but that approach wouldn't provide the gene names or the metabolic pathways the genes participate in. What I really need is the percentage of reads that hit marker genes, together with the names of those genes and pathways. DIAMOND does not seem to provide such output files.

Most tools work with contigs or MAGs rather than raw reads; that seems to be the issue.

ADD REPLY • link 11 hours ago by shevch2009 ▴ 20

0

Then you need to decide whether to assemble your reads into contigs/MAGs and use the tools discussed, or to map the reads directly and then convert the results into KEGG IDs or pathways.

HUMAnN is probably the ideal tool. I don't quite understand the problem you noted in the thread mentioned above. You may want to go back and try running v.3.x of HUMAnN. It may be sufficient for your needs.

ADD REPLY • link 10 hours ago by GenoMax 153k
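
The second option mentioned above (map reads, then convert hits into pathways) could look roughly like this in Python. It is only a sketch: the function name, the read-to-gene dictionary and the gene-to-pathway table are all hypothetical, and a real mapping (for example to KEGG) would have to come from an annotation source you have access to.

```python
from collections import Counter


def pathway_counts(read_to_gene, gene_to_pathways):
    """Count reads per pathway from best hits (all inputs are hypothetical)."""
    counts = Counter()
    for read_id, gene in read_to_gene.items():
        # A gene may belong to several pathways; unannotated genes are skipped.
        for pathway in gene_to_pathways.get(gene, ()):
            counts[pathway] += 1
    return counts


# Toy data, for illustration only.
best_hits = {"read1": "geneA", "read2": "geneA", "read3": "geneB", "read4": "geneX"}
gene_map = {"geneA": ["pathway1"], "geneB": ["pathway1", "pathway2"]}
print(pathway_counts(best_hits, gene_map))
```

Note that a read hitting a gene with several pathways is counted once for each of them here, so the per-pathway counts can add up to more than the number of reads.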

0

Well, we already have MAGs, but my boss wants to check the raw reads, which is why I am trying to find a way.

HUMAnN gave me errors. It uses MetaPhlAn, and after I installed it, it showed errors when I tried to run it. The first was: unrecognized arguments: --bowtie2out. It looks like the MetaPhlAn version does not match the HUMAnN version.

After I tried to fix that, I got this error: The MetaPhlAn taxonomic profile provided does not contain the database version vOct22_CHOCOPhlAnSGB_202403 in any of its header lines.

This is despite the fact that I have downloaded the database.

I will try to reinstall it.

Thanks

ADD REPLY • link 10 hours ago by shevch2009 ▴ 20
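
The second error message above suggests that the database version string is looked up in the header lines of the MetaPhlAn profile. A minimal Python sketch for checking this by hand is shown below; it assumes that header lines start with `#`, and the function name and the toy header are hypothetical. It does not reproduce HUMAnN's own check.

```python
def profile_has_db_version(lines, version):
    """Return True if any '#' header line mentions the given database version."""
    for line in lines:
        # Assumption: header lines of the profile begin with '#'.
        if line.startswith("#") and version in line:
            return True
    return False


# Version string taken from the error message above.
expected = "vOct22_CHOCOPhlAnSGB_202403"

# Toy profile with a made-up header, for illustration only.
toy_profile = [
    "#mpa_vSOME_OTHER_VERSION",
    "#clade_name\trelative_abundance",
    "k__Bacteria\t100.0",
]
print(profile_has_db_version(toy_profile, expected))

# With a real file: profile_has_db_version(open("profile.tsv"), expected)
```

If the header reports a different version than the one the error asks for, the profile was most likely produced with a different database than the installed HUMAnN release expects.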

score 0 · Answer 1 · 2025-09-09

0

13 hours ago

colindaven 7.8k

I believe MGnify at the EBI (https://www.ebi.ac.uk/metagenomics) does something like this for you using domains. However, I have not used it, and you have to upload your data there first.

ADD COMMENT • link 13 hours ago by colindaven 7.8k

0

A quick look suggests that this tool expects longer sequences, and that they need to be protein (at least for the sequence search tool, judging by its example). That is not what the OP has.

ADD REPLY • link 13 hours ago by GenoMax 153k

0

I don't think any kind of web-based tool will work; the read files are very large (at least 10 GB). Usually, web versions have restrictions on the input file size or the number of sequences that can be processed.

ADD REPLY • link 11 hours ago by shevch2009 ▴ 20

score 0 · Answer 2 · 2025-09-09

I have used BURST (https://github.com/knights-lab/BURST) to do this before. You can take FASTA sequences for your target genes and build a reference database from them, and BURST will report all hits above a chosen percent identity. You then need to post-process the hits to turn them into percentages. Unfortunately, the tool is largely abandoned.
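
The post-processing step could be sketched in Python as follows. This assumes the hits are available as a tab-separated table with the read ID in the first column and the percent identity in the third (the column order of BLAST tabular format 6), and that the total number of reads is known separately. The function name, the 90% cutoff and the toy table are hypothetical.

```python
import csv
import io


def percent_reads_hit(table, total_reads, min_identity=90.0):
    """Percentage of reads with at least one hit at or above min_identity."""
    hit_reads = set()
    for row in csv.reader(table, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        read_id, pident = row[0], float(row[2])
        if pident >= min_identity:
            hit_reads.add(read_id)  # count each read once, however many hits it has
    return 100.0 * len(hit_reads) / total_reads


# Toy hit table, for illustration only.
toy = io.StringIO(
    "read1\tgeneA\t98.5\n"
    "read1\tgeneB\t91.0\n"
    "read2\tgeneA\t85.0\n"
    "read3\tgeneC\t100.0\n"
)
print(percent_reads_hit(toy, total_reads=10))
```

Collecting read IDs in a set keeps a read with several hits from being counted more than once, which matters when the question is "what fraction of reads hit any marker gene".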

score 0 · Answer 3 · 2025-09-09

If you want an overall taxonomic profile, this tool will do it from reads directly:

https://github.com/DessimozLab/read2tree

"Well, we already have MAGs, but my boss wants to check the raw reads, which is why I am trying to find a way."

This is like saying "I have a Mercedes, but I still want to try how it feels to drive a MINI Cooper." The only information you could potentially get out of reads that you don't already have in MAGs is about species that are so poorly sequenced that they don't group into a bin. Those are usually viruses and microbes of extremely low abundance. Alternatively, those may be low-abundance strains of your existing MAGs that are different enough not to assemble with their main MAG. Either way, it is unlikely that you can make a strong case from it.