Question

Metagenomic classification by custom database

0

5.7 years ago

mikael.lenz.strube ▴ 60

Hi,

I'm looking for a metagenomic classifier that will classify paired-end Illumina reads using a custom FASTA file as the database. I have looked at a bunch of them, and they are all very concerned with taxonomy, which is not relevant, and does not even exist, in this particular case (Centrifuge already does a fine job of taxonomic classification).

Ideally, the output would be a table of abundances, one for each entry in the database.

Does any such tool exist?

sequencing classification • 2.4k views

ADD COMMENT • link updated 5.7 years ago by Carambakaracho ★ 3.2k • written 5.7 years ago by mikael.lenz.strube ▴ 60

1

5.7 years ago

lakhujanivijay 5.8k

You could try Kaiju. Check its GitHub page and read the section on custom databases.

0

Hi Vijay,

I looked at Kaiju, but it requires NCBI taxon identifiers for custom databases, which I don't have.

ADD REPLY • link 5.7 years ago by mikael.lenz.strube ▴ 60

0

You can find all sorts of downloads related to NCBI Taxonomy here.

ADD REPLY • link 5.7 years ago by GenoMax 141k

0

Hi GenoMax,

The issue is that I have no taxonomy in my own database; the sequences potentially have arbitrary and anonymous headers.

ADD REPLY • link 5.7 years ago by mikael.lenz.strube ▴ 60

0

Hey, I just checked it again, and it says that it doesn't need the taxonomic classification. So maybe you can give it a try.

ADD REPLY • link 5.7 years ago by harish ▴ 450

0

I'm not sure. The documentation says:

It is also possible to make a custom database from a collection of protein sequences. The format needs to be a FASTA file in which the headers are the numeric NCBI taxon identifiers of the protein sequences, which can optionally be prefixed by another identifier (e.g. a counter) followed by an underscore, for example:

Am I misunderstanding something?

ADD REPLY • link 5.7 years ago by mikael.lenz.strube ▴ 60

score 3 · Accepted Answer · 2018-08-13

3

5.7 years ago

Carambakaracho ★ 3.2k

It sounds to me like all you need is some super-fast BLAST-like functionality. If I haven't misinterpreted your request, try either Benjamin Buchfink's DIAMOND (against a protein database) or NCBI Magic-BLAST (against a DNA database).

The abundance table is then a simple script using a hash/dictionary or something similar.
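A minimal sketch of such a script in Python, under a few assumptions that are not from the thread: the aligner was run with a BLAST-style tab-separated output in which column 1 is the read ID and column 2 is the database entry ID, comment lines start with `#`, and unaligned reads carry `-` as the reference. Each read ID is counted once, using its first reported hit, which is a simplification you may want to change for multi-mapping reads or mate pairs. The function names and file names are hypothetical.

```python
import sys
from collections import Counter


def count_hits(lines):
    """Count reads per database entry from BLAST-style tabular lines.

    Assumes column 1 = read ID and column 2 = reference ID (an assumption
    about the chosen output format, check your aligner's documentation).
    """
    counts = Counter()
    seen = set()  # read IDs already assigned to a reference
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a valid hit line
        query, ref = fields[0], fields[1]
        if ref == "-" or query in seen:
            continue  # unaligned placeholder, or read already counted
        seen.add(query)
        counts[ref] += 1
    return counts


def format_table(counts):
    """Return tab-separated rows, most abundant entry first."""
    return [f"{ref}\t{n}" for ref, n in counts.most_common()]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python abundance.py hits.tsv
    with open(sys.argv[1]) as handle:
        table = format_table(count_hits(handle))
    print("reference\tread_count")
    print("\n".join(table))
```

The counts are raw read counts; if the database entries differ a lot in length, dividing by entry length (or converting to a relative abundance) would be a reasonable next step.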

0

Magic-BLAST seems to be what I'm looking for, thanks a lot!

ADD REPLY • link 5.7 years ago by mikael.lenz.strube ▴ 60

0

Hi Mikael, FYI: general good practice is to mark helpful answers with a thumbs up and, if one was the correct answer, to mark it as the accepted answer, so others can spot the best solutions faster.

ADD REPLY • link 5.7 years ago by Carambakaracho ★ 3.2k