Question: taxonomic identification least common ancestor approach
3 months ago by
Moses70
United States / Bloomington / Indiana University Bloomington
Moses70 wrote:

Hi all,

I have 3,300 binned contigs (bacterial sequences), and for each bin I would like to know the species (where possible) or the least common ancestor, i.e. the most specific clade the bin can be assigned to. I understand that MEGAN is designed to do that; however, I have built my own phylogenetic tree and would like to annotate these bins with as much clade information as possible (as specific as I can get). To do that, I have predicted all the protein-coding genes for each bin and extracted marker genes (ribosomal proteins and elongation factors), anywhere between 30 and 80 genes per bin depending on the size of the genome or the completeness of the binning. Since I am working with protein sequences, I am running blastp with these marker proteins against the nr database so that later I can use MEGAN or a similar tool to infer taxonomy.

My questions are the following: is there a better tool than MEGAN out there that can infer taxonomy from my sequences? Blasting these marker genes against the nr database would take about a month at this rate, I think. Does anyone know of other methods or techniques to do this? I just want to come up with a list that maps each bin to its most specific taxonomy, be it at the species level, the genus level or higher, however specific it can get.
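
To make the goal concrete, here is a minimal sketch of the least-common-ancestor step, assuming the best hit of every marker gene has already been resolved to a root-to-leaf lineage. The function name, the `hits_per_bin` structure and the example lineages are illustrative assumptions, not the output of MEGAN or any other tool.

```python
from typing import Dict, List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest root-to-leaf prefix shared by every lineage."""
    if not lineages:
        return []
    shared: List[str] = []
    # zip(*lineages) walks the ranks in parallel and stops at the shortest lineage.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # the marker genes disagree at this rank
        shared.append(ranks[0])
    return shared


# Hypothetical input: bin name -> one lineage per marker-gene best hit.
hits_per_bin: Dict[str, List[List[str]]] = {
    "bin_001": [
        ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
         "Enterobacterales", "Enterobacteriaceae", "Escherichia"],
        ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
         "Enterobacterales", "Enterobacteriaceae", "Salmonella"],
        ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
         "Enterobacterales", "Enterobacteriaceae", "Klebsiella"],
    ],
    "bin_002": [
        ["Bacteria", "Firmicutes", "Bacilli"],
        ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
    ],
}

# Map every bin to the most specific lineage all of its markers agree on.
bin_taxonomy = {
    name: lowest_common_ancestor(lineages)
    for name, lineages in hits_per_bin.items()
}

for name, lineage in bin_taxonomy.items():
    print(name, "; ".join(lineage) or "unclassified", sep="\t")
```

In this toy input, bin_001 resolves to the family Enterobacteriaceae, while bin_002 only resolves to Bacteria. A strict LCA like this one falls back to a high rank as soon as a single marker gene hits elsewhere, so a real pipeline would probably need some tolerance for outlier hits.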

Many thanks in advance.

phylum genus taxonomy phylogeny • 145 views
ADD COMMENTlink modified 3 months ago by Asaf6.1k • written 3 months ago by Moses70
3 months ago by
Asaf6.1k
Israel
Asaf6.1k wrote:

To answer some of your questions: you can use DIAMOND instead of BLAST, which will cut the running time dramatically. I also don't see why you compare your proteins against nr; the reference dataset could be much smaller, containing only the relevant orthology groups from bacteria.
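
A minimal sketch of what the DIAMOND run could look like, written as Python helpers that only assemble the command lines. The file names (`marker_refs.faa`, `bin_001.markers.faa` and so on) and the thread count are placeholders, and the reduced reference FASTA is assumed to have been prepared separately.

```python
import shlex
from typing import List


def diamond_makedb_cmd(reference_faa: str, db_prefix: str) -> List[str]:
    """Command that builds a DIAMOND database from a protein FASTA file."""
    return ["diamond", "makedb", "--in", reference_faa, "--db", db_prefix]


def diamond_blastp_cmd(db_prefix: str, query_faa: str, out_tsv: str,
                       threads: int = 8) -> List[str]:
    """Command that searches marker proteins against the database.

    --outfmt 6 requests BLAST-style tabular output.
    """
    return [
        "diamond", "blastp",
        "--db", db_prefix,
        "--query", query_faa,
        "--out", out_tsv,
        "--outfmt", "6",
        "--threads", str(threads),
    ]


# Placeholder file names; replace them with your own paths.
commands = [
    diamond_makedb_cmd("marker_refs.faa", "marker_refs"),
    diamond_blastp_cmd("marker_refs", "bin_001.markers.faa", "bin_001.hits.tsv"),
]

for cmd in commands:
    print(shlex.join(cmd))
    # To actually run it where DIAMOND is installed:
    # import subprocess; subprocess.run(cmd, check=True)
```

Building the database from a small marker-gene reference rather than all of nr is what keeps both the `makedb` and the `blastp` step short.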

As for other tools: you can use checkm lineage_wf, to which you can supply your predicted genes (--genes), and it will give you the best taxonomic identity. I'm not sure, but I guess there is a way to alter the database to use your own taxonomic tree (I haven't fully understood what yours contains).

ADD COMMENTlink written 3 months ago by Asaf6.1k

Hi, thank you for your comments, and sorry for the late reply. I was actually looking into checkm and trying to get it running. Unfortunately it uses Python 2, so I had to roll back my Python version to 2 and install the dependencies. I have just started a run and will see what results it gives. Thank you for your suggestion, I hope I can get it to work.

ADD REPLYlink written 3 months ago by Moses70

The way to go is to create a virtualenv with Python 2 and install checkm in that virtualenv. You should try taxonomy_wf as well; this is the one you're looking for.

ADD REPLYlink written 3 months ago by Asaf6.1k

taxonomy_wf sounds more like it is meant for a particular phylum and involves specifying that phylum beforehand. The bins I want to identify are very diverse, so I don't see how I could run taxonomy_wf on them. lineage_wf, as I understand it, just extracts general marker genes and infers the taxonomy from those. I'm quite new to this, though, and not entirely sure I'm understanding it correctly.

ADD REPLYlink written 3 months ago by Moses70

You're right. I took a look at my code: I first run lineage_wf and then tree_qa with the lineage_wf output directory as input. tree_qa will give the best assignment on the tree.
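
Roughly, the two steps look like this. This is a sketch that only assembles the checkm command lines; the directory names, the `faa` extension, the thread count and the report file name are placeholders, and `--genes` is included on the assumption that the bins are protein FASTA files.

```python
import shlex
from typing import List


def checkm_lineage_wf_cmd(bin_dir: str, out_dir: str, extension: str = "faa",
                          threads: int = 8, genes: bool = True) -> List[str]:
    """Command for the lineage-specific workflow over a directory of bins."""
    cmd = ["checkm", "lineage_wf", "-x", extension, "-t", str(threads)]
    if genes:
        # Bins contain predicted proteins rather than nucleotide contigs.
        cmd.append("--genes")
    cmd += [bin_dir, out_dir]
    return cmd


def checkm_tree_qa_cmd(tree_dir: str, report_tsv: str) -> List[str]:
    """Command that reports where each bin was placed in the reference tree."""
    return ["checkm", "tree_qa", "--tab_table", "-f", report_tsv, tree_dir]


# Placeholder paths: tree_qa reads the output directory written by lineage_wf.
commands = [
    checkm_lineage_wf_cmd("bins/", "checkm_out/"),
    checkm_tree_qa_cmd("checkm_out/", "bin_taxonomy.tsv"),
]

for cmd in commands:
    print(shlex.join(cmd))
    # To actually run it inside the checkm virtualenv:
    # import subprocess; subprocess.run(cmd, check=True)
```

The tab-separated report from tree_qa can then be joined to the bin list to get one taxonomy string per bin.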

ADD REPLYlink written 3 months ago by Asaf6.1k

Oh, I see. Right now I'm running lineage_wf over my bins; maybe I should also try running tree_qa once it's done. Thanks for the advice.

ADD REPLYlink written 3 months ago by Moses70