Question: Linking bacterial taxonomy to predicted genes with Prokka
13 months ago, Hansen_869 wrote:

Hi!

I recently binned a large number of contigs from a metagenomic sample. I merged all my bins into one file, with FASTA headers (>bin-name) that describe the bacterial taxonomy each contig represents.

The next step was to annotate the genes of my bins. When I annotate the genes (with Prokka/Prodigal), I get a single file with all predicted genes. However, I have no way of knowing which gene belongs to which bacterium, since the taxonomy headers are not preserved.

Do you have any ideas on how to find out which genes belong to which bacteria? I would rather not run the bins one by one, as I have thousands of them and that would generate thousands of files to manage (that's why I merged all the bins).

Tags: maxbin, bwa, prokka

Thanks for your suggestions, guys! It seems the locus_tag is not an option with as many contigs as I have. I never studied the .gff file closely enough; it looks like a simple Python script could help me out there. If that doesn't work out, I'll definitely look into running the bins separately and then merging.

I'll get back to this thread if it doesn't work out.

(written 13 months ago by Hansen_869)

"I have no way of knowing what gene belongs to what bacteria, since the bacterial taxonomy headers are not preserved."

Do you have spaces in your fasta headers? Can you replace them with _ so when you run Prokka that information is preserved?

(written 13 months ago by genomax)

My FASTA headers look like this:

>001__g__Clostridium_2_no_1
ATCAATACATACACTGGTACAATGAAAAACGCATTAAAATGTCACTTGGCGGAATGAGTCCGCTAAACTA
TAGAAAGAGTCTGGGGTTGGTCGCTTAAGTCCAACTTTATGTCCGCACCCTTGTGCGGGCAAGAGAGTGG
AATTTAAAATTTCATTTCAGGGCCATTTTCGATTTATAGGTGTAACTTGTAATATAGGGGGCTGAGTAGT
ATAATTATTTATTATATATTTAAGAAAGTTCGTATAATTGTTATTATGAATAATGGTCATAAAAGTTGCC
>002__f__Lachnospiraceae_no_1
ACTTCTTTTTTCTGGAAATGATCTCTTTTTTATTCGCTCTCTCTTTGCAATGCCTTAGCTTAATGACATT
GACCTCACGGCCTGAGCGATTTTTCTTTTTAGGGACTTATTTTACAGCCCCCTAAATGACACTGATCGCT
ATCTGCCAATACAGATTGTCTCTGCTTATTTTTTCTTATCCTTAGGATTCTTATATACCCAGAGTGCACC
AAGAATTCCTGCTATGTATAGGATGTTCACAAGAATGGACGGGTTATCCTTCAAGAGGTAAACGATCGCG

And my Prokka output looks like this:

>FCHNPLMC_00004 Transcriptional regulatory protein WalR
MTDSKILLVDDEKDIVDLMEEVLRQDGFLEIRRAYRGSEAVTLCREFQPDAVILDVMLPD
MDGLEVCRRIREFSYCSILFLSSRNDDIDKILGLSSGGDDYITKPFSPREVAFRVKAQLR
RQRYQNAPSPAVSSVLTAGPLSLDQESGRVWKNGREISLTGREFLLLSYLMENTDKIISK
ERLYEQVWGESSCICDNTIMVHIRHLREKTEADPSKPQQLITVKGLGYKLKKRIE
>FCHNPLMC_00005 Multiple sugar-binding protein
MPKKLLALFLVLTCAASAITGCSSSKNRVVNEDNQIDQEIVTITFFGNKYEPENVIVIEQ
IISDFMRENPSVRVSYESLKGNDYFEALEKRMEHGRGDDIFMVNHDVLLKLEADGQVADL
SGLSTLPDFTDEMRSQMGEGKITWVPTTVSIFGLYCNLDLLKEHKQEVPETLSEWEAVCE
YFVNCGITPVIANNDISLKTLAIGRSFWQVYQDKRQTEVFGQLNHGRETLSEYLTDGFSI

So I guess I would have to tell Prokka to preserve the FASTA headers somehow?

(written 13 months ago by Hansen_869)

I have never run Prokka, but does "FCHNPLMC_00004" not relate back to any of your contigs? That is a bit bizarre. Anyhow, Prokka uses Prodigal to predict CDSs, and you can run Prodigal on its own: either in metagenome mode or, ideally, on each bin of your assembly individually, so that it can train itself instead of using an approximation based on one of its built-in models. An intermediate step with thousands of files shouldn't make much difference; you can always merge them later.

(written 13 months ago by Carambakaracho)
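
A minimal sketch of why running Prodigal directly can help here: Prodigal names each predicted protein as the contig ID followed by an underscore and a gene index, so the contig (and its taxonomy label) can be read back from the protein header. The example headers, coordinates, and the file names in the commented command are made up for illustration.

```python
# Hypothetical invocation (file names are placeholders):
#   prodigal -i merged_bins.fasta -a proteins.faa -p meta


def contig_of(prodigal_header: str) -> str:
    """Return the contig ID embedded in a Prodigal protein header.

    Header layout: ">contigID_geneIndex # start # end # strand # attributes"
    """
    protein_id = prodigal_header.lstrip(">").split()[0]
    # The gene index is whatever follows the last underscore.
    contig_id, _gene_index = protein_id.rsplit("_", 1)
    return contig_id


def group_by_contig(headers):
    """Map each contig ID to the list of protein IDs predicted on it."""
    groups = {}
    for header in headers:
        protein_id = header.lstrip(">").split()[0]
        groups.setdefault(contig_of(header), []).append(protein_id)
    return groups


# Invented example headers that reuse the contig names from this thread.
EXAMPLE_HEADERS = [
    ">001__g__Clostridium_2_no_1_1 # 3 # 170 # 1 # ID=1_1;partial=00",
    ">001__g__Clostridium_2_no_1_2 # 200 # 280 # -1 # ID=1_2;partial=00",
    ">002__f__Lachnospiraceae_no_1_1 # 12 # 260 # 1 # ID=2_1;partial=00",
]

if __name__ == "__main__":
    for contig, proteins in group_by_contig(EXAMPLE_HEADERS).items():
        print(contig, len(proteins))
```

Splitting on the last underscore matters because the contig names themselves contain underscores.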

If the locus_tag option you feed to Prokka is sufficiently unique (i.e. unique to each species/strain), then you can work out which genes come from which genomes just from the locus tag, which will be inserted into the FASTA headers.

(written 13 months ago by Joe)
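
One way the locus-tag idea could be sketched, assuming the bins are kept as one FASTA file per bin: build one Prokka command per bin with its own `--locustag`, and keep a table from tag to bin name. The `BIN00001`-style tag scheme, the `prokka_out` directory, and the helper names are hypothetical; the commands are only assembled here, not executed.

```python
from pathlib import Path


def locus_tag_for(index: int) -> str:
    """Build a short unique tag such as 'BIN00001' (naming scheme is an assumption)."""
    return f"BIN{index:05d}"


def prokka_command(bin_fasta: str, tag: str, out_root: str = "prokka_out") -> list:
    """Assemble (but do not run) a Prokka command line for one bin."""
    name = Path(bin_fasta).stem
    return [
        "prokka",
        "--outdir", str(Path(out_root) / name),
        "--prefix", name,
        "--locustag", tag,
        bin_fasta,
    ]


def plan_runs(bin_fastas, out_root: str = "prokka_out"):
    """Return ({locus_tag: bin name}, [command, ...]) covering all bins."""
    table, commands = {}, []
    for index, fasta in enumerate(sorted(bin_fastas), start=1):
        tag = locus_tag_for(index)
        table[tag] = Path(fasta).stem
        commands.append(prokka_command(fasta, tag, out_root))
    return table, commands


def bin_of(gene_id: str, table: dict) -> str:
    """Look up the bin for a gene ID of the form '<locus_tag>_<number>'."""
    tag = gene_id.rsplit("_", 1)[0]
    return table[tag]
```

Each command list could then be passed to `subprocess.run(cmd, check=True)`, and the merged protein file can be traced back to bins with `bin_of`.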
13 months ago, Mark wrote:

Prokka doesn't preserve contig names unfortunately :( https://github.com/tseemann/prokka/issues/183

My suggestion would be to run Prokka on each contig individually then merge later on.

13 months ago, andres.firrincieli wrote:

Prokka should give you a GFF file in which you can find the association between each contig and its CDSs. You could use that file to rename the headers of your CDSs.

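
The GFF approach could look roughly like the sketch below. It assumes that the first GFF column still carries the original contig ID; if Prokka has renamed the contigs, an extra old-to-new name table would be needed. The example records, coordinates, and the pairing of locus tags with contigs are invented for illustration.

```python
def locus_to_contig(gff_text: str) -> dict:
    """Map each CDS locus tag in a GFF3 text to the contig (column 1) it sits on."""
    mapping = {}
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # the sequence section follows; no more feature lines
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        gene_id = attrs.get("locus_tag") or attrs.get("ID")
        if gene_id:
            mapping[gene_id] = cols[0]
    return mapping


def relabel_fasta(faa_text: str, mapping: dict) -> str:
    """Rewrite '>gene description' headers as '>gene|contig description'."""
    out = []
    for line in faa_text.splitlines():
        if line.startswith(">"):
            gene_id, _, description = line[1:].partition(" ")
            contig = mapping.get(gene_id, "unknown_contig")
            line = f">{gene_id}|{contig} {description}".rstrip()
        out.append(line)
    return "\n".join(out)


# Invented example data (tab-separated GFF columns built explicitly).
GFF_EXAMPLE = "\n".join([
    "##gff-version 3",
    "\t".join(["001__g__Clostridium_2_no_1", "Prodigal:2.6", "CDS", "3", "170",
               ".", "+", "0", "ID=FCHNPLMC_00004;locus_tag=FCHNPLMC_00004"]),
    "\t".join(["002__f__Lachnospiraceae_no_1", "Prodigal:2.6", "CDS", "12", "260",
               ".", "-", "0", "ID=FCHNPLMC_00005;locus_tag=FCHNPLMC_00005"]),
    "##FASTA",
    ">001__g__Clostridium_2_no_1",
    "ATCAATACATAC",
])
FAA_EXAMPLE = "\n".join([
    ">FCHNPLMC_00004 Transcriptional regulatory protein WalR",
    "MTDSKILLVDDEKDIVDLMEEVLRQDGF",
    ">FCHNPLMC_00005 Multiple sugar-binding protein",
    "MPKKLLALFLVLTCAASAITGCSSSKNR",
])

if __name__ == "__main__":
    print(relabel_fasta(FAA_EXAMPLE, locus_to_contig(GFF_EXAMPLE)))
```

Because the taxonomy is part of the contig name, appending the contig to each gene header is enough to tell which bacterium a gene came from.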