Linking bacterial taxonomy to predicted genes with Prokka
2
0
2.2 years ago
Hansen_869 ▴ 20

Hi!

I recently binned a large number of contigs from a metagenomic sample. I merged all my bins into one file, and the FASTA headers (>bin-name) describe which bacterial taxon each contig represents.

The next step was to annotate the genes of my bins. When I annotate the genes (with Prokka/Prodigal), I get a single file with all predicted genes. However, I have no way of knowing what gene belongs to what bacteria, since the bacterial taxonomy headers are not preserved.

Do you guys have any ideas on how to tell which genes belong to which bacteria? I would rather not run the bins one by one, as I have thousands of them and it would generate thousands of files to manage (that's why I merged all the bins).

prokka bwa maxbin • 921 views
1

Thanks for your suggestions, guys! It seems the locus_tag is not an option with as many contigs as I have (a lot). I never studied the .gff file closely enough; it seems a simple Python script could help me out there! If that doesn't work out, I'll definitely look into running the bins separately and then merging.

I'll get back to this thread if it doesn't work out.

0

I have no way of knowing what gene belongs to what bacteria, since the bacterial taxonomy headers are not preserved.

Do you have spaces in your fasta headers? Can you replace them with _ so when you run Prokka that information is preserved?
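
A minimal sketch of that header clean-up in Python, assuming a plain FASTA file. The function name `sanitize_headers` is hypothetical (it is not part of Prokka), and the clean-up only helps if the downstream tool keeps the sequence ID at all.

```python
def sanitize_headers(fasta_lines):
    """Replace whitespace in FASTA header lines with underscores.

    Sequence lines are passed through unchanged.
    """
    cleaned = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Join the whitespace-separated parts with underscores so the
            # whole description becomes part of the sequence ID.
            line = ">" + "_".join(line[1:].split())
        cleaned.append(line)
    return cleaned
```

It takes any iterable of lines, so an open file handle can be passed in directly and the result written back out with one line per entry.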

0

>001__g__Clostridium_2_no_1
ATCAATACATACACTGGTACAATGAAAAACGCATTAAAATGTCACTTGGCGGAATGAGTCCGCTAAACTA
TAGAAAGAGTCTGGGGTTGGTCGCTTAAGTCCAACTTTATGTCCGCACCCTTGTGCGGGCAAGAGAGTGG
AATTTAAAATTTCATTTCAGGGCCATTTTCGATTTATAGGTGTAACTTGTAATATAGGGGGCTGAGTAGT
ATAATTATTTATTATATATTTAAGAAAGTTCGTATAATTGTTATTATGAATAATGGTCATAAAAGTTGCC
>002__f__Lachnospiraceae_no_1
ACTTCTTTTTTCTGGAAATGATCTCTTTTTTATTCGCTCTCTCTTTGCAATGCCTTAGCTTAATGACATT
GACCTCACGGCCTGAGCGATTTTTCTTTTTAGGGACTTATTTTACAGCCCCCTAAATGACACTGATCGCT
ATCTGCCAATACAGATTGTCTCTGCTTATTTTTTCTTATCCTTAGGATTCTTATATACCCAGAGTGCACC
AAGAATTCCTGCTATGTATAGGATGTTCACAAGAATGGACGGGTTATCCTTCAAGAGGTAAACGATCGCG


And my Prokka output looks like this:

>FCHNPLMC_00004 Transcriptional regulatory protein WalR
MTDSKILLVDDEKDIVDLMEEVLRQDGFLEIRRAYRGSEAVTLCREFQPDAVILDVMLPD
MDGLEVCRRIREFSYCSILFLSSRNDDIDKILGLSSGGDDYITKPFSPREVAFRVKAQLR
RQRYQNAPSPAVSSVLTAGPLSLDQESGRVWKNGREISLTGREFLLLSYLMENTDKIISK
>FCHNPLMC_00005 Multiple sugar-binding protein
MPKKLLALFLVLTCAASAITGCSSSKNRVVNEDNQIDQEIVTITFFGNKYEPENVIVIEQ
SGLSTLPDFTDEMRSQMGEGKITWVPTTVSIFGLYCNLDLLKEHKQEVPETLSEWEAVCE
YFVNCGITPVIANNDISLKTLAIGRSFWQVYQDKRQTEVFGQLNHGRETLSEYLTDGFSI


So I guess I would have to tell Prokka to preserve the FASTA headers somehow?

0

I have never run Prokka, but does the "FCHNPLMC_00004" not relate back to any of your contigs? This is a bit bizarre. Anyhow, Prokka uses Prodigal to predict CDSs, and you can run it alone, either in metagenome mode or, ideally, on each bin of your assembly individually so that it can train itself instead of using an approximation from one of its built-in models. Having an intermediate step with thousands of files shouldn't make much difference; you can always merge them later.

0

If the locus_tag option you feed to Prokka is sufficiently unique (i.e. unique to each species/strain), then you can work out which genes come from which genomes just by the locus tag, which will be inserted into the FASTA headers.
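
A rough sketch of how that could look when each bin is annotated in its own run. It only builds the command lines; it does not execute them. `--outdir`, `--prefix` and `--locustag` are Prokka options, but the `BIN00001`-style tag scheme, the output directory names and both function names are assumptions made up for illustration.

```python
def prokka_commands(bin_fastas):
    """Build one Prokka command per bin, each with its own locus tag prefix.

    bin_fastas: dict mapping a bin name to the path of its FASTA file.
    Returns (commands, tag_to_bin).
    """
    commands = []
    tag_to_bin = {}
    for index, (bin_name, fasta_path) in enumerate(sorted(bin_fastas.items()), start=1):
        tag = f"BIN{index:05d}"  # hypothetical naming scheme, one tag per bin
        tag_to_bin[tag] = bin_name
        commands.append([
            "prokka",
            "--outdir", f"prokka_{tag}",
            "--prefix", tag,
            "--locustag", tag,
            fasta_path,
        ])
    return commands, tag_to_bin


def bin_for_gene(gene_id, tag_to_bin):
    """Look up the bin for a gene ID such as 'BIN00001_00004'."""
    tag = gene_id.rsplit("_", 1)[0]  # drop the trailing gene counter
    return tag_to_bin[tag]
```

Each command is a plain argument list, so it could be handed to `subprocess.run` or written to a job script; keeping `tag_to_bin` around is what lets you trace a gene ID back to its bin afterwards.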

2
2.2 years ago
Mark ★ 1.1k

Prokka doesn't preserve contig names unfortunately :( https://github.com/tseemann/prokka/issues/183

My suggestion would be to run Prokka on each contig individually then merge later on.
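
For the merge step, here is a small sketch that combines the per-run FASTA outputs and writes the source name into every header, so the origin of each gene survives the merge. The function name and the `|` separator are assumptions, not anything Prokka produces.

```python
def merge_with_labels(per_run_records):
    """Merge per-run FASTA lines, prepending the source name to each header.

    per_run_records: dict mapping a source name (e.g. a bin name) to an
    iterable of FASTA lines from that run's output.
    """
    merged = []
    for source_name, lines in per_run_records.items():
        for line in lines:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # Keep the original gene ID and description after the label.
                line = f">{source_name}|{line[1:]}"
            merged.append(line)
    return merged
```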

2
2.2 years ago

Prokka should give you a GFF file where you can find the association between each contig and its CDSs. You could use that file to rename the headers of your CDSs.
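
A minimal sketch of that idea in Python. It assumes the GFF is standard tab-separated GFF3 with the contig name in column 1 and a `locus_tag=` attribute in column 9, and that column 1 still carries your original contig names. The function names, the `|` separator and the `unknown_contig` fallback are all made up for illustration.

```python
def locus_to_contig(gff_lines):
    """Map each feature's locus_tag to the contig (GFF column 1) it lies on."""
    mapping = {}
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # sequence section starts here; no more feature lines
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a feature line
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        tag = attributes.get("locus_tag")
        if tag:
            mapping[tag] = fields[0]
    return mapping


def rename_headers(fasta_lines, mapping):
    """Prefix each FASTA header with the contig name of its locus tag."""
    renamed = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            gene_id = line[1:].split()[0]
            contig = mapping.get(gene_id, "unknown_contig")
            line = f">{contig}|{line[1:]}"
        renamed.append(line)
    return renamed
```

With headers like `>001__g__Clostridium_2_no_1`, the taxonomy then travels with every gene, and a single merged run is enough.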