Question

How to extract protein and DNA sequences from Prokka-generated gbk or gff files based on locus_tag?

0

3.2 years ago

Kumar ▴ 120

I need to extract accessory gene sequences (both DNA and amino acid) identified in Roary pangenome output. I have the list of locus_tags and the corresponding gbk and gff files, which were generated with the Prokka pipeline. Is there a way, or an existing tool, to extract both the amino acid and DNA sequences for these locus_tags from the gbk or gff files? Samples of the Roary accessory gene locus_tag list and the corresponding strain's gbk and gff files are shown below.

locus_tag list.csv

             locus_tag/Pcissicola19
    xynB_1   BGDHLHFA_02833
    smpB     BGDHLHFA_01427

Pcissicola19.gbk

     gene            complement(39965..40852)
                     /gene="xynB_3"
                     /locus_tag="BGDHLHFA_02833"
     CDS             complement(39965..40852)
                     /gene="xynB_3"
                     /locus_tag="BGDHLHFA_02833"
                     /EC_number="3.2.1.37"
                     /inference="ab initio prediction:Prodigal:002006"
                     /inference="similar to AA sequence:UniProtKB:P36906"
                     /codon_start=1
                     /transl_table=11
                     /product="Beta-xylosidase"
                     /protein_id="Prokka:BGDHLHFA_02833"
                     /translation="MPELLAFVAKHKLPIDFVTTHTYGVDGGFLDENGKQDTKLSASL
                     DAIVGDVRRVRAQIQASPFPNLPLYFTQWSSSYTPRDFVHDSYISAPYILTKLKQVQG
                     LVQGMSYWTYTDLFEEPGPPPTPFHGGFGLMNREGIRKPAWFAYKYLHALKGRDVPLS
                     DAHSLAAVDGTRVAALVWNWQQPMQAVSNTPFYTKQVPATDSAPLRMRMTHVPAGTYQ
                     LQVRKTGYRRNDPLSLYIDMGMPKDLAPRQLTQLRQATHDAPEQDRRVRVGADGVVEI
                     NVPMRSNDVVLLTLEPAAR"

Pcissicola19.gff

ID=BGDHLHFA_02833_gene;Name=xynB_3;gene=xynB_3;locus_tag=BGDHLHFA_02833
gnl|Prokka|BGDHLHFA_249 Prodigal:002006 CDS 39965   40852   .   -   0   ID=BGDHLHFA_02833;Parent=BGDHLHFA_02833_gene;eC_number=3.2.1.37;Name=xynB_3;gene=xynB_3;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P36906;locus_tag=BGDHLHFA_02833;product=Beta-xylosidase;protein_id=gnl|Prokka|BGDHLHFA_02833

For reference, my datasets include both draft and complete genomes.

The expected DNA and amino acid sequence outputs are shown below, respectively:

>BGDHLHFA_02833
tcagcgcgccgccggctccagcgtcagcagcaccacatcgttgctgcgcatcggcacgttgatctcgaccacgccatcggcgcccacacgcacacgccgatcctgttcgggcatcgtgcgtggcctgtcgcagctgcgtcaactggcgcggcgccaggtccttgggcatgcccatgtcgatgtacagcgacaacgggtcgttacgccgatagccggtcttgcgcacctgcagctggtacgtgccggcaggcacatgggtcatgcgcatgcgcagcggcgcgctgtcggtggcgggcacctgtttggtgtagaacggcgtattgctcaccgcctgcatgggctgctgccaattccacaccagtgcggcgacgcgcgtgccgtccactgcggcgagggaatgtgcgtcgctcagcggcacatcgcggcccttgagcgcatgcaagtacttgtaagcgaaccaggccggtttgcgaatgccttcgcgattcatcagcccaaacccgccgtggaagggcgtgggcggtgggccgggttcttcgaacagatcggtatagtccagtaactcatgccctgcaccaggccctgcacctgcttgagcttggtcaggatgtacggcgcgctgatgtaactgtcgtggacgaaatcgcgcggcgtatagctgctgctccactgggtgaagtacagcggcaggttgggaaatggcgaggcctggatctgcgcgcgcacgcgtcgcacatcgccgacgatggcatccagagatgcggacagcttggtgtcctgcttgccgttctcatcgagaaacccgccatccacgccataggtatgcgtggtgacgaagtcgatcggcagtttgtgcttggcaacgaaggccagcagttccggcac

>BGDHLHFA_02833
MPELLAFVAKHKLPIDFVTTHTYGVDGGFLDENGKQDTKLSASLDAIVGDVRRVRAQIQASPFPNLPLYFTQWSSSYTPRDFVHDSYISAPYILTKLKQVQGLVQGMSYWTYTDLFEEPGPPPTPFHGGFGLMNREGIRKPAWFAYKYLHALKGRDVPLSDAHSLAAVDGTRVAALVWNWQQPMQAVSNTPFYTKQVPATDSAPLRMRMTHVPAGTYQLQVRKTGYRRNDPLSLYIDMGMPKDLAPRQLTQLRQATHDAPEQDRRVRVGADGVVEINVPMRSNDVVLLTLEPAAR

genome perl python bash R • 2.1k views

updated 3.2 years ago by Mensur Dlakic ★ 27k • written 3.2 years ago by Kumar ▴ 120

0

Please post example files or lines so the issue is easier to understand, and do not post images of the data.

3.2 years ago by cpad0112 21k

0

@cpad0112 I have revised my question. Please go through it.

3.2 years ago by Kumar ▴ 120

1

I recently posted here how to extract aa and nt sequences from gbk (C: How to extract all gene nucleotide sequences separately from multiple Genbank fi). What you need to do is take your list of locus_tags, loop over those tags, and extract only the matching sequences from the gbk.
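The loop described above could look roughly like the sketch below. It is not the code from the linked post: the function name `extract_cds_by_locus_tag` and the regular expressions are illustrative. It assumes simple CDS locations (`start..end` or `complement(start..end)`, as in the Prokka sample in the question), that every record carries its sequence under ORIGIN, and that the contigs contain only a, c, g, t and n. A dedicated GenBank parser such as Biopython's would be more robust.

```python
import re

# Simple locations only: "39965..40852" or "complement(39965..40852)".
LOCATION = re.compile(r"^(complement\()?<?(\d+)\.\.>?(\d+)\)?$")
COMPLEMENT = str.maketrans("acgtnACGTN", "tgcanTGCAN")


def extract_cds_by_locus_tag(gbk_text, wanted):
    """Return {locus_tag: (dna, protein)} for CDS features listed in `wanted`."""
    found = {}
    for record in gbk_text.split("\n//"):  # one record per contig
        if "\nFEATURES" not in record or "\nORIGIN" not in record:
            continue
        head, origin = record.split("\nORIGIN", 1)
        # Drop the rest of the ORIGIN line, then keep only the bases.
        contig = re.sub(r"[^A-Za-z]", "", origin.partition("\n")[2])
        features = head.split("\nFEATURES", 1)[1]
        # Feature keys are indented by 5 spaces; qualifiers are indented further.
        for block in re.split(r"\n(?= {5}\S)", features):
            first_line = block.split("\n", 1)[0].split()
            if len(first_line) != 2 or first_line[0] != "CDS":
                continue
            location = LOCATION.match(first_line[1])
            tag = re.search(r'/locus_tag="([^"]+)"', block)
            if location is None or tag is None or tag.group(1) not in wanted:
                continue  # joins and other compound locations are skipped
            start, end = int(location.group(2)), int(location.group(3))
            dna = contig[start - 1:end]  # GenBank coordinates are 1-based, inclusive
            if location.group(1):  # minus strand: reverse-complement
                dna = dna.translate(COMPLEMENT)[::-1]
            protein = re.search(r'/translation="([^"]+)"', block)
            # Translations wrap across lines, so strip the embedded whitespace.
            aa = re.sub(r"\s+", "", protein.group(1)) if protein else ""
            found[tag.group(1)] = (dna, aa)
    return found
```

To use it, read the whole file into a string, for example `extract_cds_by_locus_tag(open("Pcissicola19.gbk").read(), {"BGDHLHFA_02833"})`, and write each returned pair out as FASTA. Note that for a `complement(...)` feature this returns the coding strand, i.e. the reverse complement of the contig slice.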

3.2 years ago by cpad0112 21k

score 2 · Accepted Answer · 2021-02-18

2

3.2 years ago

Mensur Dlakic ★ 27k

Prokka also writes .ffn and .faa files, which contain the nucleotide sequences of the predicted genes and their protein translations, respectively. They should carry the same annotations as the .gbk files. In this case you don't need to parse anything: just extract the sequences of interest directly from these files.
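As a minimal sketch of that extraction in plain Python: it assumes the first word of each FASTA header in the .ffn/.faa files is the locus_tag (e.g. `>BGDHLHFA_02833 Beta-xylosidase`), and the helper names `read_fasta` and `select_by_locus_tag` are made up for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:  # flush the last record
        yield header, "".join(chunks)


def select_by_locus_tag(handle, wanted):
    """Return {locus_tag: sequence} for records whose ID is in `wanted`."""
    selected = {}
    for header, seq in read_fasta(handle):
        parts = header.split(None, 1)  # ID is the first whitespace-delimited word
        if parts and parts[0] in wanted:
            selected[parts[0]] = seq
    return selected


# In-memory demo; with real data, pass open("<prefix>.faa") or open("<prefix>.ffn").
demo = io.StringIO(">TAG_00001 Beta-xylosidase\nMPELLAF\nVAKHK\n>TAG_00002 hypothetical protein\nMKK\n")
for tag, seq in select_by_locus_tag(demo, {"TAG_00001"}).items():
    print(f">{tag}\n{seq}")
```

Running the same function once on the .ffn file and once on the .faa file with the set of locus_tags from the Roary list gives the DNA and amino acid sequences separately.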