Question

Convert nucleotide fasta of contigs to protein sequence genbank

0

4.4 years ago

claping ▴ 20

I recently ran Prokka on a .fasta file of ordered contig DNA sequences. The file looks like this:

>NODE_6_length_103505_cov_20.694918
agcgattagcccccagcgattagcgaaaagcgattttttattagcgattagcgattagcg
attagcgattagcgattttagaactatttgcctaaatttctgcttaaatatctacagaaa

The annotation went (fairly) well, but I would like to rerun the program on the correlated amino acid sequence ordered contigs. The Prokka vignette resource online states:

"If you have Genbank or Protein FASTA file(s) that you want to annotate genes from as the first priority, use the --proteins myfile.gbk. Please make sure it has a recognisable file extension like .gb or .gbk or auto-detect will fail."

This may be a basic question, but how can I confidently translate my ordered DNA sequence .fasta file into an ordered protein sequence .gb or .gbk file? I am hoping to find a straightforward solution (I have basic Linux and R skills).

fasta nucleotide genbank protein prokka • 3.3k views

ADD COMMENT • link 4.4 years ago by claping ▴ 20

1

It seems to me that all you need is a reference genome in GenBank (.gbk) format, ideally a high-quality one with /gene and /EC_number qualifiers, so that the naming is consistent.

ADD REPLY • link 4.4 years ago by Sishuo Wang ▴ 230

0

Thanks @SishuoWang. I am hoping to input the unknown genome itself as protein sequence. Do you think this is possible in Prokka? Or is it only possible to supply a reference genome as protein sequence (along with the unknown genome as DNA sequence)? I may be misunderstanding the Prokka vignette.

ADD REPLY • link 4.4 years ago by claping ▴ 20

1

"correlated amino acid sequence ordered contigs"

What does this mean?

It sounds to me like you want to re-run the analysis, using the annotation you just created as some kind of reference?

Contig ordering is irrelevant as far as prokka is concerned.

ADD REPLY • link 4.4 years ago by Joe 21k

2

Thanks @Joe. I ran the Prokka genome annotation using, as input, the genome I want annotated in .fasta format as DNA sequence of contigs (an example of the format is shown in my original post). However, I wanted to convert that .fasta input into its corresponding amino acid sequence (so an amino acid sequence of contigs) and then have Prokka annotate that input instead. Is this possible? It seems so from the vignette:

"If you have Genbank or Protein FASTA file(s) that you want to annotate genes from as the first priority, use the --proteins myfile.gbk. Please make sure it has a recognisable file extension like .gb or .gbk or auto-detect will fail."

I have been having a hard time converting my new input file from the original DNA sequence contig fasta file into its corresponding Protein Genbank/FASTA file that the vignette calls for. Here is what I tried so far:

1) Converted the DNA sequence to amino acid sequence using emboss_transeq. This gave me a protein sequence file in .fasta format.
2) Converted the protein sequence file from .fasta format to .gbk format using fasta_to_genbank. This gave me a protein sequence file in .genbank format that looked as follows:

LOCUS       NODE_15_1\             15902 aa                     UNK 01-JAN-1980
DEFINITION  NODE_15_1\
ACCESSION   NODE_15_1\
VERSION     NODE_15_1\
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
ORIGIN
        1 ipqcfvlpkk yl*pi*tstp *aepltfwvw vlgrvmvgti pklastlkal lca**ttlla
       61 \f*mtlrafp klvfsssgif pfitslsafs tleraslasa i*lfk*ytlt *glffklevi
      121 *\ftalkacc kdeiialslg *lrffral*r vwrlscmcvk egsvlliakg vgvkgivaif
      181 gv\*gcgiel gflv*dwvlp sf*gcgfigv lglmvglnsl *rfsyftkfs kslakv*psk

I noticed that, for some reason, I lost the first contig somewhere along this conversion process. Nonetheless, when I input this file into Prokka using the command:

prokka --proteins inputFile.gbk

I receive the error:

[15:02:46] Annotating as >>> Bacteria <<<
[15:02:46] Please supply a contig FASTA file on the command line.

So, I am unsure how to create the appropriate input file type needed for Prokka if I want to annotate genes as a first priority from a "Protein FASTA file(s)".
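
On the lost contig mentioned above: one quick way to see where a record disappears is to count records after each conversion step. The following is only a minimal sketch; the function name count_records is hypothetical, and it assumes that FASTA records start with ">" and GenBank records start with a LOCUS line.

```shell
# Count sequence records in a FASTA or GenBank file.
# Assumption: the file type is guessed from the extension only.
count_records() {
    case "$1" in
        *.gb|*.gbk|*.genbank) grep -c '^LOCUS' "$1" ;;  # one LOCUS line per GenBank record
        *)                    grep -c '^>' "$1" ;;      # one header line per FASTA record
    esac
}
```

Running count_records on the original contig file, the translated .fasta, and the final .gbk should give the same number if no record was dropped.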

ADD REPLY • link 4.4 years ago by claping ▴ 20

1

No, you're misunderstanding.

You don't need fully translated contigs. You just need a selection of protein sequences to act as a 'trusted database' from which Prokka will start. The intended use is, for example, when you study a weird bacterium without many genomes, but there already exists one reference sequence (say) with many hand-curated annotations.

You're receiving your error because you aren't supplying the 'blank' contigs.fasta which prokka acts on. This is usually the final positional argument, if memory serves.

Basically what you need is:

prokka --proteins referenceproteins.fasta mycontigs.fasta

or

prokka --proteins referenceproteins.gbk mycontigs.fasta

Here referenceproteins is either a multi-FASTA (of amino acid sequences with correctly formatted headers) or a GenBank file of an existing annotated set of sequences. Your contigs.fasta file should still be a DNA sequence file containing the contigs to be annotated.

It sounds like you've significantly munged your data, so I would start over if I were you.

NB - there's no requirement to use --proteins. You only need it if you have a set of proteins you trust over the annotations that prokka would otherwise pick up via its own HMMER/BLAST searches.
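
To guard against mixing the two inputs up again, a rough check of what each file contains can help before running prokka. This is only a heuristic sketch: the function name fasta_type and the 90% cut-off are assumptions, not part of Prokka.

```shell
# Guess whether a FASTA file holds nucleotide or protein sequence by
# measuring the share of A/C/G/T/U/N letters in the sequence lines.
# Assumption: >= 90% such letters means nucleotide (arbitrary cut-off).
fasta_type() {
    awk '
        /^>/ { next }                      # skip header lines
        {
            seq = toupper($0)
            gsub(/[^A-Z*]/, "", seq)       # drop whitespace, digits, etc.
            total += length(seq)
            nuc   += gsub(/[ACGTUN]/, "", seq)
        }
        END {
            if (total == 0) { print "empty"; exit }
            if (nuc / total >= 0.9) print "nucleotide"; else print "protein"
        }' "$1"
}
```

With this, fasta_type mycontigs.fasta would be expected to print nucleotide and fasta_type referenceproteins.fasta to print protein before the two files are passed to prokka in the positions shown above.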

ADD REPLY • link 4.4 years ago by Joe 21k

score 0 · Answer 1 · 2019-12-11

0

4.4 years ago

claping ▴ 20

Thanks @Joe. I see now that I misinterpreted the vignette. This is a weird bacterium, but there is an appropriate reference genome for it (located here). I am able to download the DNA fasta file for this reference genome. However, it seems that I need a file like referenceproteins.fasta or referenceproteins.gbk that, as you state, contains sequences of amino acids with correct headers. Do you know how I can determine whether this reference genome has such files available (or how I could create them)? Thank you again for sharing your advice!

ADD COMMENT • link 4.4 years ago by claping ▴ 20

1

Yes, you just need to look at the right 'bit' of NCBI, or use the send-to-file functionality:

You can also send the whole annotation to a GenBank file
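
Once a GenBank file has been downloaded, it is worth checking that it actually carries annotated proteins; the converted file shown earlier in this thread had an empty FEATURES table. The sketch below assumes the standard GenBank flat-file layout (feature keys indented by five spaces) and, as far as I understand it, that Prokka takes its trusted proteins from CDS features with a /translation qualifier. The function name gbk_cds_summary is hypothetical.

```shell
# Report how many CDS features and /translation qualifiers a GenBank file has.
# Assumption: standard flat-file layout with feature keys at column 6.
gbk_cds_summary() {
    awk '
        /^     CDS /       { cds++ }   # feature key line
        /\/translation=/   { tr++ }    # protein sequence qualifier
        END { printf "CDS=%d translation=%d\n", cds, tr }' "$1"
}
```

A file reporting zero for both counts contains no protein annotations to reuse, whatever its extension.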

ADD REPLY • link 4.4 years ago by Joe 21k