Question

Outputting strain name on prokka annotation

0

Entering edit mode

5.1 years ago

genomes_and_MGEs ▴ 10

Hey guys,

When I annotate genomes on prokka, using prokka --outdir mydir --prefix mygenome contigs.fa, I realized that the output .fna file only has this designation on fasta-header

>NODE_57_length_4609_cov_3.05627
ATTTTATTATGGTGATCCCCTGGGCGAAATGCGCCTGGTAAGCAGAGTTTTTGAAATGTA
AGGCCTTTGAATAAGACAAAAGGCTGCCTCATCGCTAACTTTGCAACAGTGCCCTTGATA
TCTAGTATGACGTCT...

How can I optimize the command, so that the strain name appears before NODE? Thanks!

genome sequence • 4.9k views

ADD COMMENT • link 5.1 years ago by genomes_and_MGEs ▴ 10

0

Entering edit mode

Did you try the --genus Escherichia --species coli --strain POO247 command-line arguments?

ADD REPLY • link 5.1 years ago by h.mon 35k

0

Entering edit mode

The question is that I have several genomes to annotate, and it would be great to create a loop to annotate the .fna file with the corresponding strain name!

ADD REPLY • link 5.1 years ago by genomes_and_MGEs ▴ 10

0

Entering edit mode

Put that command in a loop. This is just an outline.

for i in *.fna.gz
do
name=$(basename $i .fna.gz)
# you could break name further
prokka --outdir mydir --prefix mygenome contigs.fa --genus ${name} ...
done

ADD REPLY • link 5.1 years ago by GenoMax 142k

0

Entering edit mode

Didn't work... I still get the >NODE... designation within the .fna file. No strain name on the fasta-headers

ADD REPLY • link 5.1 years ago by genomes_and_MGEs ▴ 10

0

Entering edit mode

This was not code that was going to work. It was just an idea of how you could construct something that will allow you to use @h.mon's suggestion of using the correct command line options for prokka. Post some names you have and we can work on code that will get you closer to the solution.

ADD REPLY • link 5.1 years ago by GenoMax 142k

0

Entering edit mode

So, I'm working on several E. coli genomes. One of the fasta files has this name: Escherichia_coli_146411.fasta Within the file, the fasta-headers are annotated like this

NODE_59_length_1215_cov_158.1 TGATCCTACCCAC... NODE_38_length_6200_cov_3508.08 CCCAGTGTTC... And so on. If I simply run prokka on this, it will output a .fna file with the exact same fasta-headers. If I run a loop like this for F in *.fna; do N=$(basename $F .fna) ; prokka --locustag $N --outdir $N --prefix $N $F ; done

It will create an output file Escherichia_coli_146411, with all relevant outputs (proteins and coding sequences) annotated accordingly to the strain name, but the outputted .fna file will maintain the exact fasta-headers as the input. Can you help me out?

ADD REPLY • link 5.1 years ago by genomes_and_MGEs ▴ 10

0

Entering edit mode

Thanks, this should work! Could you please convert this into a loop? Considering I have the strain name on the fasta filename, I would have to create a loop to fetch the strain filename and output it on the fasta-header of each contig on the .fna file. Thanks!

ADD REPLY • link 5.1 years ago by genomes_and_MGEs ▴ 10

0

Entering edit mode

There's no way to do this directly in the fasta header. You'll have to modify it after the fact. The --locus_tag argument governs the filenames however, so you can delineate your files based on that and modify afterwards using Genomax's suggestions (among others).

ADD REPLY • link 5.1 years ago by Joe 21k

h.mon · Answer 1 · 2019-03-22

1

Entering edit mode

5.1 years ago

GenoMax 142k

In this case if you need to add the same strain name to each fasta header use the following tool from BBMap suite.

$ bbrename.sh in=new.fa out=renamed.fa prefix=Escherichia_coli_146411 addprefix=t

>Escherichia_coli_146411_NODE_57_length_4609_cov_3.05627
ATTTTATTATGGTGATCCCCTGGGCGAAATGCGCCTGGTAAGCAGAGTTTTTGAAATGTAAGGCCTTTGA
ATAAGACAAAAGGCTGCCTCATCGCTAACTTTGCAACAGTGCCCTTGATATCTAGTATGACGTCT
>Escherichia_coli_146411_NODE_58_length_4609_cov_3.05627
ATTTTATTATGGTGATCCCCTGGGCGAAATGCGCCTGGTAAGCAGAGTTTTTGAAATGTAAGGCCTTTGA
ATAAGACAAAAGGCTGCCTCATCGCTAACTTTGCAACAGTGCCCTTGATATCTAGTATGACGTCT
>Escherichia_coli_146411_NODE_53_length_4609_cov_3.05627
ATTTTATTATGGTGATCCCCTGGGCGAAATGCGCCTGGTAAGCAGAGTTTTTGAAATGTAAGGCCTTTGA
ATAAGACAAAAGGCTGCCTCATCGCTAACTTTGCAACAGTGCCCTTGATATCTAGTATGACGTCT

ADD COMMENT • link 5.1 years ago by GenoMax 142k

0

Entering edit mode

Thanks, this should work! Could you please convert this into a loop? Considering I have the strain name on the fasta filename, I would have to create a loop to fetch the strain filename and output it on the fasta-header of each contig on the .fna file. Thanks!

ADD REPLY • link 5.1 years ago by genomes_and_MGEs ▴ 10

0

Entering edit mode

You can do that yourself. Should be a good exercise.

ADD REPLY • link 5.1 years ago by GenoMax 142k

0

Entering edit mode

Yeah, already tried with many approaches but couldn't manage. For example, I tried this

for F in *.fasta; 
  do N=$(basename $F .fasta) ;
  bbrename.sh out=$N prefix=$N  $F addprefix=t ; 
done

Can you please help me optimize this?

ADD REPLY • link updated 5.1 years ago by h.mon 35k • written 5.1 years ago by genomes_and_MGEs ▴ 10

0

Entering edit mode

This is not how you execute the bbrename.sh command. You can't omit required inputs e.g.in= and use random order of variables. You are on the right track though.

ADD REPLY • link 5.1 years ago by GenoMax 142k

0

Entering edit mode

I tried this

for F in *.fasta; do N=$(basename $F .fasta) ; bbrename.sh in=$F out=$N prefix=$F addprefix=t ; done

and

for F in *.fasta; do N=$(basename $F .fasta) ; bbrename.sh in=*.fasta out=$N prefix=$F addprefix=t ; done

but didn't work. Can you help me optimize this?

ADD REPLY • link updated 5.1 years ago by Joe 21k • written 5.1 years ago by genomes_and_MGEs ▴ 10

0

Entering edit mode

In what way didn't it work? Please try to provide more information.

ADD REPLY • link 5.1 years ago by Joe 21k

0

Entering edit mode

For each, I get this

java -ea -Xmx1g -cp /home/luisa/bbmap/current/ jgi.RenameReads in=Escherichia_coli_VREC0208.fasta out=Escherichia_coli_VREC0208 prefix=Escherichia_coli_VREC0208.fasta addprefix=t
Executing jgi.RenameReads [in=Escherichia_coli_VREC0208.fasta, out=Escherichia_coli_VREC0208, prefix=Escherichia_coli_VREC0208.fasta, addprefix=t]

Unspecified format for output Escherichia_coli_VREC0208; defaulting to fastq.
java.lang.Exception: 
An input file appears to be misformatted:
The character with ASCII code 69 appeared where a base was expected: 'E'
Sequence #58
Sequence ID: 'NODE_53_length_203_cov_10.5556_ID_105'
Sequence: '[84, 71, 84, 84, 65, 67, 67, 71, 67, 67, 71, 84, 71, 65, 65, 65, 71, 71, 71, 67, 71, 71, 84, 71, 84, 67, 67, 84, 71, 71, 71, 67, 67, 84, 67, 84, 65, 71, 65, 67, 71, 65, 65, 71, 71, 71, 71, 65, 67, 71, 84, 65, 84, 67, 65, 71, 84, 67, 84, 71, 67, 84, 84, 67, 71, 67, 65, 65, 71, 65, 67, 71, 67, 67, 84, 84, 71, 67, 84, 84, 84, 84, 67, 65, 67, 84, 84, 84, 67, 84, 65, 84, 67, 65, 71, 65, 67, 65, 65, 84, 67, 84, 71, 84, 71, 84, 71, 65, 71, 67, 65, 67, 84, 71, 67, 65, 65, 65, 71, 84, 65, 67, 71, 67, 84, 84, 67, 84, 84, 84, 65, 65, 71, 71, 84, 65, 65, 71, 71, 65, 71, 71, 84, 71, 65, 84, 67, 67, 65, 65, 67, 67, 71, 67, 65, 71, 71, 84, 84, 67, 67, 67, 67, 84, 65, 67, 71, 71, 84, 84, 65, 67, 67, 84, 84, 71, 84, 84, 65, 67, 71, 65, 67, 84, 84, 67, 65, 67, 67, 67, 67, 65, 71, 84, 67, 65, 84, 71, 65, 65, 84, 67, 65, 69, 115, 99, 104, 101, 114, 105, 99, 104, 105, 97, 95, 99, 111, 108, 105, 95]
TGTTACCGCCGTGAAAGGGCGGTGTCCTGGGCCTCTAGACGAAGGGGACGTATCAGTCTGCTTCGCAAGACGCCTTGCTTTTCACTTTCTATCAGACAATCTGTGTGAGCACTGCAAAGTACGCTTCTTTAAGGTAAGGAGGTGATCCAACCGCAGGTTCCCCTACGGTTACCTTGTTACGACTTCACCCCAGTCATGAATCAEscherichia_coli_'

This can be bypassed with the flag 'tossjunk', 'fixjunk', or 'ignorejunk'
    at shared.KillSwitch.kill(KillSwitch.java:96)
    at stream.Read.validateCommonCase_branchless(Read.java:386)
    at stream.Read.validate(Read.java:113)
    at stream.Read.<init>(Read.java:76)
    at stream.Read.<init>(Read.java:50)
    at stream.FastaReadInputStream.generateRead(FastaReadInputStream.java:267)
    at stream.FastaReadInputStream.fillList(FastaReadInputStream.java:182)
    at stream.FastaReadInputStream.nextList(FastaReadInputStream.java:91)
    at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:680)
    at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:656)

ADD REPLY • link updated 5.1 years ago by GenoMax 142k • written 5.1 years ago by genomes_and_MGEs ▴ 10

0

Entering edit mode

You loop is NOT producing the correct command line. It is important for you to understand what the commands are doing rather than just typing them out. While we can give you the correct command for this one instance you will not be able to modify it for other cases unless you "get" the idea behind writing these loops.

Please see the output of the following.

for F in *.fasta; do N=$(basename $F .fasta) ; echo bbrename.sh in=$F out=$N prefix=$F addprefix=t ; done

Once you make the appropriate changes so the command lines look like what I had in my original answer (but with the right input and output files), take the word echo out to execute them. Hint: You are not using $N and $F in right places and to create the right file names. e.g. output file name needs to become something line out=${N}_mod.fasta.

ADD REPLY • link 5.1 years ago by GenoMax 142k

0

Entering edit mode

Thanks, it's working out! Thank you for your patience, but I'm just a beginner on this. Cheers

ADD REPLY • link 5.1 years ago by genomes_and_MGEs ▴ 10

0

Entering edit mode

You've clearly munged your fasta files if you actually read that error. You have a non-nucleotide character in the DNA sequence.

This is probably because you've run the command incorrectly and wrecked the files in your current directory. You'll need to re-create them before trying this step (correctly) again.

ADD REPLY • link 5.1 years ago by Joe 21k