Question: Outputting strain name on prokka annotation
0
gravatar for genomes_and_MGEs
28 days ago by
genomes_and_MGEs0 wrote:

Hey guys,

When I annotate genomes on prokka, using prokka --outdir mydir --prefix mygenome contigs.fa, I realized that the output .fna file only has this designation on fasta-header

>NODE_57_length_4609_cov_3.05627
ATTTTATTATGGTGATCCCCTGGGCGAAATGCGCCTGGTAAGCAGAGTTTTTGAAATGTA
AGGCCTTTGAATAAGACAAAAGGCTGCCTCATCGCTAACTTTGCAACAGTGCCCTTGATA
TCTAGTATGACGTCT...

How can I optimize the command, so that the strain name appears before NODE? Thanks!

sequence genome • 200 views
ADD COMMENTlink modified 25 days ago • written 28 days ago by genomes_and_MGEs0

Did you try the --genus Escherichia --species coli --strain POO247 command-line arguments?

ADD REPLYlink modified 28 days ago • written 28 days ago by h.mon24k

The question is that I have several genomes to annotate, and it would be great to create a loop to annotate the .fna file with the corresponding strain name!

ADD REPLYlink written 28 days ago by genomes_and_MGEs0

Put that command in a loop. This is just an outline.

for i in *.fna.gz
do
name=$(basename $i .fna.gz)
# you could break name further
prokka --outdir mydir --prefix mygenome contigs.fa --genus ${name} ...
done
ADD REPLYlink written 28 days ago by genomax65k

Didn't work... I still get the >NODE... designation within the .fna file. No strain name on the fasta-headers

ADD REPLYlink written 28 days ago by genomes_and_MGEs0

This was not code that was going to work. It was just an idea of how you could construct something that will allow you to use @h.mon's suggestion of using the correct command line options for prokka. Post some names you have and we can work on code that will get you closer to the solution.

ADD REPLYlink modified 28 days ago • written 28 days ago by genomax65k

So, I'm working on several E. coli genomes. One of the fasta files has this name: Escherichia_coli_146411.fasta Within the file, the fasta-headers are annotated like this

NODE_59_length_1215_cov_158.1 TGATCCTACCCAC... NODE_38_length_6200_cov_3508.08 CCCAGTGTTC... And so on. If I simply run prokka on this, it will output a .fna file with the exact same fasta-headers. If I run a loop like this for F in *.fna; do N=$(basename $F .fna) ; prokka --locustag $N --outdir $N --prefix $N $F ; done

It will create an output file Escherichia_coli_146411, with all relevant outputs (proteins and coding sequences) annotated accordingly to the strain name, but the outputted .fna file will maintain the exact fasta-headers as the input. Can you help me out?

ADD REPLYlink written 28 days ago by genomes_and_MGEs0

Thanks, this should work! Could you please convert this into a loop? Considering I have the strain name on the fasta filename, I would have to create a loop to fetch the strain filename and output it on the fasta-header of each contig on the .fna file. Thanks!

ADD REPLYlink written 25 days ago by genomes_and_MGEs0

There's no way to do this directly in the fasta header. You'll have to modify it after the fact. The --locus_tag argument governs the filenames however, so you can delineate your files based on that and modify afterwards using Genomax's suggestions (among others).

ADD REPLYlink written 25 days ago by jrj.healey11k
1
gravatar for genomax
28 days ago by
genomax65k
United States
genomax65k wrote:

In this case if you need to add the same strain name to each fasta header use the following tool from BBMap suite.

$ bbrename.sh in=new.fa out=renamed.fa prefix=Escherichia_coli_146411 addprefix=t

>Escherichia_coli_146411_NODE_57_length_4609_cov_3.05627
ATTTTATTATGGTGATCCCCTGGGCGAAATGCGCCTGGTAAGCAGAGTTTTTGAAATGTAAGGCCTTTGA
ATAAGACAAAAGGCTGCCTCATCGCTAACTTTGCAACAGTGCCCTTGATATCTAGTATGACGTCT
>Escherichia_coli_146411_NODE_58_length_4609_cov_3.05627
ATTTTATTATGGTGATCCCCTGGGCGAAATGCGCCTGGTAAGCAGAGTTTTTGAAATGTAAGGCCTTTGA
ATAAGACAAAAGGCTGCCTCATCGCTAACTTTGCAACAGTGCCCTTGATATCTAGTATGACGTCT
>Escherichia_coli_146411_NODE_53_length_4609_cov_3.05627
ATTTTATTATGGTGATCCCCTGGGCGAAATGCGCCTGGTAAGCAGAGTTTTTGAAATGTAAGGCCTTTGA
ATAAGACAAAAGGCTGCCTCATCGCTAACTTTGCAACAGTGCCCTTGATATCTAGTATGACGTCT
ADD COMMENTlink modified 28 days ago • written 28 days ago by genomax65k

Thanks, this should work! Could you please convert this into a loop? Considering I have the strain name on the fasta filename, I would have to create a loop to fetch the strain filename and output it on the fasta-header of each contig on the .fna file. Thanks!

ADD REPLYlink written 25 days ago by genomes_and_MGEs0

You can do that yourself. Should be a good exercise.

ADD REPLYlink written 25 days ago by genomax65k

Yeah, already tried with many approaches but couldn't manage. For example, I tried this

for F in *.fasta; 
  do N=$(basename $F .fasta) ;
  bbrename.sh out=$N prefix=$N  $F addprefix=t ; 
done

Can you please help me optimize this?

ADD REPLYlink modified 25 days ago by h.mon24k • written 25 days ago by genomes_and_MGEs0

This is not how you execute the bbrename.sh command. You can't omit required inputs e.g.in= and use random order of variables. You are on the right track though.

ADD REPLYlink written 25 days ago by genomax65k

I tried this

for F in *.fasta; do N=$(basename $F .fasta) ; bbrename.sh in=$F out=$N prefix=$F addprefix=t ; done

and

for F in *.fasta; do N=$(basename $F .fasta) ; bbrename.sh in=*.fasta out=$N prefix=$F addprefix=t ; done

but didn't work. Can you help me optimize this?

ADD REPLYlink modified 25 days ago by jrj.healey11k • written 25 days ago by genomes_and_MGEs0

In what way didn't it work? Please try to provide more information.

ADD REPLYlink written 25 days ago by jrj.healey11k

For each, I get this

java -ea -Xmx1g -cp /home/luisa/bbmap/current/ jgi.RenameReads in=Escherichia_coli_VREC0208.fasta out=Escherichia_coli_VREC0208 prefix=Escherichia_coli_VREC0208.fasta addprefix=t
Executing jgi.RenameReads [in=Escherichia_coli_VREC0208.fasta, out=Escherichia_coli_VREC0208, prefix=Escherichia_coli_VREC0208.fasta, addprefix=t]

Unspecified format for output Escherichia_coli_VREC0208; defaulting to fastq.
java.lang.Exception: 
An input file appears to be misformatted:
The character with ASCII code 69 appeared where a base was expected: 'E'
Sequence #58
Sequence ID: 'NODE_53_length_203_cov_10.5556_ID_105'
Sequence: '[84, 71, 84, 84, 65, 67, 67, 71, 67, 67, 71, 84, 71, 65, 65, 65, 71, 71, 71, 67, 71, 71, 84, 71, 84, 67, 67, 84, 71, 71, 71, 67, 67, 84, 67, 84, 65, 71, 65, 67, 71, 65, 65, 71, 71, 71, 71, 65, 67, 71, 84, 65, 84, 67, 65, 71, 84, 67, 84, 71, 67, 84, 84, 67, 71, 67, 65, 65, 71, 65, 67, 71, 67, 67, 84, 84, 71, 67, 84, 84, 84, 84, 67, 65, 67, 84, 84, 84, 67, 84, 65, 84, 67, 65, 71, 65, 67, 65, 65, 84, 67, 84, 71, 84, 71, 84, 71, 65, 71, 67, 65, 67, 84, 71, 67, 65, 65, 65, 71, 84, 65, 67, 71, 67, 84, 84, 67, 84, 84, 84, 65, 65, 71, 71, 84, 65, 65, 71, 71, 65, 71, 71, 84, 71, 65, 84, 67, 67, 65, 65, 67, 67, 71, 67, 65, 71, 71, 84, 84, 67, 67, 67, 67, 84, 65, 67, 71, 71, 84, 84, 65, 67, 67, 84, 84, 71, 84, 84, 65, 67, 71, 65, 67, 84, 84, 67, 65, 67, 67, 67, 67, 65, 71, 84, 67, 65, 84, 71, 65, 65, 84, 67, 65, 69, 115, 99, 104, 101, 114, 105, 99, 104, 105, 97, 95, 99, 111, 108, 105, 95]
TGTTACCGCCGTGAAAGGGCGGTGTCCTGGGCCTCTAGACGAAGGGGACGTATCAGTCTGCTTCGCAAGACGCCTTGCTTTTCACTTTCTATCAGACAATCTGTGTGAGCACTGCAAAGTACGCTTCTTTAAGGTAAGGAGGTGATCCAACCGCAGGTTCCCCTACGGTTACCTTGTTACGACTTCACCCCAGTCATGAATCAEscherichia_coli_'

This can be bypassed with the flag 'tossjunk', 'fixjunk', or 'ignorejunk'
    at shared.KillSwitch.kill(KillSwitch.java:96)
    at stream.Read.validateCommonCase_branchless(Read.java:386)
    at stream.Read.validate(Read.java:113)
    at stream.Read.<init>(Read.java:76)
    at stream.Read.<init>(Read.java:50)
    at stream.FastaReadInputStream.generateRead(FastaReadInputStream.java:267)
    at stream.FastaReadInputStream.fillList(FastaReadInputStream.java:182)
    at stream.FastaReadInputStream.nextList(FastaReadInputStream.java:91)
    at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:680)
    at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:656)
ADD REPLYlink modified 25 days ago by genomax65k • written 25 days ago by genomes_and_MGEs0

You loop is NOT producing the correct command line. It is important for you to understand what the commands are doing rather than just typing them out. While we can give you the correct command for this one instance you will not be able to modify it for other cases unless you "get" the idea behind writing these loops.

Please see the output of the following.

for F in *.fasta; do N=$(basename $F .fasta) ; echo bbrename.sh in=$F out=$N prefix=$F addprefix=t ; done

Once you make the appropriate changes so the command lines look like what I had in my original answer (but with the right input and output files), take the word echo out to execute them. Hint: You are not using $N and $F in right places and to create the right file names. e.g. output file name needs to become something line out=${N}_mod.fasta.

ADD REPLYlink modified 25 days ago • written 25 days ago by genomax65k

Thanks, it's working out! Thank you for your patience, but I'm just a beginner on this. Cheers

ADD REPLYlink written 24 days ago by genomes_and_MGEs0

You've clearly munged your fasta files if you actually read that error. You have a non-nucleotide character in the DNA sequence.

This is probably because you've run the command incorrectly and wrecked the files in your current directory. You'll need to re-create them before trying this step (correctly) again.

ADD REPLYlink written 24 days ago by jrj.healey11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 749 users visited in the last hour