I have been given the task of assembling a 'new' E. coli genome and analysing the genes present.
The E. coli is a new strain and was sequenced on a NextSeq 500 in high-output mode with 150 bp paired-end reads. The 'raw' files that I have are the forward and reverse reads.
I first ran QC on the 'raw' files, then ran them through Trim Galore and checked the QC again afterwards.
For the next step, I need to assemble my genome. I have been told that SPAdes will run a 'de novo' assembly for me, and that I can then put that assembly into Prokka for gene annotation.
Is this the best way to assemble the genome and annotate it? Or should I use another method? I am thinking that I should use a 'mapping' technique to assemble the genome using the E. coli O157:H7 genome as a reference, but I have no idea how to do this. I would say that I am at an intermediate level with Unix, but I am by no means a bioinformatician. Some help and guidance would be greatly appreciated!
SPAdes + Prokka is pretty much the de facto standard these days. There's little need to deviate unless you have very specific reasons.
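In case it helps, here is a rough sketch of how the two steps chain together. It only builds and prints the command lines rather than running anything, so you can check them before launching. The file names, the `results` directory and the sample name are made up for illustration; the thread count is an assumption you should adjust to your machine. The flags are the basic ones I would expect to need (`-1`, `-2`, `-o`, `-t` for SPAdes; `--outdir`, `--prefix`, `--genus`, `--species`, `--cpus` for Prokka), but do check them against the version you have installed.

```python
"""Sketch: build SPAdes and Prokka command lines for one paired-end isolate."""
import shlex
from pathlib import Path


def spades_command(r1, r2, outdir, threads=8):
    """Return the argument list for a paired-end de novo SPAdes assembly."""
    return [
        "spades.py",
        "-1", str(r1),        # trimmed forward reads
        "-2", str(r2),        # trimmed reverse reads
        "-o", str(outdir),    # SPAdes writes contigs.fasta into this directory
        "-t", str(threads),
    ]


def prokka_command(contigs, outdir, prefix, cpus=8):
    """Return the argument list for annotating an assembly with Prokka."""
    return [
        "prokka",
        "--outdir", str(outdir),
        "--prefix", prefix,
        "--genus", "Escherichia",
        "--species", "coli",
        "--cpus", str(cpus),
        str(contigs),         # the assembly FASTA is the positional argument
    ]


def pipeline_commands(r1, r2, workdir, sample="sample", threads=8):
    """Return both commands in run order: assembly first, then annotation."""
    workdir = Path(workdir)
    assembly_dir = workdir / "spades"
    contigs = assembly_dir / "contigs.fasta"
    return [
        spades_command(r1, r2, assembly_dir, threads),
        prokka_command(contigs, workdir / "prokka", sample, threads),
    ]


if __name__ == "__main__":
    # Hypothetical file names, in the style Trim Galore uses for paired output.
    for cmd in pipeline_commands(
        "strain_R1_val_1.fq.gz", "strain_R2_val_2.fq.gz", "results", sample="my_strain"
    ):
        print(shlex.join(cmd))
```

To actually run the steps you could pass each list to `subprocess.run(cmd, check=True)`, or just paste the printed lines into your shell. Note that Prokka will normally refuse to write into an output directory that already exists.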
You might gain improvements using a reference-guided assembler such as MIRA if your strains are very close, but don't map your reads first: you'd just be discarding data for no reason. Instead, let MIRA (or whatever you use) decide that for you. E. coli in particular is known for its divergence, so a de novo assembly via SPAdes or similar is probably best - at least for a first pass.
The most compelling alternative assembly/annotation pipeline I can think of would be SKESA and PGAP, which are NCBI's tools. If you uploaded the data to NCBI, that is the assembly you would get back, so it can be useful to have.
Thank you very much for your reply! I think I will try both methods (SPAdes & Prokka, and the NCBI tools) and see which one I get on with best.