Hi, I'm a new bioinformatician and wondering if what I want is possible or even logical.
I sequenced a (slightly) mutant E. coli; the genome should be approx 4.6GB. I used SPAdes to do a de novo assembly from my paired-end reads, but I ended up with 2000 contigs, the largest of which is 25 kb. My sequenced bacteria should be very similar to a published genome. Is there a way to use this published genome to help build my contigs/scaffolds, rather than relying only on SPAdes for de novo assembly? Reading the SPAdes manual didn't really clear it up for me.
spades.py -1 Merged_MG1655_runs1and2_R1.fastq.gz -2 Merged_MG1655_runs1and2_R2.fastq.gz -o Merged_MG1655_runs1and2_spades_output --only-assembler
Is what I was using. Thanks very much, any help much appreciated.
What you are looking for is reference-assisted genome assembly. There are a few suggestions in this thread: "Tools and parameters for reference assisted eukaryotic genome assembly using a draft genome as reference". These programs do need a de novo assembly, so you are on the right track.
IDBA-Hybrid is another example.
If the assembly is that bad with standard SPAdes, there may be a problem with the data. I would also suggest aligning the reads to the related published reference sequence. This is always helpful in my experience. You can also map your assembled contigs to the reference sequence, then call structural variations, and finally inspect these contigs and calls very carefully.
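To make the read-mapping suggestion concrete, here is a minimal sketch that only builds (and optionally runs) the usual bwa/samtools commands. It assumes bwa and samtools (1.3 or newer) are on your PATH; the helper names `mapping_commands` and `run`, the thread count, and any reference or output file names you pass in are placeholders of mine, not something from your setup.

```python
import shlex
import subprocess


def mapping_commands(reference, reads_r1, reads_r2, out_bam, threads=4):
    """Build shell commands that map paired-end reads to a reference.

    All paths are supplied by the caller; nothing is executed here.
    """
    ref, r1, r2, bam = (shlex.quote(p) for p in (reference, reads_r1, reads_r2, out_bam))
    return [
        # Build the BWA index for the reference (needed once per reference).
        f"bwa index {ref}",
        # Align the read pairs and write a coordinate-sorted BAM.
        f"bwa mem -t {threads} {ref} {r1} {r2} | samtools sort -@ {threads} -o {bam} -",
        # Index the BAM so it can be opened in a genome viewer.
        f"samtools index {bam}",
        # Report how many reads mapped and how many were properly paired.
        f"samtools flagstat {bam}",
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(cmd)
        if not dry_run:
            # check=True raises if the last command in a pipeline fails.
            subprocess.run(cmd, shell=True, check=True)
```

Calling `run(mapping_commands("reference.fasta", "Merged_MG1655_runs1and2_R1.fastq.gz", "Merged_MG1655_runs1and2_R2.fastq.gz", "reads_vs_reference.bam"))` just prints the commands so you can check them first ("reference.fasta" stands in for the published genome). If only a small fraction of reads map in the flagstat output, that would point to a problem with the data rather than with the assembler.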
I would actually expect something like 100-200+ contigs, not 2000, from an Illumina paired-end de novo sequencing project.
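To compare your assembly against that expectation, a rough sketch like the following counts contigs and reports total length, largest contig, and N50 from a FASTA file. The function names and the `min_length` cutoff are my own choices for illustration; SPAdes writes its contigs to contigs.fasta inside the output directory.

```python
import gzip


def read_contig_lengths(fasta_path):
    """Return the length of every sequence in a (possibly gzipped) FASTA file."""
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    lengths = []
    current = None  # length of the record being read, None before the first header
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(lengths, min_length=0):
    """Summary statistics, optionally ignoring contigs shorter than min_length."""
    kept = [n for n in lengths if n >= min_length]
    return {
        "contigs": len(kept),
        "total_bp": sum(kept),
        "largest": max(kept, default=0),
        "n50": n50(kept),
    }
```

For example, `summarize(read_contig_lengths("Merged_MG1655_runs1and2_spades_output/contigs.fasta"), min_length=500)` ignores very short contigs, which often inflate the contig count. A total length far from the expected genome size is another hint that something is off with the data.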
I have used bwa mem to map my contigs back onto my reference genome and looked at the result using Artemis. What do you mean by calling structural variations, and which program would you recommend for this?
Probably Mb?