Bacterial genome assembly for comparative genomic analysis
1
1
20 months ago
artgrigorov ▴ 30

Hi everyone! I'm quite new to this field, so I need help because I don't understand the whole pipeline for my task. My lab has sequenced (Illumina, paired-end) two strains of M. tuberculosis. One of them is expected to be the control; the second contains mutations which I have to find. These mutations could be SNPs, deletions or large translocations. I tried to assemble the genomes de novo using SPAdes (within Unicycler), but there are a lot of contigs and it's difficult to compare them. Now I'm beginning to think that I could use the M. tuberculosis genome from NCBI (my control strain should be very similar to it). But I don't really understand whether that is correct in this case. If so, should I use reference-guided assembly and provide the M. tuberculosis genome as trusted contigs to SPAdes? Or should I just map my final contigs onto the reference genome? My second thought was to sequence my strains again using nanopore to generate long reads and finalize the assembly. Please tell me which pipeline I should use in my case. How can I find differences without accidentally losing information during assembly? Thanks to all!

assembly genome alignment software • 1.0k views
1

Have you checked your assemblies with QUAST (LINK)? If you can post your assembly stats here, we may be able to give you some advice. How much data did you use for the assemblies? Contrary to popular belief, having too much data can be detrimental to getting good assemblies.

My second thought was to sequence my strains again using nanopore to generate long reads and finalize the assembly.

If your assemblies can't be improved further then this may be what you would need to do.

But even then it may still be possible to identify SNPs and other variations from the data you have in hand, as long as you can identify genes/regions with certainty. You may not get a complete answer, but at least a usable one.

0

Thank you for your answer! Yes, I have attached the QUAST report for my two assemblies. It seems to me that the assemblies are quite good: I have about 110 contigs for each sample (I used about 6 and 4 million pairs of 100 bp reads, respectively). My N50 values are 125378 and 125378 (genome length: 4.4 Mb). I don't completely understand how to interpret the N50 value; I know that bigger is better, but that's all. I have now tried to compare my assemblies with each other and with the reference genome using Mauve, but it doesn't look very nice: a lot of misassembled contigs.
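To put the "how much data" question into numbers, here is a rough back-of-the-envelope sketch of the sequencing depth implied by the figures above (about 6 and 4 million pairs of 100 bp reads, 4.4 Mb genome). The 100x target used for the subsampling fraction is only an illustrative assumption, not a recommendation from this thread, and the function names are hypothetical.

```python
def estimated_coverage(read_pairs: int, read_length: int, genome_size: int) -> float:
    """Average depth = total sequenced bases / genome size (two reads per pair)."""
    total_bases = read_pairs * 2 * read_length
    return total_bases / genome_size


def subsample_fraction(depth: float, target_depth: float) -> float:
    """Fraction of read pairs to keep to reach roughly target_depth (capped at 1.0)."""
    return min(1.0, target_depth / depth)


GENOME_SIZE = 4_400_000  # approximate genome length quoted in the thread
TARGET_DEPTH = 100.0     # assumption for illustration only

for label, pairs in [("sample 1", 6_000_000), ("sample 2", 4_000_000)]:
    depth = estimated_coverage(pairs, 100, GENOME_SIZE)
    fraction = subsample_fraction(depth, TARGET_DEPTH)
    print(f"{label}: ~{depth:.0f}x depth, keep ~{fraction:.2f} of pairs for ~{TARGET_DEPTH:.0f}x")
```

With these numbers the two samples come out at roughly 273x and 182x, so trying an assembly on a random subsample of the reads and comparing the QUAST stats would be one way to check whether the extra depth helps or hurts.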

0

Something weird is going on: you have two good assemblies, so it shouldn't look that scrambled. Did you map the contigs onto the reference genome before aligning them with Mauve? It appears that you did not reorder the contigs of your assemblies against the reference, because they look like they are ordered by size (as they came out of SPAdes). 🤔
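To illustrate what "reordering the contigs against the reference" means, here is a minimal sketch. It assumes you already have alignment hits of each contig to the reference (from whatever aligner you prefer); the `Hit` record and `order_contigs` function are hypothetical names made up for this example. It is a simplification and not what Mauve's own contig reordering does internally: it ignores strand and simply places each contig at the reference position of its longest hit.

```python
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    contig: str          # contig name from the assembly
    ref_start: int       # start of the alignment on the reference
    aligned_length: int  # number of aligned bases


def order_contigs(hits: Iterable[Hit]) -> list[str]:
    """Return contig names sorted by the reference position of their longest hit."""
    best: dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.contig)
        if current is None or hit.aligned_length > current.aligned_length:
            best[hit.contig] = hit
    return [h.contig for h in sorted(best.values(), key=lambda h: h.ref_start)]


# Made-up hits: contigs numbered by size, as they come out of the assembler.
hits = [
    Hit("contig_1", 3_000_000, 250_000),
    Hit("contig_2", 10_000, 180_000),
    Hit("contig_3", 1_200_000, 90_000),
    Hit("contig_3", 4_000_000, 5_000),  # short secondary hit, ignored
]
print(order_contigs(hits))
```

Contigs ordered by size and contigs ordered along the reference are the same sequences, but a whole-genome aligner will draw the first case as a scrambled picture even when nothing is actually rearranged.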

1

Yes, thank you! Of course, that was stupid of me. Now it's much better.

0

Hahaha, don't say that, everybody makes mistakes. Now you can map your reads onto your assemblies with an aligner (like bowtie2) to see whether those recombination blocks are real or misassemblies.
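A minimal sketch of that mapping step, assuming bowtie2 and samtools are installed and on your PATH. The file names and the `mapping_commands` helper are placeholders invented for this example; the script only builds and prints the commands, so you can inspect them before running anything.

```python
import shlex


def mapping_commands(assembly_fasta: str, reads_1: str, reads_2: str,
                     prefix: str, threads: int = 4) -> list[list[str]]:
    """Build the commands to index an assembly, map paired reads, and sort/index the BAM."""
    index = f"{prefix}_index"
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    return [
        # Build a bowtie2 index from the assembled contigs.
        ["bowtie2-build", assembly_fasta, index],
        # Map the paired-end reads back onto the contigs.
        ["bowtie2", "-p", str(threads), "-x", index,
         "-1", reads_1, "-2", reads_2, "-S", sam],
        # Sort and index the alignments for viewing in a genome browser.
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        ["samtools", "index", bam],
    ]


# Placeholder file names; replace them with your own.
commands = mapping_commands("contigs.fasta", "reads_R1.fastq.gz",
                            "reads_R2.fastq.gz", "sample1")
for cmd in commands:
    print(shlex.join(cmd))
```

To actually execute them, pass each list to `subprocess.run(cmd, check=True)`. Then look at the read coverage and the read pairs around the suspicious block boundaries: a real rearrangement should be spanned by properly paired reads in your own assembly, while a misassembly usually shows a coverage drop or many discordant pairs at the junction.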

2
20 months ago
hugo.avila ▴ 320

Hi! Do not use a reference genome as trusted contigs in SPAdes; it will probably introduce some errors into your assembly. Try some in silico gap filling (here is another answer that may help you) to try to close your control genome. But if you have a good N50, you don't really need to close it or build a scaffold to find de novo mutations. You can do variant calling with contigs. Sometimes the process of closing a genome in silico can introduce errors that are mistaken for de novo mutations, so do consider whether you really need to close your genome.

1

Thank you, I will check the links! My N50 values are 125378 and 125378 (genome length: 4.4 Mb). I don't understand whether this is a good value or not. And most importantly, I don't want to lose major structural changes in the genome during the in silico gap filling process.
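Since the meaning of N50 comes up here: it is the contig length such that contigs of that length or longer contain at least half of the assembled bases. A small sketch of the calculation, using made-up contig lengths rather than the real assemblies (the `n50` function is just for illustration; QUAST reports this value for you):

```python
def n50(contig_lengths: list[int]) -> int:
    """Length L such that contigs of length >= L cover at least half of the assembly."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half of the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable: the loop always returns")


# Toy assembly of 290 bases: 80 + 70 = 150 is the first running sum >= 145.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))  # 70
```

So an N50 of 125378 on a 4.4 Mb genome means that at least half of the assembly (about 2.2 Mb) sits in contigs of roughly 125 kb or longer. It says how contiguous the assembly is, not whether the contigs are correct.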