Question: Too many contigs after genome assembly with SPAdes
Asked 15 months ago by ahmad mousavi (Royan Institute, Tehran, Iran):

Hi

I have sequenced a bacterial genome on an Illumina HiSeq (PE 150 bp). My library contains 600k reads, but after assembling with SPAdes (-k 21,33,55,77,99,111,127) the result is very poor: I got ~3400 contigs. I have no reference genome for my bacterium yet; we only know its family. The GC content is ~70%.

What do you suggest for reducing the number of contigs? Are there any better options than SPAdes for bacterial genome assembly?

Thanks

sequence assembly genome • 1.1k views
ADD COMMENTlink written 15 months ago by ahmad mousavi470

Not a bioinformatics solution, but your assembly could greatly improve by adding some long read sequencing data from Oxford Nanopore or PacBio, of which the former (MinION) can be reasonably cheap to obtain.

ADD REPLYlink modified 15 months ago • written 15 months ago by WouterDeCoster43k

A GC content that high probably also means it's repetitive. It's likely to be a sequencing nightmare. Your only options are to sequence deeper and to use other technologies, as Wouter said.

ADD REPLYlink written 15 months ago by Joe16k

We all agree with Wouter :), but would a high GC content not indicate a less repetitive genome? TEs (transposons?) are usually rather high in AT, so that would lower the overall GC, no?

ADD REPLYlink written 15 months ago by lieven.sterck6.9k

I was more thinking of consecutive repeats (e.g. GCGCGCGCGCG), rather than IS etc, which would fail to be picked up properly by the sequencer.

Nevertheless, there are other issues with high GC - the increased strand separation energy might be an issue for library preps and the actual sequencing reaction.

ADD REPLYlink modified 15 months ago • written 15 months ago by Joe16k
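
One rough way to check whether simple tandem repeats of the kind mentioned above (e.g. GCGCGCGCGCG) show up in the contigs is to scan each sequence for a dinucleotide unit repeated many times in a row. The sketch below is only an illustration: the function name and the threshold of 5 units are arbitrary assumptions, and regions that failed to sequence at all will of course not appear in the assembly.

```python
import re
from typing import List, Tuple


def find_dinucleotide_runs(seq: str, min_units: int = 5) -> List[Tuple[int, int, str]]:
    """Return (start, length, unit) for each run of a 2-base unit
    repeated at least `min_units` times consecutively.

    Note: homopolymer runs such as AAAAAAAAAA also match (unit "AA").
    """
    # ([ACGT]{2}) captures the unit; \1{n,} requires n further copies.
    pattern = re.compile(r"([ACGT]{2})\1{%d,}" % (min_units - 1))
    return [
        (m.start(), m.end() - m.start(), m.group(1))
        for m in pattern.finditer(seq.upper())
    ]


# Toy sequence (made up for illustration): six GC units flanked by other bases.
example = "AAT" + "GC" * 6 + "TTA"
runs = find_dinucleotide_runs(example)
print(runs)
```

A high number of long runs at contig ends would be consistent with the assembly breaking at such repeats, though it would not prove it.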

ah, ok, yep agreed in that case.

and totally on the problems (regardless of the 'cause') when extracting/lib-prep/sequencing in high GC situations

ADD REPLYlink written 15 months ago by lieven.sterck6.9k

Would it be possible to provide some more info on your project? E.g. the estimated genome size (what is the expected coverage)? Is it some 'weird/exotic' bacterium?

ADD REPLYlink written 15 months ago by lieven.sterck6.9k

Sorry, I have no idea. We estimate the genome size at ~7 Mb, but that is just an estimate. We aimed for 100x coverage.

ADD REPLYlink modified 15 months ago • written 15 months ago by ahmad mousavi470

So that will give you roughly 25x, which is on the low side but doable, I think.

ADD REPLYlink written 15 months ago by lieven.sterck6.9k
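
The coverage estimate above can be reproduced with simple arithmetic: expected depth is the total number of sequenced bases divided by the genome size. The sketch below uses the figures quoted in the thread (150 bp reads, ~7 Mb genome, which is only the poster's estimate). The ~25x figure appears to correspond to reading "600k reads" as 600k read pairs; that interpretation is an assumption, so both cases are shown.

```python
import math


def expected_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Expected mean depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def reads_for_coverage(target_depth: float, read_length: int, genome_size: int) -> int:
    """Number of individual reads needed to reach a target mean depth."""
    return math.ceil(target_depth * genome_size / read_length)


GENOME_SIZE = 7_000_000  # ~7 Mb, the poster's rough estimate
READ_LENGTH = 150        # bp

# Assumption: "600k reads" means 600k pairs, i.e. 1.2 million reads.
cov_pairs = expected_coverage(2 * 600_000, READ_LENGTH, GENOME_SIZE)
# Alternative reading: 600k individual reads.
cov_single = expected_coverage(600_000, READ_LENGTH, GENOME_SIZE)

print(f"600k pairs: {cov_pairs:.1f}x")
print(f"600k reads: {cov_single:.1f}x")
print("reads needed for 100x:", reads_for_coverage(100, READ_LENGTH, GENOME_SIZE))
```

Either way the depth is well below the 100x that was aimed for, which fits the advice to sequence deeper.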

It seems you have used several k-mer sizes. Is the contig number the same across all the k-mer sizes? @ahmad mousavi

ADD REPLYlink written 15 months ago by cpad011212k

SPAdes lets you define several k-mer sizes and automatically selects one based on the data, so I have a constant number of contigs.

ADD REPLYlink written 15 months ago by ahmad mousavi470

Did you have a look at the fastg files and the number of contigs for each k-mer size? You can also check how good your assembly is with Bandage: https://github.com/rrwick/Bandage. @ahmad mousavi

ADD REPLYlink modified 15 months ago • written 15 months ago by cpad011212k

No, I don't understand how the fastq file is related.

With smaller k-mer sizes I got more contigs.

ADD REPLYlink written 15 months ago by ahmad mousavi470

Not fastq, it is fastg (I updated the post). SPAdes outputs contigs for each k-mer size. With a higher k-mer size, the contig number goes down, but the relevance of such an assembly is in question. For that reason, you may need to use software like Bandage or QUAST/Icarus to identify the relevant assembly.

ADD REPLYlink modified 15 months ago • written 15 months ago by cpad011212k

SPAdes automatically chooses optimal k-mer sizes. The contigs.fasta in the top-level output directory (i.e. not inside one of the K*** folders) should be the 'optimal' assembly (if I remember correctly).

Optimal doesn't necessarily mean fewest contigs though.

ADD REPLYlink written 15 months ago by Joe16k
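
For a quick comparison of assemblies (for example the final contigs.fasta against the contigs written for an individual k-mer size), contig count alone can be misleading, so it helps to look at total length and N50 as well. The following is a minimal sketch of such a summary, not a replacement for QUAST. The function names are made up for illustration, and the demo uses a tiny in-memory FASTA rather than real SPAdes output.

```python
import io
from typing import Dict, Iterable, List


def read_fasta_lengths(handle: Iterable[str]) -> List[int]:
    """Return the length of every sequence in a FASTA stream."""
    lengths: List[int] = []
    current = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # New record: store the previous one, if any.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(lengths: List[int]) -> Dict[str, int]:
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


# Toy FASTA (made up): four contigs of 8, 6, 4 and 2 bp.
demo = io.StringIO(">c1\nACGT\nACGT\n>c2\nGGCCGG\n>c3\nACGT\n>c4\nGC\n")
stats = summarize(read_fasta_lengths(demo))
print(stats)
```

On real data, the same functions can be applied to an opened file, e.g. `with open(path) as fh: summarize(read_fasta_lengths(fh))`, where `path` points to whichever contig file is being compared.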