Question: Too many contigs after genome assembly with SPAdes
5 months ago, ahmad mousavi (Royan Institute, Tehran, Iran) wrote:

Hi

I have done bacterial genome sequencing using Illumina HiSeq PE 150 bp; my library contains 600k reads. After assembling with SPAdes (-k 21,33,55,77,99,111,127) the result is very poor: I got ~3400 contigs. I have no reference genome for my bacterium yet; we only know its family. The GC content is ~70%.

What do you suggest for decreasing the number of contigs? Are there any options better than SPAdes for bacterial genome assembly?

Thanks

sequence assembly genome • 480 views

Not a bioinformatics solution, but your assembly could greatly improve by adding some long read sequencing data from Oxford Nanopore or PacBio, of which the former (MinION) can be reasonably cheap to obtain.

modified 5 months ago • written 5 months ago by WouterDeCoster

A GC content that high probably also means it’s repetitive. It’s likely to be a sequencing nightmare. Your only options are to sequence deeper and to use other technologies, as Wouter said.

written 5 months ago by jrj.healey

We all agree with Wouter :), but would a high GC content not indicate a less repetitive genome? TEs (transposons?) are usually rather high in AT, so they would lower the overall GC, no?

written 5 months ago by lieven.sterck

I was thinking more of consecutive repeats (e.g. GCGCGCGCGCG), which would fail to be picked up properly by the sequencer, rather than IS elements etc.

Nevertheless, there are other issues with high GC - the increased strand separation energy might be an issue for library preps and the actual sequencing reaction.

modified 5 months ago • written 5 months ago by jrj.healey

ah, ok, yep agreed in that case.

and totally on the problems (regardless of the 'cause') when extracting/lib-prep/sequencing in high GC situations

written 5 months ago by lieven.sterck

Would it be possible to provide some more info on your project? E.g. the estimated genome size (what is the expected coverage?). Is it some 'weird/exotic' bacterium?

written 5 months ago by lieven.sterck

Sorry, I have no idea. We estimate the genome size at ~7 Mb, but that is just an estimate. We aimed for 100x coverage.

modified 5 months ago • written 5 months ago by ahmad mousavi

So that will give you roughly 25x (if the 600k are read pairs of 2 x 150 bp over a ~7 Mb genome), on the low side but doable I think.
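The arithmetic behind that figure can be sketched as follows. This is only a back-of-the-envelope estimate: the function name is made up for illustration, and whether "600k reads" means read pairs or individual reads is an assumption that changes the answer by a factor of two.

```python
def estimated_coverage(n_reads, read_length_bp, genome_size_bp, paired=True):
    """Expected mean depth: total sequenced bases divided by genome size."""
    # Assumption: if paired, n_reads counts read PAIRS, so each contributes two reads.
    reads_per_fragment = 2 if paired else 1
    total_bases = n_reads * reads_per_fragment * read_length_bp
    return total_bases / genome_size_bp


# Numbers from this thread: 600k reads, 150 bp, genome estimated at ~7 Mb.
as_pairs = estimated_coverage(600_000, 150, 7_000_000, paired=True)
as_single = estimated_coverage(600_000, 150, 7_000_000, paired=False)

print(f"{as_pairs:.1f}x if 600k are read pairs")        # 25.7x
print(f"{as_single:.1f}x if 600k are individual reads")  # 12.9x
```

Either way the depth is well below the 100x that was aimed for, and the genome size itself is only an estimate.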

written 5 months ago by lieven.sterck

It seems you have used several k-mer sizes. Is the contig number the same across all of them, ahmad mousavi?

written 5 months ago by cpad0112

SPAdes lets you define several k-mer sizes and it automatically selects one based on the data. So I have a constant number of contigs.

written 5 months ago by ahmad mousavi

Did you have a look at the fastg files and the number of contigs for each k-mer size? You can also check how good your assembly is with Bandage: https://github.com/rrwick/Bandage (ahmad mousavi)

modified 5 months ago • written 5 months ago by cpad0112

No, I don't understand how the fastq file is related.

With smaller k-mers I got more contigs.

written 5 months ago by ahmad mousavi

Not fastq, it is fastg (I updated the post). SPAdes outputs contigs for each k-mer size. With a higher k, the contig number goes down, but the relevance of such an assembly is in question. For that reason, you may need to use software like Bandage or QUAST/Icarus to identify the relevant assembly.
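As a rough first look before opening Bandage or QUAST, the per-k contig counts can be tallied with a few lines of Python. This is a minimal sketch, not a replacement for those tools: the function names and the `spades_output` directory are hypothetical, and the pattern `K*/*contigs.fasta` is an assumption about how the per-k folders are laid out, which may differ between SPAdes versions.

```python
from pathlib import Path


def contig_lengths(lines):
    """Return the length of every record in an iterable of FASTA lines."""
    lengths = []
    current = None  # None until the first header is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0  # empty input


def summarize_run(out_dir):
    """Map each contig FASTA found under out_dir to basic statistics."""
    out_dir = Path(out_dir)
    # Assumed layout: per-k folders named K21, K33, ... plus a top-level contigs.fasta.
    candidates = sorted(out_dir.glob("K*/*contigs.fasta")) + [out_dir / "contigs.fasta"]
    summary = {}
    for fasta in candidates:
        if fasta.exists():
            with open(fasta) as handle:
                lengths = contig_lengths(handle)
            summary[str(fasta)] = {
                "contigs": len(lengths),
                "total_bp": sum(lengths),
                "n50": n50(lengths),
            }
    return summary
```

Calling `summarize_run("spades_output")` and printing the result gives one line of numbers per k. Fewer contigs at a higher k is not by itself evidence of a better assembly, so total length and N50 are worth reading alongside the count.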

modified 5 months ago • written 5 months ago by cpad0112

SPAdes automatically chooses optimal k-mer sizes. The contigs.fasta in the output that is not inside one of the K*** folders should be the ‘optimal’ assembly (if I remember correctly).

Optimal doesn’t necessarily mean fewest contigs though.

written 5 months ago by jrj.healey