Question: SPAdes output with assembling warning
0
gravatar for DanielC
12 months ago by
DanielC90
Canada
DanielC90 wrote:

Dear Friends,

I ran SPAdes on an ion torrent bam file with single-reads of a bacteriophage sample using the command line:

spades.py --iontorrent -k 21,33,55,77,99,127 --only-assembler -s IonXpress.bam -o spades-out --careful --mismatch-correction

After the run was finished, there were error correction and assembling warnings:

=== Error correction and assembling warnings:
 * 0:02:06.148   464M / 9G    WARN    General                 (kmer_coverage_model.cpp   : 327)   Valley value was estimated improperly, reset to 1
 * 0:02:06.150   464M / 9G    WARN    General                 (kmer_coverage_model.cpp   : 366)   Failed to determine erroneous kmer threshold. Threshold set to: 1
 * 0:01:09.757   692M / 9G    WARN    General                 (kmer_coverage_model.cpp   : 327)   Valley value was estimated improperly, reset to 10
 * 0:01:09.758   692M / 9G    WARN    General                 (kmer_coverage_model.cpp   : 366)   Failed to determine erroneous kmer threshold. Threshold set to: 10
 * 0:00:51.891   688M / 9G    WARN    General                 (kmer_coverage_model.cpp   : 218)   Too many erroneous kmers, the estimates might be unreliable

Could you please let me know if I should be bothered about these warnings and what these warnings mean, and if there is a way these warnings can be eliminated and the efficiency of the assembling can be improved.

Thanks so much!

ADD COMMENTlink modified 12 months ago • written 12 months ago by DanielC90

Have you looked at the resulting assembly?

Warnings are usually just that, warnings. They can generally be ignored, if you know what you're doing. They are not the same as errors.

If the assembly you got back looks decent enough for your purposes, you can probably proceed.

To sanity check your assembly, check the usual assembly statistics (N50, number of contigs etc), and map the reads back to the assembly. You may want to check for any variants that the reads support that arent in the assembly or vice versa, just in case these weird kmer errors affected the assembly in any way.

ADD REPLYlink written 12 months ago by Joe14k

Thank you very much for your response! I looked at the result and I understand that good assembly should have less number of contigs and high N50 value. When looking at the results of the Kmers I found that K77 have the least number of contigs with high N50 like this:

K77 mer:

Assembly        final_contigs
# contigs (>= 0 bp)     2541
# contigs (>= 1000 bp)  21
# contigs (>= 5000 bp)  0
# contigs (>= 10000 bp) 0
# contigs (>= 25000 bp) 0
# contigs (>= 50000 bp) 0
Total length (>= 0 bp)  361612
Total length (>= 1000 bp)       32321
Total length (>= 5000 bp)       0
Total length (>= 10000 bp)      0
Total length (>= 25000 bp)      0
Total length (>= 50000 bp)      0
# contigs       79
Largest contig  2749
Total length    74890
GC (%)  41.75
N50     966
N75     736
L50     27
L75     49
# N's per 100 kbp       0.00

Please let me know what you think of this result.

Thanks much!

ADD REPLYlink modified 12 months ago • written 12 months ago by DanielC90

A largest contig of 2749 seems quite poor to me. Even for something like a bacteriophage, I would have expected better. It seems you will have a large number of contigs of about 1000bp or less, which generally speaking are not to be trusted.

Do you know what total genome size you’re expecting? Did you do much/any QC on your reads before assembly?

ADD REPLYlink written 12 months ago by Joe14k

The total size of the bacteriophage genome is generally about 40,000 bp. And, yes, I did not do the QC, I wanted to see how the reads perform without QC. Do you think QC will improve the results? Pleas let me know of your suggestions.

ADD REPLYlink written 12 months ago by DanielC90
1

You should do the QC first. It might explain your SPAdes warnings. There is obviously something up with the Kmer distribution in your data.

That could be because its a small genome, but it could also be dodgy reads.

I doubt QC will improve your results specifically, its more likely to tell your what your data died of.

There may be some things you can do like removing adapters, if they weren’t already removed. What is your depth of sequencing? It’s possible that for a small genome, you may have very deep coverage, which can cause assemblers to choke, so you may also see improvement from downsampling your reads.

Other people might have some alternative ideas, but otherwise I think it’s probably a case of resequencing the sample, it could just be a bad library.

ADD REPLYlink written 12 months ago by Joe14k

Thank you very much! Could you please suggest what QC package would be best for such data?

ADD REPLYlink written 12 months ago by DanielC90
1

The best tool I know of is MultiQC, but it only aggregates results from other QC tools.

FastQC is the obvious starting point, but its kmer warnings have been known to be spurious for a long time.

Start on the MultiQC website, and run as many of the tools they aggregate as possible would be my suggestion! It supports 68 different tools, as well as providing summary stats of its own, all of which can be seen on their site: https://multiqc.info/

ADD REPLYlink written 12 months ago by Joe14k

Thank you for suggestions! I ran ClinQC (https://sourceforge.net/p/clinqc/wiki/ClinQC_Manual/) with FastQC and the results after quality control are:

the best K-mer result (low number of contigs and high N50) is for K99 using SPAdes:

Assembly                        final_contigs
# contigs (>= 0 bp)             2370
# contigs (>= 1000 bp)          29
# contigs (>= 5000 bp)          0
# contigs (>= 10000 bp)         0
# contigs (>= 25000 bp)         0
# contigs (>= 50000 bp)         0
Total length (>= 0 bp)          407931
Total length (>= 1000 bp)       40828
Total length (>= 5000 bp)       0
Total length (>= 10000 bp)      0
Total length (>= 25000 bp)      0
Total length (>= 50000 bp)      0
# contigs                       79
Largest contig                  2747
Total length                    77347
GC (%)                          41.51
N50                             1022
N75                             781
L50                             27
L75                             49
# N's per 100 kbp               0.00

When compared to the previous result without QC (posted above), I don't see much difference. Your suggestions will be highly appreciated - if anything more can be done to improve the contig assembly. Do you think I should run MultiQC too for QC?

Thanks much!

ADD REPLYlink written 11 months ago by DanielC90
1

Honestly, my suspicion is that your input DNA library was probably not the best. Either that or that genome is incredibly repetitive (not beyond the realms of possibility for a bacteriophage).

The assembly you have might be sufficient for your needs, though I think what you're looking at there is just poor input data, which means poor assemblies out the other end.

It probably wants re-sequencing. It may be that the DNA you had input at the start was overly fragmented.

Have you examined the insert size distribution? (A tool like Quast will do this).

ADD REPLYlink written 11 months ago by Joe14k

Thanks much for your reply Healey! It is very informative. I ran quast and got the above result. Could you please tell me how to figure the insert size using Quast?

ADD REPLYlink written 11 months ago by DanielC90
1

Sorry, my mistake. I meant Qualimap, not Quast.

ADD REPLYlink written 11 months ago by Joe14k

Thanks much! Curious to know - here by "insert size" you mean DNA fragments put for sequencing?

ADD REPLYlink written 11 months ago by DanielC90
1

Yep, I’m wondering if you had many fragments that were shorter than they should have been.

ADD REPLYlink written 11 months ago by Joe14k

Hi,

Thanks for your previous helpful comments! I have got the contigs from spades and now looking to get the coverage. I have got this below result by using "bbmap.sh" on my data to get the coverage. Could you please tell me how to interpret the coverage here? Average coverage shown here is 209.654 - what does this mean? I would really appreciate your input.

command line:

bbmap.sh in=IonXpress.fastq ref=final_contigs.fasta covstats=covstats.txt

covstats.txt

 Genome:                 1
    Key Length:             13
    Max Indel:              16000
    Minimum Score Ratio:    0.56
    Mapping Mode:           normal
    Reads Used:             1821574 (553015125 bases)

    Mapping:                1648.545 seconds.
    Reads/sec:              1104.96
    kBases/sec:             335.46


    Read 1 data:            pct reads       num reads       pct bases          num bases

    mapped:                  99.2931%         1808698        99.2358%          548788917
    unambiguous:             80.4977%         1466326        83.8068%          463464066
    ambiguous:               18.7954%          342372        15.4290%           85324851
    low-Q discards:           0.0000%               0         0.0000%                  0

    perfect best site:        4.4512%           81082         2.4810%           13720350
    semiperfect site:        17.7655%          323611        14.4467%           79892367

    Match Rate:                   NA               NA        10.2835%          499947774
    Error Rate:              90.2312%         1639609        88.7859%         4316473287
    Sub Rate:                28.4581%          517117         0.0276%            1340991
    Del Rate:                68.6627%         1247684        88.7119%         4312877825
    Ins Rate:                53.1733%          966223         0.0464%            2254471
    N Rate:                  57.9214%         1052501         0.9307%           45245681

    Reads:                                  1821574
    Mapped reads:                           1529439
    Mapped bases:                           408501909
    Ref scaffolds:                          9236
    Ref bases:                              1948456

    Percent mapped:                         83.963
    Percent proper pairs:                   0.000
    Average coverage:                       209.654
    Standard deviation:                     645.173
    Percent scaffolds with any coverage:    76.09
    Percent of reference bases covered:     77.89

Thanks!

ADD REPLYlink modified 11 months ago • written 11 months ago by DanielC90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1895 users visited in the last hour