Question: What is "Total Length" in QUAST?
4.6 years ago by
'260 wrote:

I have assembled a genome from an interleaved FASTQ file using several assemblers (Velvet, ABySS, Minia, SPAdes, etc.) and have a "contigs.fasta" file from each of them. In total I have run over 50 assemblies with different parameters and options across those assemblers, and I have processed each "contigs.fasta" file with QUAST. I know that the genome I am trying to assemble is 200,000 bp long. However, for 95% of my assemblies (i.e. contigs.fasta files from the different assemblers), the "Total length" and "Total length (>= 0 bp)" values that QUAST reports are near 390,000. What is the problem? Does "Total length" in QUAST refer to something different? Why can't I get any length value near the expected 200,000? I have experimented with many combinations of k-mer size, coverage cutoff and expected coverage!

4.6 years ago by
thackl2.8k wrote:

Total length in QUAST does not refer to "something else"; it simply gives you the total number of bases present in your assembly (the sum of the lengths of all sequences).
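To make that definition concrete, here is a minimal Python sketch that sums the sequence lengths in a FASTA file. It is not QUAST's own code; the function names and the toy contigs are made up for illustration. As far as I know, the plain "Total length" in QUAST only counts contigs at or above its minimum contig length (500 bp by default), while "Total length (>= 0 bp)" counts every sequence, so the hypothetical `min_len` argument mimics that threshold.

```python
from io import StringIO


def read_fasta_lengths(handle):
    """Yield the length of each sequence in a FASTA file-like object."""
    length = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def total_length(lengths, min_len=0):
    """Sum the lengths of all sequences that are at least min_len long."""
    return sum(n for n in lengths if n >= min_len)


# Toy assembly with two contigs (12 bp and 3 bp), invented for this example.
toy = StringIO(">contig_1\nACGTACGT\nACGT\n>contig_2\nACG\n")
lengths = list(read_fasta_lengths(toy))

print(total_length(lengths))             # all contigs
print(total_length(lengths, min_len=5))  # only contigs >= 5 bp
```

For a real assembly, replace the `StringIO` object with `open("contigs.fasta")`.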

What kind of sample are you trying to assemble, and how do you know that the total assembly size should be 200 kbp? Is it simulated data?

If not, my guess would be that your sample either also contained "something else", e.g. minor contamination, or that you have a high level of variation (and probably excessive coverage) in your read data.

More information about your actual sample would help a lot.


Yes! It is simulated data (probably generated with Matlab, but I am not sure about that), and all I know about it is that the original length of the genome is 200,000 and the coverage is 50.

written 4.6 years ago by '260

Here are the values I have received running QUAST:

written 4.6 years ago by '260

Okay, I can see why you are in doubt about the 200 kbp :). If the data was simulated, some form of errors/heterogeneity (or maybe repeats) must have been introduced into the data; otherwise the assemblers would not have such a hard time with such a small data set. If you cannot find out exactly how the set was generated, you could run a k-mer analysis to a) estimate the expected genome size and b) determine the level of noise, something along those lines:
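As an illustration only (these are not the commands originally posted with this comment), a k-mer analysis can be sketched in plain Python. The sketch counts canonical k-mers, builds the multiplicity histogram, takes the first point where the histogram starts rising again as the boundary between error k-mers and genomic k-mers, and estimates the genome size as (total k-mers above that boundary) / (depth of the main peak). All function names, the valley heuristic and the simulated toy reads (10 kbp random genome, 40x coverage, 1% substitution errors) are assumptions made for the example. In practice you would use a dedicated k-mer counter rather than pure Python.

```python
import random
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    return seq.translate(COMPLEMENT)[::-1]


def kmer_spectrum(reads, k):
    """Return {multiplicity: number of distinct canonical k-mers}."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Canonical form: a k-mer and its reverse complement count as one.
            counts[min(kmer, revcomp(kmer))] += 1
    return Counter(counts.values())


def find_valley(hist):
    """First multiplicity after which the histogram starts rising again."""
    m, top = 1, max(hist)
    while m < top and hist.get(m + 1, 0) <= hist.get(m, 0):
        m += 1
    return m


def analyse(hist):
    """Return (genome size estimate, fraction of distinct k-mers that look like noise)."""
    valley = find_valley(hist)
    solid = [m for m in hist if m > valley]
    if not solid:
        raise ValueError("no coverage peak found above the error k-mers")
    peak = max(solid, key=lambda m: hist[m])       # main coverage peak
    total_solid = sum(m * hist[m] for m in solid)  # k-mer instances above the valley
    noisy = sum(n for m, n in hist.items() if m <= valley)
    return total_solid / peak, noisy / sum(hist.values())


def simulate_reads(genome, coverage, read_len, error_rate, rng):
    """Toy read simulator with substitution errors (an assumption, for demo only)."""
    reads = []
    for _ in range(coverage * len(genome) // read_len):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                bases[i] = rng.choice([x for x in "ACGT" if x != base])
        read = "".join(bases)
        reads.append(revcomp(read) if rng.random() < 0.5 else read)
    return reads


rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(10_000))
reads = simulate_reads(genome, coverage=40, read_len=100, error_rate=0.01, rng=rng)

size_estimate, noise_fraction = analyse(kmer_spectrum(reads, k=21))
print(f"estimated genome size: {size_estimate:.0f} bp")
print(f"fraction of distinct k-mers at or below the valley: {noise_fraction:.2f}")
```

If the estimate for your reads comes out near 200 kbp, the extra assembled sequence is probably caused by noise or variation; if it comes out near 390 kbp, the data set likely contains more sequence than you were told.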

written 4.6 years ago by thackl2.8k