Hello everyone,
I have paired-end, 100 bp RNA-seq data from lumpy skin disease virus. I checked the quality with FastQC/Falco: the per-base sequence quality looks good (Phred score > 30), with one failure (Per base sequence content) and three warnings (Per sequence GC content, Sequence Length Distribution, and Overrepresented sequences). I aligned the reads against the cow genome so that I could separate out the unaligned reads and use them for assembly and mapping against the lumpy skin disease virus genome.
I then assembled the unaligned reads with SPAdes (on Galaxy). With the default k-mers it produces around 14425 contigs/scaffolds and shows the following warnings:
=== Error correction and assembling warnings:
- 0:00:40.403 34M / 1326M WARN General (kmer_coverage_model.cpp : 219) Too many erroneous kmers, the estimates might be unreliable
- 0:01:07.324 34M / 1326M WARN General (kmer_coverage_model.cpp : 328) Valley value was estimated improperly, reset to 2
- 0:01:07.326 34M / 1326M WARN General (kmer_coverage_model.cpp : 367) Failed to determine erroneous kmer threshold. Threshold set to: 2
I then supplied user-defined k-mers (33,55,77,85,99), chosen from the literature on lumpy skin disease virus, and got 11509 scaffolds and the following warnings:
======= SPAdes pipeline finished WITH WARNINGS!
=== Error correction and assembling warnings:
- 0:00:50.600 33M / 2055M WARN General (kmer_coverage_model.cpp : 219) Too many erroneous kmers, the estimates might be unreliable
- 0:01:09.683 34M / 2055M WARN General (kmer_coverage_model.cpp : 328) Valley value was estimated improperly, reset to 2
- 0:01:09.684 34M / 2055M WARN General (kmer_coverage_model.cpp : 367) Failed to determine erroneous kmer threshold. Threshold set to: 2
- 0:00:36.714 43M / 1358M WARN General (kmer_coverage_model.cpp : 328) Valley value was estimated improperly, reset to 3
- 0:00:36.715 43M / 1358M WARN General (kmer_coverage_model.cpp : 367) Failed to determine erroneous kmer threshold. Threshold set to: 3
- 0:00:46.897 48M / 1358M WARN General (simplification.cpp : 517) The determined erroneous connection coverage threshold may be determined improperly
- 0:00:30.535 46M / 1215M WARN General (kmer_coverage_model.cpp : 328) Valley value was estimated improperly, reset to 8
- 0:00:30.536 46M / 1215M WARN General (kmer_coverage_model.cpp : 367) Failed to determine erroneous kmer threshold. Threshold set to: 8
I checked the scaffold stats: the longest scaffold is 432252 bp, with a coverage of 0.42.
I also tried running SPAdes on the raw FASTQ files, but it doesn't make much difference to the assembly or the mapping.
I have tried to find the reason for these warnings, and most sources point to poor DNA quality or read quality. In my case the per-base sequence quality looks reasonable, so I am not sure how I can improve the assembly. Any suggestions will be highly appreciated.
You may have too much data going into the assembly. This virus genome appears to be only ~150 kilobases (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_031470205.1/). You may need to downsample the data (start with ~20-30x coverage) and see if that helps.
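To make the downsampling suggestion concrete, here is a rough Python sketch of the arithmetic and of a seeded paired-end subsample. It assumes a ~150 kb genome, 100 bp paired-end reads, uncompressed four-line FASTQ records, and that most of the retained reads are viral. The function names and file paths are hypothetical, and a dedicated subsampling tool would normally do this job.

```python
import random


def pairs_for_coverage(genome_size, target_coverage, read_length, paired=True):
    """Estimate how many reads (or read pairs) give the target coverage.

    coverage ~= (number of fragments * bases per fragment) / genome size
    """
    bases_needed = genome_size * target_coverage
    bases_per_fragment = read_length * (2 if paired else 1)
    return round(bases_needed / bases_per_fragment)


def sample_pair_indices(total_pairs, n_keep, seed=42):
    """Choose which 0-based pair indices to keep.

    Using the same index set for R1 and R2 keeps the mates in sync.
    """
    rng = random.Random(seed)
    n_keep = min(n_keep, total_pairs)
    return set(rng.sample(range(total_pairs), n_keep))


def subsample_fastq(in_path, out_path, keep):
    """Copy only the records whose index is in `keep`.

    Assumes an uncompressed FASTQ with exactly 4 lines per record.
    """
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line_number, line in enumerate(fin):
            if line_number // 4 in keep:
                fout.write(line)


if __name__ == "__main__":
    # ~150 kb genome, 25x target, 100 bp paired-end reads (assumed values).
    print(pairs_for_coverage(150_000, 25, 100))
```

With those assumed numbers, 25x works out to about 18750 read pairs. Calling `subsample_fastq` once per mate file with the same `keep` set gives a matched R1/R2 subset. Because this is RNA-seq and the unaligned pool may still contain non-viral reads, the real viral coverage will be lower and uneven, so treat the figure as a starting point.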
Otherwise it may be easier to simply align the reads you selected to the reference(s) available from NCBI and check coverage.
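As a minimal sketch of what "check coverage" could look like after aligning to the reference: the function below summarizes a per-base depth table into mean depth and breadth of coverage. It assumes tab-separated rows of reference name, position, and depth for a single reference sequence (the three-column layout that `samtools depth` writes for one BAM file). The function name and the example rows are made up for illustration.

```python
def summarize_depth(lines, genome_length, min_depth=1):
    """Summarize 'name<TAB>position<TAB>depth' rows for one reference.

    Positions missing from the input are treated as depth 0, which is why
    the genome length is passed in instead of counting rows.
    """
    total_depth = 0
    covered_positions = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _name, _position, depth = line.split("\t")[:3]
        depth = int(depth)
        total_depth += depth
        if depth >= min_depth:
            covered_positions += 1
    return {
        "mean_depth": total_depth / genome_length,
        "breadth": covered_positions / genome_length,
    }


if __name__ == "__main__":
    # Toy rows for a 4 bp "genome"; position 4 is absent, so it counts as 0.
    rows = ["ref\t1\t10", "ref\t2\t20", "ref\t3\t0"]
    print(summarize_depth(rows, genome_length=4))
```

A high breadth with a reasonably even depth across the ~150 kb reference would suggest the data support the viral genome even where the de novo assembly is fragmented.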