I was wondering what the reason is behind the --kmers parameter in the unicycler pipeline. (I know it is used in the SPAdes assembly, but I want to better understand what it is needed for.)
The reason I ask is our input fastq files. We have samples which were sequenced twice; the second run was done to increase the sequencing depth. Unfortunately the second run was done on a different machine and with a different read length.
When I try to run unicycler on the merged fastq files, it fails while calculating the --kmers automatically, as it also tries to use a k-mer length of 61nt. The first sequencing run produced fastq files with reads of only 60nt.
Does it make more sense to give the tool a specific list of k-mer sizes, e.g. --kmers 13,25,33,39,45,49,53,57, or is it better to work with the files separately? What does it mean for the analysis if I use k-mers that are not as long as the reads would allow?
thanks
Assa
P.S. I also asked this on the GitHub repo, but I don't think it is still very active.
While you wait for answers, one easy thing to try is to trim the longer dataset down so that all reads have the same length. You will lose some data but the process should go on.
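A minimal sketch of that trimming step, in case it helps. It assumes plain 4-line FASTQ records already read into a list of lines; the function name `trim_fastq_records` is made up for illustration, and in practice a dedicated trimming tool would normally do this.

```python
def trim_fastq_records(lines, max_len):
    """Trim the sequence and quality strings of 4-line FASTQ records to max_len.

    `lines` is a list of strings (header, sequence, '+', quality, repeated).
    Any incomplete trailing record is ignored.
    """
    trimmed = []
    complete = len(lines) - len(lines) % 4
    for i in range(0, complete, 4):
        header, seq, plus, qual = (s.rstrip("\n") for s in lines[i:i + 4])
        # Sequence and quality must be cut at the same position to stay in sync.
        trimmed.extend([header, seq[:max_len], plus, qual[:max_len]])
    return trimmed


# Hypothetical toy record, trimmed from 8 to 5 bases.
example = ["@read1", "ACGTACGT", "+", "IIIIIIII"]
print(trim_fastq_records(example, 5))
```

Reads that are already at or below `max_len` pass through unchanged, so the same call can be applied to the merged file.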
thanks for the suggestion. The run does work with the given kmers.
This is exactly my question. What would be the difference (or what would be better): to trim the fastq files and lose information, or to use fewer k-mers for the calculation of the assemblies?
I would like to understand what these k-mers contribute.
I thought that having different sized reads was preventing you from completing the run, so trimming the data would help remove that barrier.
You don't say what the size difference between the two datasets is, but assuming that the libraries were randomly made (not enriched in any way), there should be no bias from the prep in sequence representation.
Ask your favorite LLM: "what is the contribution of kmers in assembly of DNA sequence". They are mainly used for building the de Bruijn graphs, detecting sequence overlaps and for error correction. True k-mers will appear in the data often; k-mers containing sequencing errors will be less frequent.
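To make the frequency argument concrete, here is a small sketch that counts k-mers in a handful of made-up reads. The reads and the helper name `count_kmers` are illustrative only; this is not how SPAdes or unicycler implement it internally.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        # A read shorter than k yields an empty range, so it contributes nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Three identical toy reads plus one with a single-base error (A -> T at position 5).
reads = ["ACGTACGA", "ACGTACGA", "ACGTACGA", "ACGTTCGA"]
counts = count_kmers(reads, 4)

for kmer, n in counts.most_common():
    print(kmer, n)
```

The k-mers from the correct reads are seen three or four times, while the k-mers that overlap the error are each seen once. Note also that a read shorter than k produces no k-mers at all, which is why the read length limits the usable k.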
sorry for the misunderstanding. After setting the k-mer lengths to be below the length of the shorter reads, unicycler runs smoothly (the libraries are 60nt and 75nt long). I was trying to understand if it makes more sense to trim the longer reads or to set the k-mer lengths. Which of the two is the better approach?
If you are able to get unicycler to run by limiting max k-mer length then that would be the best option since you won't lose the additional data.
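For what it's worth, picking the list for --kmers can be sketched in a few lines. This assumes the rule used in this thread (every k must be below the shortest read length) and that SPAdes expects odd k values; the helper names are hypothetical and not part of unicycler.

```python
def usable_kmers(candidates, shortest_read_len):
    """Keep only odd k values that are strictly below the shortest read length."""
    return sorted(
        k for k in set(candidates)
        if k % 2 == 1 and k < shortest_read_len
    )


def kmers_argument(candidates, shortest_read_len):
    """Format the surviving k values as a comma-separated string for --kmers."""
    return ",".join(str(k) for k in usable_kmers(candidates, shortest_read_len))


# Candidate list from this thread plus the 61 that caused the failure;
# the shortest library here is 60nt.
candidates = [13, 25, 33, 39, 45, 49, 53, 57, 61]
print(kmers_argument(candidates, 60))
```

With the 60nt library the 61 is dropped and the remaining values match the list that ran successfully above.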