Question

Running Unicycler with a merged fastq file

5 months ago

Assa Yeroslaviz ★ 1.9k

I was wondering what the reason is behind the --kmers parameter in the Unicycler pipeline. (I know it is used in the SPAdes assembly, but I want to better understand what it is needed for.)

The reason I ask is our input fastq files. We have samples which were sequenced twice; the second run was done to increase the sequencing depth. Unfortunately, the second run was done on a different machine and with a different read length.

When I try to run Unicycler on the merged fastq files, it fails while calculating the --kmers values automatically, because it also tries to use a k-mer length of 61 nt. The first sequencing run produced fastq files with reads of only 60 nt.

Does it make more sense to give the tool a specific list of k-mer sizes, e.g. --kmers 13,25,33,39,45,49,53,57, or is it better to work with the files separately? What does it mean for the analysis if I use k-mers that are not as long as the reads would allow?

thanks

Assa

P.S. I also asked this on the GitHub repo, but I don't think it is still very active there.

unicycler bacterial-genome assembly • 905 views

updated 5 months ago by GenoMax 154k • written 5 months ago by Assa Yeroslaviz ★ 1.9k

Unfortunately the second run was done on a different machine and with a different read length.

While you wait for answers, one easy thing to try is to trim the longer dataset down so that all reads have the same length. You will lose some data, but the process should go on.
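As a rough illustration of the trimming idea, here is a minimal Python sketch that cuts every read (and its quality string) down to a fixed length. It assumes plain, uncompressed 4-line FASTQ records; the function names and the example read are hypothetical, and in practice a dedicated trimming tool would normally be used instead.

```python
from typing import Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from plain 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip stray blank lines
        seq = next(it, "").rstrip("\n")
        next(it, "")  # the '+' separator line is not needed
        qual = next(it, "").rstrip("\n")
        yield header, seq, qual


def trim_records(records: Iterable[FastqRecord], max_len: int) -> Iterator[FastqRecord]:
    """Cut sequence and quality string to at most max_len bases."""
    for header, seq, qual in records:
        yield header, seq[:max_len], qual[:max_len]


# Hypothetical in-memory example: one 10 nt read trimmed to 6 nt.
example = ["@read1\n", "ACGTACGTAC\n", "+\n", "IIIIIIIIII\n"]
trimmed = list(trim_records(parse_fastq(example), max_len=6))
for header, seq, qual in trimmed:
    print(f"{header}\n{seq}\n+\n{qual}")
```

For the libraries in this thread, max_len would be the length of the shorter reads; reads already at or below that length pass through unchanged.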

5 months ago by GenoMax 154k

Thanks for the suggestion. The run does work with the given k-mers.

This is exactly my question. What would be the difference (or which would be better): trimming the fastq files and losing information, or using fewer k-mer sizes for the calculation of the assemblies?

I would like to understand what these k-mers contribute.

5 months ago by Assa Yeroslaviz ★ 1.9k

What would be the difference (or which would be better): trimming the fastq files and losing information, or using fewer k-mer sizes for the calculation of the assemblies?

I thought that having reads of different sizes was preventing you from completing the run, so trimming the data would help remove that barrier.

You don't say what the size difference between the two datasets is, but assuming that the libraries were randomly made (not enriched in any way), the prep should not introduce any bias in sequence representation.

I would like to understand what these k-mers contribute.

Ask your favorite LLM: "what is the contribution of kmers in assembly of DNA sequence". They are mainly used for building the de Bruijn graphs, detecting sequence overlaps, and error correction. True k-mers will appear in the data often; k-mers that contain sequencing errors will be less frequent.
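To make the frequency argument concrete, here is a small Python sketch that counts k-mers in a handful of made-up reads and lists the de Bruijn edges they imply (each k-mer links its first k-1 bases to its last k-1 bases). The reads and function names are hypothetical, and this is only a toy illustration, not how SPAdes or Unicycler implement it.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer in every read; reads shorter than k add nothing."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def de_bruijn_edges(kmers: Iterable[str]) -> List[Tuple[str, str]]:
    """Each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix."""
    return sorted({(kmer[:-1], kmer[1:]) for kmer in kmers})


# Made-up data: five identical reads plus one read with a single base error.
reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"]
counts = count_kmers(reads, k=4)

# k-mers seen only once are the ones introduced by the erroneous read.
rare = sorted(kmer for kmer, n in counts.items() if n == 1)
print("frequent:", counts.most_common(1))
print("rare:", rare)
print("edges:", de_bruijn_edges(counts))
```

Note that a read shorter than k contributes no k-mers at all, which is why a k of 61 cannot use any of the 60 nt reads.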

5 months ago by GenoMax 154k

Sorry for the misunderstanding. After setting the k-mer sizes to be below the length of the shorter reads, Unicycler runs smoothly (the libraries are 60 nt and 75 nt long). I was trying to understand whether it makes more sense to trim the longer reads or to limit the k-mer length. Which of the two is the better approach?
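For reference, the way the k-mer list was restricted can be sketched in Python as follows. This is a minimal sketch, not Unicycler's own selection logic: the helper names and the stand-in reads are hypothetical, the candidate list is the one from the question plus the failing 61, and the filter keeps only odd values strictly below the shortest read length (SPAdes, as far as I know, expects odd k values).

```python
from typing import Iterable, List


def shortest_read_length(sequences: Iterable[str]) -> int:
    """Length of the shortest read in the merged data."""
    return min(len(seq) for seq in sequences)


def usable_kmers(candidates: Iterable[int], min_read_len: int) -> List[int]:
    """Keep odd k values that are strictly shorter than the shortest read."""
    return [k for k in sorted(set(candidates)) if k % 2 == 1 and k < min_read_len]


# Hypothetical stand-ins for the two libraries in this thread (60 nt and 75 nt reads).
reads = ["A" * 60, "C" * 75]
candidates = [13, 25, 33, 39, 45, 49, 53, 57, 61]

min_len = shortest_read_length(reads)
kmers_arg = ",".join(str(k) for k in usable_kmers(candidates, min_len))
print(f"--kmers {kmers_arg}")
```

With these inputs the 61 is dropped and the remaining values match the list given in the question.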

5 months ago by Assa Yeroslaviz ★ 1.9k

If you are able to get Unicycler to run by limiting the maximum k-mer length, then that would be the best option, since you won't lose the additional data.

5 months ago by GenoMax 154k