Question: Effect of Read Length?
4.6 years ago by
United States
jfontana3170 wrote:

For the past month or so I've been using discoSNP++ with test data sets to see if it will work for my particular needs (it works great!). The test data sets I have been using are 100bp reads. For my future experimental data sets I am wondering whether I really need 100bp reads. It would be easier to piggyback my sequencing on other experiments if I used 50bp reads, but I'm wondering what effect this would have on the program's ability to detect the SNPs I'm looking for. Can you comment on the effects of using 50bp vs 100bp reads (let's assume ~40M reads per set)? If it matters for your answer, I've been using "-b 1 -D 0 -P 1 -k 31 -c 4 -C 2147483647 -d 1".



gatb discosnp • 1.7k views

Added a "gatb" tag.

written 4.6 years ago by lh332k
4.6 years ago by
pierre.peterlongo860 wrote:



Thanks for the question,

The variant prediction phase of discoSnp is based only on k-mers (with k=31 by default). Thus if all k-mers from 100 bp reads also exist with 50 bp reads, the result should be the same.


However, the read coverage with 50 bp reads must be higher than with 100 bp reads to obtain a similar set of k-mers, because a read of length L contains L-k+1 k-mers of length k.

This means that, with k=31, a read of length 100 contains 70 k-mers while a read of length 50 contains only 20 k-mers. Thus, in broad terms, about 3.5 times as many reads are needed with L=50 as with L=100 to obtain the same results.
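As a quick check of the arithmetic, the formula L-k+1 can be evaluated directly. This is only a sketch: the function name `kmers_per_read` is made up for illustration, and the "same number of bases" comparison is an extra assumption not stated in the answer (it supposes you compare runs with equal sequenced bases rather than equal read counts).

```python
def kmers_per_read(read_length: int, k: int) -> int:
    """Number of k-mers in one read: L - k + 1 (0 if the read is shorter than k)."""
    return max(read_length - k + 1, 0)


k = 31
kmers_100 = kmers_per_read(100, k)
kmers_50 = kmers_per_read(50, k)

# With the same number of reads, each 100 bp read yields this many times
# more k-mers than a 50 bp read.
ratio_per_read = kmers_100 / kmers_50

# Assumption: comparing runs with the same number of sequenced bases instead,
# i.e. twice as many 50 bp reads as 100 bp reads.
ratio_per_base = (kmers_100 / 100) / (kmers_50 / 50)

print(f"k-mers per 100 bp read: {kmers_100}")
print(f"k-mers per 50 bp read:  {kmers_50}")
print(f"ratio at equal read count: {ratio_per_read:.2f}")
print(f"ratio at equal base count: {ratio_per_base:.2f}")
```

With a fixed number of reads per set (for example ~40M, as in the question), the per-read ratio is the relevant one.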


Best, Pierre

4.6 years ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:

There are many portions of the genome that are unalignable with 50bp reads. I imagine these are more problematic when trying to do reference-free assembly.

If you really want the answer, though, you should run a test. Take one of your current data sets, chop each read down to 50 bp, then run the algorithm again and compare to the original results.
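The chopping step could be sketched as follows. This assumes uncompressed FASTQ files with exactly four lines per record (no wrapped sequences); the function names and file names are hypothetical, not part of discoSNP++ or any other tool.

```python
from typing import Iterable, Iterator


def trim_fastq_records(lines: Iterable[str], length: int = 50) -> Iterator[str]:
    """Yield FASTQ lines with sequence and quality strings cut to `length`.

    Assumes strict 4-line records: header, sequence, '+', quality.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # Within each record, line index 1 is the sequence and 3 is the quality.
        if i % 4 in (1, 3):
            line = line[:length]
        yield line + "\n"


def trim_fastq_file(in_path: str, out_path: str, length: int = 50) -> None:
    """Write a copy of `in_path` with every read trimmed to `length` bases."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(trim_fastq_records(src, length))


# Small in-memory demonstration with one made-up 100 bp read.
example = ["@read1\n", "ACGT" * 25 + "\n", "+\n", "I" * 100 + "\n"]
trimmed = list(trim_fastq_records(example, 50))
print("".join(trimmed), end="")
```

For real data, something like `trim_fastq_file("sample_100bp.fastq", "sample_50bp.fastq")` (hypothetical file names) would produce the 50 bp set. Trimming keeps the number of reads unchanged, so it mimics a 50 bp run with the same read count rather than the same base coverage.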


Thank you for the suggestion, Chris. That was a great idea. When re-run with all the same parameters, the 50bp data sets picked up only 1/10th of the SNPs that the 100bp data sets did. They also missed 25% of my artificially introduced SNPs. I will play a bit with the parameters and see what happens, but that really helped a lot. Thanks again.

written 4.6 years ago by jfontana317