Question: Effect of Read Length?
20 months ago by
United States
jfontana3170 wrote:

For the past month or so I've been using discoSNP++ with test data sets to see if it will work for my particular needs. (It works great!!) The test data sets I have been using are 100bp reads. For my future experimental data sets I am wondering whether I really need 100bp reads. It would be easier to piggyback my sequencing on other experiments if I used 50bp reads, but I'm wondering what effect this would have on the program's ability to detect the SNPs I'm looking for. Can you comment on the effects of using 50bp vs 100bp reads (let's assume ~40M reads per set)? If it matters for your answer, I've been using "-b 1 -D 0 -P 1 -k 31 -c 4 -C 2147483647 -d 1".



gatb discosnp • 663 views
modified 20 months ago by pierre.peterlongo600 • written 20 months ago by jfontana3170

Added a "gatb" tag.

written 20 months ago by lh329k
20 months ago by
pierre.peterlongo600 wrote:



Thanks for the question,

The variant prediction phase of discoSnp is based only on k-mers (with k=31 by default). Thus if all k-mers from 100 bp reads also exist with 50 bp reads, the result should be the same.


However, the read coverage with 50 bp reads must be higher than with 100 bp reads to obtain a similar set of k-mers, for the following reason: a read of length L contains L-k+1 k-mers of length k.

This means that, with k=31, a read of length 100 contains 70 k-mers (100-31+1) while a read of length 50 contains only 20 k-mers (50-31+1). Thus, in broad terms, roughly 3.5 times as many reads are needed with L=50 as with L=100 to obtain a similar set of k-mers, and hence similar results.
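For anyone who wants to redo this arithmetic for other read lengths or values of k, here is a minimal Python sketch of the L-k+1 count. The function name `kmers_per_read` is just an illustration and is not part of discoSnp.

```python
def kmers_per_read(read_length: int, k: int) -> int:
    """Number of k-mers in one read of the given length: L - k + 1.

    A read shorter than k contains no k-mer at all, hence the max().
    """
    return max(read_length - k + 1, 0)


k = 31  # discoSnp default k, as stated in the answer above

n100 = kmers_per_read(100, k)
n50 = kmers_per_read(50, k)

# How many 50 bp reads yield as many k-mers as one 100 bp read.
ratio = n100 / n50

print(n100, n50, ratio)  # 70 20 3.5
```

Note that this only counts k-mers per read. It says nothing about sequencing errors or repeats, so treat the ratio as a rough guide rather than a guarantee of identical results.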


Best, Pierre

20 months ago by
Chris Miller18k
Washington University in St. Louis, MO
Chris Miller18k wrote:

There are many portions of the genome that are unalignable with 50bp reads. I imagine these are more problematic when trying to do reference-free assembly.

If you really want the answer, though, you should run a test. Take one of your current data sets, chop each read down to 50 bp, then run the algorithm again and compare to the original results.
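The chopping step could be sketched in Python along these lines. This is only an illustration, assuming plain four-line FASTQ records (no wrapped sequence lines, no blank lines between records); the function name `trim_fastq` and the file names in the usage note are made up.

```python
import io


def trim_fastq(src, dst, length=50):
    """Copy FASTQ records from src to dst, keeping the first `length` bases.

    Assumes each record is exactly four lines (header, sequence, '+', qualities).
    The quality string is trimmed to the same length as the sequence.
    Returns the number of records written.
    """
    n = 0
    while True:
        header = src.readline().rstrip("\n")
        if not header:  # end of input (a blank line also stops the loop)
            break
        seq = src.readline().rstrip("\n")
        plus = src.readline().rstrip("\n")
        qual = src.readline().rstrip("\n")
        dst.write(f"{header}\n{seq[:length]}\n{plus}\n{qual[:length]}\n")
        n += 1
    return n


# Tiny in-memory demo with one synthetic 100 bp read.
demo_in = io.StringIO("@read1\n" + "ACGT" * 25 + "\n+\n" + "I" * 100 + "\n")
demo_out = io.StringIO()
written = trim_fastq(demo_in, demo_out, length=50)
print(written, len(demo_out.getvalue().splitlines()[1]))  # 1 50
```

On real data you would pass open file handles instead, e.g. `trim_fastq(open("reads_100bp.fastq"), open("reads_50bp.fastq", "w"))` (hypothetical file names), then run the tool on both files with identical parameters and compare the predicted SNPs.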


Thank you for the suggestion, Chris. That was a great idea. When re-run with all the same parameters, the 50bp data sets picked up only 1/10th of the SNPs that the 100bp data sets did. They also missed 25% of my artificially introduced SNPs. I will play a bit with the parameters and see what happens, but that really helped a lot. Thanks again.

written 20 months ago by jfontana3170