DiscoSNP parameters for complex genomes with reads starting only at specific restriction sites
6.7 years ago
Hans ▴ 130

Hello

I am trying to find SNPs with DiscoSNP, without a reference genome, in wheat, which has a highly complex genome with many repeats. The reads were obtained from a reduced representation library in which the DNA was cut with a restriction enzyme and the adaptors only catch the sticky ends of the restriction site. Therefore, no overlap is expected between reads. The DNA was obtained from plants that were selfed several times, so low heterozygosity is expected. In a run with b=1 or b=0 and all other parameters at their defaults, about 35% of the genotype-by-SNP data points are heterozygous. It seems to me that these heterozygous data points are caused by paralogs and are not true SNPs. The same results are obtained with the TASSEL pipeline. Please recommend parameters that can reduce the number of false heterozygous calls and that work well with non-overlapping reads. Thank you

Hanan

discosnp • 1.9k views

Hello Pierre

Thank you for the response

Sorry, I was not clear about what I meant by no overlap. The situation is like this:

cut site 1                       cut site 2
>>>>>>>>>>>                      >>>>>>>>>>>
>>>>>>>>>>>                      >>>>>>>>>>>>
>>>>>>>>>>>                      >>>>>>>>>>>>


and not partly overlapping like this

>>>>>>>>>>>>         >>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    >>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>   >>>>>>>>>>>>>


So for every cut site there are many reads that start at the same position, but the sequence covered is only one read length long.

Hanan


Hi,

Sorry for the misunderstanding.

My answer remains the same about differentiating heterozygous from homozygous variants: it should be done only a posteriori, using the read coverage (or the VCF genotype information).

If the dataset is small and you're interested in obtaining not just some very confident variants (b 0 or b 1) but all variants, including less confident ones, you may use b 2. In this case most of the predictions are false positives, but with downstream filters on the coverage this approach may be worthwhile. We are currently testing this on small amplicon datasets.

However, remember that in complex genomes, b 2 may take a lot of time because of the large number of possible paths to traverse in the graph.

Best regards,

Pierre


Hi

I am currently trying b 2 and it takes forever (more than 4 days), but I have the time to wait. What filters do you suggest?


Hi,

If I were you, I'd make two tries: one with b 1 and, if you're expecting more variants, a second one with b 2. In both cases, in your context, you may apply post-treatment filters. I'd leave the other parameters at their defaults.
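
As an illustration of what such a post-treatment filter could look like, here is a minimal Python sketch. It is not part of DiscoSNP: it only assumes the output is a standard tab-separated VCF with a `GT` entry in the FORMAT column, and the function names and the `max_het_frac` cutoff are hypothetical. The idea, taken from the original question, is that in lines selfed several times a variant that is heterozygous in many samples is more likely a paralog than a true SNP.

```python
def het_fraction(vcf_line):
    """Fraction of called samples that are heterozygous on one VCF record."""
    fields = vcf_line.rstrip("\n").split("\t")
    gt_idx = fields[8].split(":").index("GT")  # position of GT in FORMAT
    het = called = 0
    for sample in fields[9:]:
        gt = sample.split(":")[gt_idx].replace("|", "/")
        alleles = gt.split("/")
        if "." in alleles:
            continue  # missing genotype, not counted
        called += 1
        if len(set(alleles)) > 1:
            het += 1
    return het / called if called else 0.0


def filter_vcf_lines(lines, max_het_frac=0.1):
    """Yield header lines and records whose het fraction is low enough.

    max_het_frac is an arbitrary example value, to be tuned on real data.
    """
    for line in lines:
        if line.startswith("#") or het_fraction(line) <= max_het_frac:
            yield line
```

The cutoff has to be chosen by looking at the distribution of heterozygous fractions in the actual dataset; 0.1 is only a placeholder.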

Pierre

6.7 years ago

Hi,

With no overlapping reads, it is difficult, if not impossible, to predict SNPs precisely. Each sequencing error has exactly the same signature as a real SNP and, on the other hand, most SNPs won't be sufficiently covered to be detected.

By the way, differentiating heterozygous from homozygous variants should be done only a posteriori, using the read coverage (or the VCF genotype information). But in your case, with a coverage of 0 or 1, I would not be very confident in this piece of information.
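
To make the coverage-based idea concrete, here is a small Python sketch of an a posteriori genotype call from the read counts of the two alleles in one sample. This is an assumption-laden illustration, not what DiscoSNP does internally: the function name and both thresholds (`min_depth`, `min_minor_frac`) are hypothetical and should be adapted to the data.

```python
def call_genotype(cov_ref, cov_alt, min_depth=10, min_minor_frac=0.2):
    """Return '0/0', '0/1', '1/1' or './.' from per-allele read counts."""
    depth = cov_ref + cov_alt
    if depth < min_depth:
        return "./."  # too few reads to trust any call
    minor_frac = min(cov_ref, cov_alt) / depth
    if minor_frac >= min_minor_frac:
        return "0/1"  # both alleles well supported
    return "0/0" if cov_ref > cov_alt else "1/1"
```

With a coverage of 0 or 1, every call falls below any reasonable `min_depth` and comes back as missing, which is another way of seeing why the heterozygous/homozygous distinction is unreliable here.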

If the discoSnp running time is not too high, you may try b 2 (with the -g option, to avoid reconstructing the graph).

Something else: by default, the automatic threshold detection uses c=3 in case of very low coverage. In your case (you may verify this in the logs), if 3 is used, it means that all k-mers seen only 1 or 2 times are removed. You may force the threshold yourself, for instance with -c 1 or -c 2.
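
The effect of this threshold can be illustrated with a toy Python sketch. It is a simplification and not DiscoSNP's actual k-mer counter: reverse complements are ignored and the helper names `count_kmers` and `solid_kmers` are made up for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, c):
    """Keep only the k-mers seen at least c times."""
    return {kmer for kmer, n in counts.items() if n >= c}


reads = ["ACGTA", "ACGTA", "ACGTT"]
counts = count_kmers(reads, k=4)   # ACGT: 3, CGTA: 2, CGTT: 1
print(sorted(solid_kmers(counts, c=3)))  # only ACGT survives
print(sorted(solid_kmers(counts, c=1)))  # every k-mer is kept
```

With c=3 the k-mers seen once or twice disappear, including any that carry a real but poorly covered variant; with c=1 they are all kept, sequencing errors included.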

Pierre