Question

DiscoSnp++, question about how SNPs called

2

Entering edit mode

10.3 years ago

Adrian Pelin ★ 2.7k

Hello,

Thank you for the previous answers to my questions here (https://www.biostars.org/p/135443/#136088). I have a couple of follow up questions.

Once SNPs are discovered, predictions are listed in the fasta files, and they are typically 60bp, with the SNP located in the middle of the prediction. This is true whether I use 100bp reads or 50bp reads. Now, coherence is established by splitting the 60bp prediction into k-mers (k=31) and assesing their k-mer coverage by all samples allowing one mismatch. And if at least one sample has a k-coverage less than 4, the SNP is considered uncoherent.

I was wondering if the length of the prediction can have an effect on coherency? Should I try to lower the k-mer size for shorter reads?

I was also wondering what is the effect of various coverage in different samples? Should we aim to subsample to the same coverage level?

Lastly, I tried genotype discovery with a -P of 6, are the genotypes the ones reported in the fasta file? In some regions I expect more than 2 genotypes.

Thank you,

Adrian

NGS PopGen DiscoSNP VCF • 2.3k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Adrian Pelin ★ 2.7k

Ram · Accepted Answer · 2015-03-30

1

Entering edit mode

10.3 years ago

pierre.peterlongo ▴ 900

Hi Adrian.

Without options -t or -T the length of the prediction is 2k-1 (should be 61 with k=31). With bigger k, the prediction is bigger. However, each prediction (without contigs/unitigs extensions) contains exactly k k-mers. Thus the number of tested positions for the coherency remains the same with big or small k-mers. By the way, longer kmers may be more difficult to exist in the reads.

I invite you to read the NAR paper in which the effect of kmer lengh is analysed on a set of human reads.

Currently the coverage threshold is applied per read set. However it is not possible yet to fix a coverage per read set. This improvement is necessary and we are currently working on this feature. If your coverages are high and highly dissimilar, it may be a good idea to subsample larger read sets.

Your last question is a bit more tricky. The -P behavior depends on the complexity of the combinations of SNPs. Clearly with -b 0 or -b 1, SNPs ---A----C---- + ---A----G---- + ---T----C---- + ---T----G---- for instance are not detected. In this case, with -b 2 variants are independently detected: (-A----C---/-A----G---), (-T----C---/-T----G---), (---A----C-,---T----C-), and (---A----G-,---T----G-) are detected --> the 2 first detect the C/G SNP with the two possible contexts (A and T) and the two last detects the variant A/T either with context C or G.

Another possibility is ---A----C---- + ---A----G---- + ---T----C---- (without ---A----G-,---T----G-). In this case b 0 detects nothing, while b 1 or b2 detect (-A----C---/-A----G---), and (---A----C-,---T----C-).

I hope this helps.

Best, Pierre

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by pierre.peterlongo ▴ 900

1

Entering edit mode

Thank you for the response! A bit hard to visualize all of this so I inspected the data. I notice right away that the prediction length is 2k-1 under default parameters, and all predictions are the same length. However, when -P or -b are changed, we see that not all predictions have the same length, and we do not necessarily have 2k-1 as a length of prediction. With -P 6 I see the minimum prediction is 61 and maximum is 149. With -P 6 -b 1, predictions range from 48 to 186bp. I have not yet started playing with -T or -t. I am running -b 2 now, but it seems to take a while, probably expected.

My dataset consists of 8 populations sequenced at 100bp PE. However, I feed the reads to DiscoSNP as a single file and it treats them all as single end, should I split them up and feed them as PE?

Since the coverage of the genome of interest are variable, I mapped the reads to the genome (bwa aln -n 0.1), extracted them, and subsampled to the lowest coverage, 27x, now they all have even coverage which is easier to work with. I had to map to the genome also because aside from my genome of interest (a microorganism), there are also other genome in those populations, contaminants. Let me know if this was not a good idea.

I started by analyzing variants of rank=1.0 under default parameters, i.e. simply running discoSNP with the reads as argument. I found 3 rank=1.0 variants. What I noticed, is that for one of these Variants, the reference sequence in the assembled genome is haplotype "T-C-C". The predicted high and low are "A-T-T" and "A-C-T" respectively. What is odd is the fact that the reference haplotype did not get assembled into a prediction, I checked all predictions. After all, this reference haplotype is the predominant form and has reads supporting it in all populations. I then went back to the other parameter predictions, and found that the reference haplotype was present when using -P 6 -b 1.

I am also really interested in finding homozygous SNPs between populations, such that you only have allele X in some populations and allele Y in others. The allele X and Y should be homozygous, that is, no other alleles should be present in those populations. Any advice on how to go about finding such markers?

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Adrian Pelin ★ 2.7k

0

Entering edit mode

Hi Adrian, sorry for the delay.

If your dataset is composed of 8 (or 2*8) read files, the best is to feed disco with these separated files (take care to use option -m and to indicate paired files consecutively in the command line). You'll obtain a much richer result (coverage per read file, rank w.r.t ability of each variant to discriminate read sets).

I think your mapping approach is a nice clever idea. You may reduce c to -c 2 or -c 3 with a 27x coverage.

I don't understand your point about the T-C-C haplotype. In particular I don't understand the A-T-T versus A-C-T haplotypes. Would you mind posting or sending me (pierre.peterlongo@inria.fr) these 3 variants ?

The homozygous variants (with distinct allele in at least 2 read sets) are those having a rank equal or close to one. By feeding discoSnp++ with you 8 individuals, you'll have access to this piece of information.

Pierre

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by pierre.peterlongo ▴ 900