Which Illumina data to generate for constructing benchmark SNP / SV genotypes
8.2 years ago
William ★ 5.3k

What is currently the highest quality data you can reliably and with reasonable effort / cost generate on the Illumina platform? The data will be aligned versus a reference.

This is for a single diploid individual with a sub-1 Gb genome.

The purpose is to construct genome-wide benchmark SNP / SV genotypes for that individual, and later to use those benchmark genotypes for optimizing other low-cost / high-throughput genotyping methods. See also: http://www.nature.com/nbt/journal/v32/n3/full/nbt.2835.html

Two important factors, I guess, are maximizing the read length to 250 bp and generating above 40× coverage. 250 bp would mean sequencing on the MiSeq, as no other Illumina platform supports this.
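As a back-of-the-envelope check (my own sketch, not from the thread; the 1 Gb genome size is taken as the question's upper bound), the number of read pairs needed for a given depth follows from the Lander-Waterman relation C = N·L / G:

```python
import math

def coverage(num_read_pairs, read_len, genome_size):
    """Mean depth C = N * L / G, counting both mates of each pair."""
    return num_read_pairs * 2 * read_len / genome_size

# How many 2x250 bp read pairs for 40x over a 1 Gb genome?
target_depth = 40
genome_size = 1_000_000_000   # 1 Gb upper bound from the question
read_len = 250
pairs_needed = target_depth * genome_size / (2 * read_len)
print(f"{pairs_needed:.2e} read pairs")  # 8.00e+07

# Poisson expectation: fraction of bases left uncovered at depth C
print(math.exp(-target_depth))
```

At 40× mean depth the Poisson-expected uncovered fraction is vanishingly small, so coverage gaps in practice come from mappability and bias, not raw depth.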

What I am less sure about is the library type.

Standard paired-end is the easiest and cheapest to generate, but is (very) limited for SV detection.

Paired-end with overlapping reads offers longer synthetic reads, which are more useful for small-SV detection. But can stitching the reads together introduce noise for SNP calling?
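To make the stitching concern concrete, here is a minimal sketch (my own illustration, not any specific tool's algorithm) of overlap-based read merging; the mismatch tolerance inside the overlap is exactly where an arbitrary base choice could leak noise into SNP calls:

```python
def stitch(r1, r2_rc, min_overlap=10, max_mismatch=2):
    """Merge read 1 with the reverse-complemented read 2 by their
    longest suffix/prefix overlap. Mismatches tolerated inside the
    overlap are where stitching can inject errors into SNP calls,
    since one of the two disagreeing bases must be picked."""
    for olen in range(min(len(r1), len(r2_rc)), min_overlap - 1, -1):
        mism = sum(a != b for a, b in zip(r1[-olen:], r2_rc[:olen]))
        if mism <= max_mismatch:
            return r1 + r2_rc[olen:]
    return None  # no acceptable overlap found

print(stitch("ACGTACGTACGTAA", "ACGTACGTAATTGG"))
```

Real mergers (FLASH, PEAR, and similar) additionally use base qualities to resolve overlap disagreements, which is why in practice the added noise is small, as the answer below notes.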

Mate-pair library prep is more difficult and expensive; the prep even fails often in many hands?

But mate pair offers somewhat (how much?) increased SV detection. Is it best to choose one specific mate-pair insert size: 3 kbp, 5 kbp, or 10 kbp? Or a mix of these insert sizes?
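For intuition on why insert size matters (a hypothetical sketch of mine, not a real caller): discordant-pair SV detection infers a deletion from pairs that map further apart than the library insert, and the event is only callable if the excess clears the library's insert-size noise. Larger inserts therefore reach larger events, which is the argument for mixing sizes:

```python
def implied_deletion(mate_map_span, insert_mean, insert_sd, z=4):
    """A pair mapping further apart than expected suggests a deletion
    of roughly (span - mean). Call it only if the excess exceeds
    z standard deviations of the library insert-size distribution."""
    excess = mate_map_span - insert_mean
    return excess if excess > z * insert_sd else 0

# Hypothetical 3 kbp mate-pair library with 300 bp insert-size SD:
print(implied_deletion(8500, 3000, 300))  # 5.5 kbp deletion called
print(implied_deletion(3500, 3000, 300))  # within noise: not called
```

The same logic shows the trade-off: a 10 kbp library (with its proportionally larger SD) is blind to small events that a short-insert PE library resolves, and vice versa.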

The big question, I think, is whether PE stitching or MP libraries are worth the cost / effort versus standard (higher-coverage) PE for creating SNP / SV benchmark genotypes.

Or should I just go with another method / platform (PacBio / Nanopore, BioNano?) for creating SV benchmark genotypes?

illumina benchmarking SNP SV
8.2 years ago
Fabio Marroni ★ 3.0k

Hi!

> Two important factors, I guess, are maximizing the read length to 250 bp and generating above 40× coverage. 250 bp would mean sequencing on the MiSeq, as no other Illumina platform supports this.

This would be great. See if you can afford it, and check how much it improves your mapping (this is also a function of genome complexity). In my opinion, I would trade down to pay much less and get 2×125 instead of 2×250 reads.

> Standard paired-end is the easiest and cheapest to generate, but is (very) limited for SV detection.

Actually, they are not so bad for variant detection.

> But can stitching the reads together introduce noise for SNP calling?

Not that I know.

> Mate-pair library prep is more difficult and expensive; the prep even fails often in many hands?
>
> But mate pair offers somewhat (how much?) increased SV detection. Is it best to choose one specific mate-pair insert size: 3 kbp, 5 kbp, or 10 kbp? Or a mix of these insert sizes?

Are you going to use a de novo assembly approach? Otherwise, in my opinion, mate pairs will not be so helpful; they are not to be preferred over paired-end. If you want, you can use them to complement the data, but it will cost a lot of effort and bring only a moderate improvement.

To summarize: if you have to find a set of reliable SNPs/SVs (and you do not necessarily have to find ALL the variants), then a good experiment using only PE is the best option. You could prepare at least a couple of libraries to increase complexity, and you're set. Also, think about the detection methods: many tools for SV detection work fairly well with PE (with a relatively large insert size, so NOT overlapping) and not so well with MP. If you have a de novo assembly step, you might need mate pairs.
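The "reliable rather than complete" point above can be sketched in a few lines (my own illustration; the call sets and variant keys are hypothetical): for a benchmark you want precision over recall, so keep only variants supported by multiple independent libraries or callers.

```python
from collections import Counter

def reliable_calls(callsets, min_support=2):
    """Keep variants (chrom, pos, alt) supported by at least
    min_support independent call sets -- precision over recall,
    which is what a benchmark set needs."""
    counts = Counter(v for cs in callsets for v in set(cs))
    return sorted(v for v, n in counts.items() if n >= min_support)

lib1 = [("chr1", 100, "A>G"), ("chr1", 500, "DEL:2kb")]
lib2 = [("chr1", 100, "A>G"), ("chr2", 42, "C>T")]
print(reliable_calls([lib1, lib2]))  # only the shared SNP survives
```

Library-private calls are dropped, so the benchmark misses some true variants, which is exactly the acceptable trade-off described above.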
