I am trying to find the best way to identify a large insertion (~1 kb) in a sequence.
Basically, the scientists I am working with want to make a large insertion in a genome, then check that it is indeed there and determine the exact resulting sequences. Because the repair mechanism that is used is somewhat random, the boundaries of the insert can vary from one chromosome to another. The insertion might not have taken place at all, and I could have a deletion instead on one of the chromosomes. Moreover, the experiment will be done on a pool of cells, so we can end up with a pool of insertions/deletions with nested edges (which does not seem to be well supported by variant callers, even for smaller insertions/deletions).
The scientists I am working with would like the sequences and the frequencies of the alleles. Outputting the sequences also means being able to match the two boundaries of each insertion/deletion (the repair mechanism will be random at both ends). So somehow, I need a physical link between the two edges of my large insertion...
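To make the output I am after more concrete, here is a minimal sketch in Python. It assumes I already have one error-free, full-length amplicon sequence per molecule (from an assembly or from long reads, for example), which is exactly the part I do not know how to obtain. The function names (`breakpoints`, `describe`, `allele_table`) are made up for illustration, and the breakpoints come from a simple shared-prefix/shared-suffix comparison against the wild-type amplicon, not from a real aligner.

```python
from collections import Counter


def breakpoints(ref, allele):
    """Locate the edited region by trimming the shared prefix and suffix.

    Returns (left, ref_end, allele_end): ref[left:ref_end] is what was
    lost from the wild type and allele[left:allele_end] is what replaced it.
    """
    # Longest common prefix.
    left = 0
    max_left = min(len(ref), len(allele))
    while left < max_left and ref[left] == allele[left]:
        left += 1
    # Longest common suffix that does not overlap the prefix.
    suffix = 0
    max_suffix = min(len(ref), len(allele)) - left
    while suffix < max_suffix and ref[-1 - suffix] == allele[-1 - suffix]:
        suffix += 1
    return left, len(ref) - suffix, len(allele) - suffix


def describe(ref, allele):
    """Summarise one allele as deleted/inserted bases with both edges linked."""
    left, ref_end, allele_end = breakpoints(ref, allele)
    return {
        "ref_start": left,
        "ref_end": ref_end,
        "deleted": ref[left:ref_end],
        "inserted": allele[left:allele_end],
    }


def allele_table(ref, sequences):
    """Return (sequence, frequency, description) rows, most frequent first."""
    counts = Counter(sequences)
    total = sum(counts.values())
    return [(seq, n / total, describe(ref, seq)) for seq, n in counts.most_common()]


if __name__ == "__main__":
    # Toy data only: a short wild type, one insertion allele, one deletion allele.
    wild_type = "AAACCCGGGTTT"
    pool = ["AAACCCTATAGGGTTT", "AAACCCTATAGGGTTT", "AAACTTT", wild_type]
    for seq, freq, info in allele_table(wild_type, pool):
        print(f"{freq:.2f}", info)
```

Because each row is built from a single full-length sequence, the left and right edges of an event are linked by construction. With short reads that link is precisely what is missing.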
I am not quite sure what to advise in terms of experimental design: maybe amplify a large amplicon and circularise it, so that the edges of the insertion can be physically linked by paired-end reads, then randomly shear and sequence it?
But then, how would I align the reads, and against what?
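The only idea I have so far is to align against a small set of candidate references rather than the whole genome: the wild-type amplicon and the amplicon carrying the intended insert, each extended past its end so that reads spanning the circularisation junction can still map. Here is a rough sketch of building those references in Python. The helper names (`expected_allele`, `circular_reference`, `to_fasta`), the cut-site coordinate, and the read length are all assumptions of mine, and I do not know whether this is a sound approach.

```python
def expected_allele(wild_type, insert, cut_site):
    """Wild-type amplicon with the intended insert placed at cut_site (0-based)."""
    return wild_type[:cut_site] + insert + wild_type[cut_site:]


def circular_reference(linear, read_length):
    """Append the first read_length - 1 bases to the end of the sequence.

    A read that crosses the ligation junction of the circularised amplicon
    then has a contiguous stretch of reference to align to.
    """
    return linear + linear[: max(read_length - 1, 0)]


def to_fasta(records, width=60):
    """Format {name: sequence} as FASTA text with wrapped sequence lines."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        lines.extend(seq[i : i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy sequences only; real amplicons would be several kb long.
    wild_type = "ACGTACGTAATTGGCC"
    insert = "TTTTGGGG"
    edited = expected_allele(wild_type, insert, cut_site=8)
    references = {
        "wild_type": circular_reference(wild_type, read_length=5),
        "expected_insertion": circular_reference(edited, read_length=5),
    }
    print(to_fasta(references), end="")
```

This does not solve the problem of unexpected alleles: any insertion or deletion with different boundaries would still have to be discovered from the reads themselves.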
I had a look at structural variant callers, but I am afraid they will not handle populations of sequences and nested insertions/deletions well. Does anyone have experience with this? I have seen Genome STRiP (http://www.broadinstitute.org/software/genomestrip/download-genome-strip), but it does not seem to handle insertions.
I was advised to assemble the reads with SPAdes rather than aligning them directly. I know nothing about genome assembly. Could this be the way forward?
Any insight from the community would really help!
Thanks a lot!