I've developed a tool that identifies the points at which insertions are made. This basically means, for various BAM files, I have various lists of coordinates at which I think insertions were made with various degrees of confidence. I'm trying to detect indels with sizes from around 10bp to a few kb.
What I'd like to do is actually figure out _what_ was inserted (the actual nucleotide sequence and how long it is). My question has a few parts.
How complicated is this? How difficult is it to determine the content of the insertion? I imagine for small insertions at least, one could use local de novo assembly, then compare the assembled sequence to the reference. But for insertions around read-length or longer, the insertion can't be reconstructed just from the reads in the area. You need to also find the orphan reads that are all/mostly the insertion sequence, and use those too. Right? Is this something I can implement myself without spending months on it, and do a good job? If not...
What are my options in terms of existing tools? I know there are lots of indel callers. But some of them, for long inerstions at least, don't even tell you what the insertion sequence is. Also, I've already done the work of determining where insertions were made; I don't need these areas to be re-identified, just reconstructed.