Question

Determine Insertion Sequence

0

Entering edit mode

6.0 years ago

aostern ▴ 30

Hi guys,

I've developed a tool that identifies the points at which insertions are made. This basically means, for various BAM files, I have various lists of coordinates at which I think insertions were made with various degrees of confidence. I'm trying to detect indels with sizes from around 10bp to a few kb.

What I'd like to do is actually figure out _what_ was inserted (the actual nucleotide sequence and how long it is). My question has a few parts.

How complicated is this? How difficult is it to determine the content of the insertion? I imagine for small insertions at least, one could use local de novo assembly, then compare the assembled sequence to the reference. But for insertions around read-length or longer, the insertion can't be reconstructed just from the reads in the area. You need to also find the orphan reads that are all/mostly the insertion sequence, and use those too. Right? Is this something I can implement myself without spending months on it, and do a good job? If not...
What are my options in terms of existing tools? I know there are lots of indel callers. But some of them, for long inerstions at least, don't even tell you what the insertion sequence is. Also, I've already done the work of determining where insertions were made; I don't need these areas to be re-identified, just reconstructed.

Thanks!

assembly indel sv sequencing alignment • 1.7k views

ADD COMMENT • link updated 6.0 years ago by d-cameron ★ 2.9k • written 6.0 years ago by aostern ▴ 30

h.mon · Answer 1 · 2018-05-04

2

Entering edit mode

6.0 years ago

d-cameron ★ 2.9k

If the insertion is small with respect to the fragment size, there's also plenty of tools (e.g. GATK, manta, GRIDSS) that will assemble and detect those.

If a larger insertion is not novel with respect to the reference genome then there's lots of tool that will detect these, although generally be reported as translocations, not insertions.

What are my options in terms of existing tools?

NovelSeq is the only tool I am aware of that will reconstruct novel insertions larger that the fragment size. Unfortunately, it's somewhat old and designed for short (as in 50bp) reads so only uses discordant read pair in their breakend assemblies. GRIDSS (my tool) will give you better assemblies of your dangling breakend, and (the dev version) will also tell you if the inserted sequence looks novel with respect to your reference genome. I haven't yet implemented the de novo assembly part.

Is this something I can implement myself without spending months on it, and do a good job?

If you're doing the assembly yourself? Definitely not.That said, combining GRIDSS breakend assembly contigs with a de novo assembler should, in theory, outperform NovelSeq and be achievable in a reasonable timeframe. It's on my TODO list of extensions to GRIDSS but I'm definitely open to collaboration on the development of such a tool.

ADD COMMENT • link 6.0 years ago by d-cameron ★ 2.9k

0

Entering edit mode

Thanks for the quick response! GRIDSS looks pretty cool—I think I'll try it! Just to be clear, what information (out of position, length, nucleotide content) does it report on insertions, and for what sizes?

ADD REPLY • link 6.0 years ago by aostern ▴ 30

1

Entering edit mode

For small insertions (less than around half the library fragment size), it reports the exact indel position and sequence. For longer variants, it reports the position and the nucleotide sequence of as much of the insertion as it could assemble from the anchoring reads (soft-clipped reads + unmapped reads with read mate mapped nearby). A large insertion on chr1 between bases 12345 and 12346 would look like:

chr1 12345 example_left_side A ACGTGCGTCGG. . PASS SVTYPE=BND
chr1 12346 example_right_side T .TGTCATGCATGCAT. . PASS SVTYPE=BND

Notice the . which is part of the ALT allele. This notation is covered in VCFv4.3 Section 5.4.9 "Single breakends" and indicates that the AT at position 12345-12346 is now ACGTGCGTCGG?????????????????TGTCATGCATGCAT.

ADD REPLY • link 6.0 years ago by d-cameron ★ 2.9k

0

Entering edit mode

Breakend support is currently only on the dev branch. I'll be making an official versioned release with the capability within a week.

ADD REPLY • link 6.0 years ago by d-cameron ★ 2.9k

0

Entering edit mode

Sounds good. Thanks! Have you heard of Pamir (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870608/), btw?

ADD REPLY • link updated 6.0 years ago by h.mon 35k • written 6.0 years ago by aostern ▴ 30

1

Entering edit mode

I hadn't. Looks like it's definitely worth trying out before attempting to put something together yourself.

ADD REPLY • link 6.0 years ago by d-cameron ★ 2.9k

score 1 · Answer 2 · 2018-05-03

1

Entering edit mode

6.0 years ago

h.mon 35k

I can only provide suggestions regarding existing tools.

I think Subread + Subindel programs may do what you want, to a certain extent. Subindel uses the reads mapped by Subread to find indels and perform local assembly, but it is limited to indels up to 200bp.

Another option, as you said you can identify the breakpoints, is to use these flanking regions as seeds for local assemblers, such as Mapsembler2 and Tadpole.

ADD COMMENT • link 6.0 years ago by h.mon 35k

0

Entering edit mode

Thanks for the quick response. It sounds like most tools are pretty limited in the size of the insertions they can find the content of.

ADD REPLY • link 6.0 years ago by aostern ▴ 30