Question

long insert size estimation

10.8 years ago

Damian Kao 16k

I have multiple libraries supposedly made at insert sizes from 6kb to 20kb.

I have un-scaffolded contigs (~30X coverage) from a previous assembly of this genome. There isn't a scaffolded assembly available right now.

What is the best way to go about estimating insert sizes for these long insert libraries?

I've tried mapping the libraries back to contigs > 3kb with bowtie2, specifying the -I and -X parameters (minimum and maximum insert sizes). I've tried different minimums, and the insert size distribution seems to center around whatever minimum insert size I set. This makes me think that bowtie2 is not handling long insert sizes correctly.

Should I be using another mapper? There are only ~100,000 contigs > 3kb in the genome assembly, which resulted in only ~10k-100k reads mapping across the libraries. Should I even bother trying to estimate insert size from these contigs? Should I reassemble the contigs with the current reads?

genome insert-size • 2.5k views

updated 3.5 years ago by Ram 45k • written 10.8 years ago by Damian Kao 16k

score 0 · Answer 1 · 2014-10-03

I have always thought that the insert size parameters are misguided, and it is never clear what happens when one makes use of them. The concept of a 'concordant pair' or 'mapped in a proper pair' imposes a certain view of what the genome should look like, but that may not be at all what the current genome is. It dates from an era when people thought that non-concordant must mean an error.

I would use a mapper that has no expectation of what the "right" mate distances are, and work backwards: see how the mates map individually and how far apart they are. The mates could even be mapped in single-end mode.

Ram · Answer 2 · 2014-10-03

I normally use GSNAP to map the mate pairs back to the closest reference genome (with stringent settings: no soft clipping, a low number of mismatches, etc.) and then plot the mapping distance between the read pairs. The settings --pairmax-dna and --pairdev allow you to specify the insert size and standard deviation to be used for calling concordant/discordant. I hope this helps!

BTW, GSNAP will be considerably slower than other aligners/mappers unless you split the reads up!
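For the "plot the mapping distance" step, something like the following Python sketch may be enough before reaching for a plotting library. It assumes you already have a list of per-pair distances in bp (however they were obtained) and reports the median, the median absolute deviation, and binned counts, so that a minority of outlying pairs does not dominate the summary. The function names and the 1 kb bin width are arbitrary choices of mine.

```python
from statistics import median


def summarize_distances(distances, bin_width=1000):
    """Return (median, MAD, {bin_start: count}) for a list of pair distances."""
    if not distances:
        raise ValueError("no distances to summarize")
    med = median(distances)
    # Median absolute deviation: a robust stand-in for the standard deviation.
    mad = median([abs(d - med) for d in distances])
    counts = {}
    for d in distances:
        bin_start = (d // bin_width) * bin_width
        counts[bin_start] = counts.get(bin_start, 0) + 1
    return med, mad, dict(sorted(counts.items()))


def print_histogram(counts, width=50):
    """Print a crude text histogram, one row per bin."""
    peak = max(counts.values())
    for bin_start, n in counts.items():
        bar = "#" * max(1, round(width * n / peak))
        print(f"{bin_start:>8} {n:>8} {bar}")
```

Calling `print_histogram(summarize_distances(distances)[2])` gives a quick look at where each library actually peaks, which can then be compared against the nominal 6kb to 20kb sizes.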