Question: long insert size estimation
Damian Kao (15k) wrote, 6.2 years ago:

I have multiple libraries supposedly made at insert sizes from 6kb to 20kb.

I have un-scaffolded contigs (~30X coverage) from a previous assembly of this genome. There isn't a scaffolded assembly available right now. 

What is the best way to go about estimating insert sizes for these long insert libraries?

I've tried mapping the libraries back to contigs > 3kb with bowtie2, specifying the -I and -X parameters (minimum and maximum insert sizes). I've tried different minimums, and the insert size distribution seems to center around whatever minimum insert size parameter I set. This makes me think that bowtie2 is just not handling long insert sizes correctly.

Should I be using another mapper? There are only ~100,000 contigs > 3kb in the genome, which resulted in only ~10k-100k reads mapping across the libraries. Should I even bother trying to estimate insert size from these contigs? Should I re-assemble the contigs with the current reads?

insert-size genome • 1.6k views
Istvan Albert ♦♦ 85k (University Park, USA) wrote, 6.2 years ago:

I have always thought that the parameters for insert size estimation are misguided, and it is never clear what happens when one makes use of them. The concept of a 'concordant pair' or 'mapped in a proper pair' imposes a certain view of what the genome should look like, but that may not be at all what the current genome is. It comes from an era when people thought that non-concordant must mean an error.

I would use a mapper that does not have any expectation of what the "right" mate distances are, and work backwards: see how the mates map individually and how far they are from each other. The reads could even be mapped in single-end mode.
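A minimal sketch of the "map individually, then work backwards" idea, assuming each mate file was mapped separately in single-end mode and the output is plain SAM text. The function names (`parse_sam_line`, `load_positions`, `mate_distances`) are made up for illustration, and the code assumes both mates carry the same read name in the two files (any `/1` or `/2` suffix would need stripping first). Only pairs whose mates land on the same contig yield a distance.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def parse_sam_line(line):
    """Return (read_name, contig, start, end) for a primary mapped record, else None."""
    fields = line.rstrip("\n").split("\t")
    name, flag, contig = fields[0], int(fields[1]), fields[2]
    pos, cigar = int(fields[3]), fields[5]
    # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
    if flag & 0x4 or flag & 0x900 or contig == "*" or cigar == "*":
        return None
    ref_len = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)
    return name, contig, pos, pos + ref_len - 1


def load_positions(sam_lines):
    """Map read name -> (contig, start, end), ignoring SAM header lines."""
    positions = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue
        record = parse_sam_line(line)
        if record is not None:
            name, contig, start, end = record
            positions[name] = (contig, start, end)
    return positions


def mate_distances(sam_lines_1, sam_lines_2):
    """Outer distance (leftmost start to rightmost end) for mates on the same contig."""
    first = load_positions(sam_lines_1)
    second = load_positions(sam_lines_2)
    distances = []
    for name, (contig1, start1, end1) in first.items():
        hit = second.get(name)
        if hit is None or hit[0] != contig1:
            continue  # mate unmapped, or mapped to a different contig
        _, start2, end2 = hit
        distances.append(max(end1, end2) - min(start1, start2) + 1)
    return distances
```

With two SAM files on disk, this could be called as `mate_distances(open("mate1.sam"), open("mate2.sam"))` (file names are placeholders). No orientation or distance filter is applied, so the resulting distribution reflects whatever the mapper found rather than what the -I/-X style parameters expect.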

arnstrm (1.8k), Ames, IA, wrote 6.2 years ago:

I normally use GSNAP to map the mate pairs back to the closest reference genome (with stringent settings: no soft clipping, a low number of mismatches, etc.) and then plot the mapping distance between the read pairs. The settings --pairmax-dna and --pairdev allow you to specify the insert size and standard deviation to be used for calling pairs concordant or discordant. I hope this helps!

BTW, GSNAP will be considerably slower than other aligners/mappers unless you split the reads up!
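As a rough companion to plotting the mapping distances, the sketch below summarizes a list of per-pair distances with a median and a median absolute deviation (MAD), which are less sensitive to a few stray distances than a mean and standard deviation, and bins the values for a quick text histogram. The function name `summarize_distances` and the 1 kb default bin width are illustrative choices, not part of GSNAP or any other tool.

```python
from statistics import median


def summarize_distances(distances, bin_width=1000):
    """Return a robust center/spread and a binned histogram of mapping distances."""
    if not distances:
        raise ValueError("no distances to summarize")
    center = median(distances)
    # Median absolute deviation: typical distance of a value from the median.
    mad = median(abs(d - center) for d in distances)
    histogram = {}
    for d in distances:
        bin_start = (d // bin_width) * bin_width
        histogram[bin_start] = histogram.get(bin_start, 0) + 1
    return {
        "n": len(distances),
        "median": center,
        "mad": mad,
        # 1.4826 * MAD approximates the standard deviation for normal data.
        "robust_sd": 1.4826 * mad,
        "histogram": dict(sorted(histogram.items())),
    }
```

The median and robust standard deviation from such a summary are one possible source for the insert size and deviation values that a mapper or scaffolder asks for, with the histogram serving as a sanity check on whether the distribution has a single clear peak.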

