Question: Estimate Insert Size In Paired-End/Mate-Pair
gravatar for Ric
8.7 years ago by
Ric160 wrote:


I have Illumina paired-end/mate-pair reads and I would like to find out the min and max insert size is. For Aligner like SOAP2 is necessary to specify the insert size?

What is the best way to calculate the min and max insert size?

Thank you in advance.

ADD COMMENTlink written 8.7 years ago by Ric160

Usually it does not matter too much. Giving the max a large value (e.g. 1000bp for typical Illumina reads) is fine. Or use BWA, which estimates the insert size on the fly.

ADD REPLYlink written 8.7 years ago by lh332k
gravatar for csmiller
8.4 years ago by
csmiller160 wrote:

From a SAM/BAM file, you can use Picard:

You can also get mean and standard deviation quickly from a SAM/BAM file using these two awk scripts, replacing YOUR_MEAN in the second line with the output of the first line:

head -10000 mappings.sam | awk '{if ($9 > 0) {S+=$9; T+=1}}END{print "Mean: " S/T}' 
head -10000 mappings.sam | awk 'M=YOUR_MEAN{if ($9 > 0) {S+=($9-M)*($9-M); T+=1}}END{print "StdDev: " sqrt(S/T)}'
ADD COMMENTlink written 8.4 years ago by csmiller160

Or calculate both stats simultaneously:

awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9*$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((S2-M*M*N)/(N-1))}'

Note that this (and your examples) counts all pairs twice, you probably should check the flags to only count the "left" ends or something.

ADD REPLYlink modified 8.1 years ago • written 8.1 years ago by Ketil4.0k

I know I'm late in the game, but how would you check for left ends only, using awk?

ADD REPLYlink written 7.4 years ago by Lee Katz3.0k

continuing the late-comment theme... For bowtie alignments, the TLEN field ($9 in awk) is negative for the reverse-strand end, so I don't think this will actually count twice with bowtie. This will, however, double count read pairs that align more than one place. You're right that you could do this better by bitmasking the FLAG field for first in pair and primary alignment, but not all versions of awk have this capability.

ADD REPLYlink written 7.0 years ago by csmiller160

Can you please explain how this formula works:

sqrt ((S2-M*M*N)/(N-1))
ADD REPLYlink written 4.1 years ago by Medhat8.8k

It gives strange results for mean, and I don't know why. For my sam file, for 10000 lines it gives 217 (picard gives 198). This script for all reads gives 1032,79. Overall it seems that the value increases along with the read number.


Other, smaller, file. Picard mean = 156 Qualimap and mentioned awk script mean=1262.04. More strangely, the histogram of insert size has values ending at 475, median is 165.

If there are some outlier rejection in Picard? It looks to me that mean and sd are quite objective values.

ADD REPLYlink modified 13 months ago • written 13 months ago by boczniak767690
gravatar for Oliver
8.7 years ago by
Berlin, Germany
Oliver230 wrote:

I had the same question and helped myself by aligning my reads in several runs with some useful values for this parameter and calculating the insert size distribution afterwards with the script from this thread.

I had to estimate the parameter for a TopHat run and I tried several parameter values for the insert size and I found out that this parameter almost had no impact, at least on the dataset I used.

Hope that helped!

ADD COMMENTlink modified 13 months ago by RamRS30k • written 8.7 years ago by Oliver230
gravatar for Leszek
8.7 years ago by
IIMCB, Poland
Leszek4.1k wrote:

Usually (if there is no reference genome), I get some crappy assembly with or without PE info (SOAPdenovo/Velvet) and then I align 100k paired reads (BWA or Bowtie). From that you will get quite accurate mean and SD of insert size.

ADD COMMENTlink written 8.7 years ago by Leszek4.1k
gravatar for Wen.Huang
8.7 years ago by
Wen.Huang1.2k wrote:

The only rough way to know those numbers pre-alignment would be to look at the gel image (or bioanalyzer's digital gel). For post-alignment reads, 'picard tools' may have a utility that does what you want.

ADD COMMENTlink written 8.7 years ago by Wen.Huang1.2k
gravatar for madk00k
8.5 years ago by
madk00k350 wrote:

You can use Qualimap tool to calculate insert size distribution from BAM file.

ADD COMMENTlink written 8.5 years ago by madk00k350
gravatar for Ketil
8.1 years ago by
Ketil4.0k wrote:

A bit late to the party, but I wrote a small tool to generate various statistics on insert sizes. It does require you to map to a reference, but the reference doesn't have to be very good as you only need enough matches to get an estimate.

More info here:

ADD COMMENTlink modified 8.1 years ago • written 8.1 years ago by Ketil4.0k

One note: computing and so on may be greatly affected by outliers. That is why bwa first excludes outliers and then performs the calculation.

ADD REPLYlink written 8.1 years ago by lh332k

Yes, I know. This was my primary motivation for doing quantiles instead, but I should probably do some outlier removal for the statistics. (Currently, my program streams over the data in linear time and constant space, and outlier removal will probably break that. :-( )

ADD REPLYlink written 8.1 years ago by Ketil4.0k

Hey, I tried your program (bamstats) and it seems like it requires a specific version of samtools. I couldn't get it installed, is there any way I can test it out?


ADD REPLYlink written 6.3 years ago by arnstrm1.8k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1924 users visited in the last hour