Question: Estimate Insert Size In Paired-End/Mate-Pair
16
8.0 years ago by
Ric160
Ric160 wrote:

Hello,

I have Illumina paired-end/mate-pair reads and I would like to find out the min and max insert size is. For Aligner like SOAP2 is necessary to specify the insert size?

What is the best way to calculate the min and max insert size?

written 8.0 years ago by Ric160
4

Usually it does not matter too much. Giving the max a large value (e.g. 1000bp for typical Illumina reads) is fine. Or use BWA, which estimates the insert size on the fly.

16
7.7 years ago by
csmiller160
csmiller160 wrote:

From a SAM/BAM file, you can use Picard:

http://picard.sourceforge.net/command-line-overview.shtml#CollectInsertSizeMetrics

You can also get mean and standard deviation quickly from a SAM/BAM file using these two awk scripts, replacing YOUR_MEAN in the second line with the output of the first line:

``````head -10000 mappings.sam | awk '{if (\$9 > 0) {S+=\$9; T+=1}}END{print "Mean: " S/T}'
head -10000 mappings.sam | awk 'M=YOUR_MEAN{if (\$9 > 0) {S+=(\$9-M)*(\$9-M); T+=1}}END{print "StdDev: " sqrt(S/T)}'
``````
1

Or calculate both stats simultaneously:

``````awk '{ if (\$9 > 0) { N+=1; S+=\$9; S2+=\$9*\$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((S2-M*M*N)/(N-1))}'
``````

Note that this (and your examples) counts all pairs twice, you probably should check the flags to only count the "left" ends or something.

I know I'm late in the game, but how would you check for left ends only, using awk?

continuing the late-comment theme... For bowtie alignments, the TLEN field (\$9 in awk) is negative for the reverse-strand end, so I don't think this will actually count twice with bowtie. This will, however, double count read pairs that align more than one place. You're right that you could do this better by bitmasking the FLAG field for first in pair and primary alignment, but not all versions of awk have this capability.

Can you please explain how this formula works:

``````sqrt ((S2-M*M*N)/(N-1))
``````

It gives strange results for mean, and I don't know why. For my sam file, for 10000 lines it gives 217 (picard gives 198). This script for all reads gives 1032,79. Overall it seems that the value increases along with the read number.

Edit.

Other, smaller, file. Picard mean = 156 Qualimap and mentioned awk script mean=1262.04. More strangely, the histogram of insert size has values ending at 475, median is 165.

If there are some outlier rejection in Picard? It looks to me that mean and sd are quite objective values.

6
8.0 years ago by
Oliver230
Berlin, Germany
Oliver230 wrote:

I had the same question and helped myself by aligning my reads in several runs with some useful values for this parameter and calculating the insert size distribution afterwards with the script from this thread.

I had to estimate the parameter for a TopHat run and I tried several parameter values for the insert size and I found out that this parameter almost had no impact, at least on the dataset I used.

Hope that helped!

4
8.0 years ago by
Leszek4.0k
IIMCB, Poland
Leszek4.0k wrote:

Usually (if there is no reference genome), I get some crappy assembly with or without PE info (SOAPdenovo/Velvet) and then I align 100k paired reads (BWA or Bowtie). From that you will get quite accurate mean and SD of insert size.

3
8.0 years ago by
Wen.Huang1.2k
Wen.Huang1.2k wrote:

The only rough way to know those numbers pre-alignment would be to look at the gel image (or bioanalyzer's digital gel). For post-alignment reads, 'picard tools' may have a utility that does what you want.

3
7.8 years ago by
Heidelberg

You can use Qualimap tool to calculate insert size distribution from BAM file.

2
7.4 years ago by
Ketil4.0k
Germany
Ketil4.0k wrote:

A bit late to the party, but I wrote a small tool to generate various statistics on insert sizes. It does require you to map to a reference, but the reference doesn't have to be very good as you only need enough matches to get an estimate.

1

One note: computing std.dev and so on may be greatly affected by outliers. That is why bwa first excludes outliers and then performs the calculation.

1

Yes, I know. This was my primary motivation for doing quantiles instead, but I should probably do some outlier removal for the statistics. (Currently, my program streams over the data in linear time and constant space, and outlier removal will probably break that. :-( )

Hey, I tried your program (bamstats) and it seems like it requires a specific version of samtools. I couldn't get it installed, is there any way I can test it out?

Thanks!