Estimate Insert Size In Paired-End/Mate-Pair
6
18
Entering edit mode
12.8 years ago
Ric ▴ 190

Hello,

I have Illumina paired-end/mate-pair reads and I would like to find out the min and max insert size is. For Aligner like SOAP2 is necessary to specify the insert size?

What is the best way to calculate the min and max insert size?

Thank you in advance.

next-gen sequencing alignment paired • 51k views
ADD COMMENT
4
Entering edit mode

Usually it does not matter too much. Giving the max a large value (e.g. 1000bp for typical Illumina reads) is fine. Or use BWA, which estimates the insert size on the fly.

ADD REPLY
18
Entering edit mode
12.5 years ago
csmiller ▴ 180

From a SAM/BAM file, you can use Picard:

http://picard.sourceforge.net/command-line-overview.shtml#CollectInsertSizeMetrics

You can also get mean and standard deviation quickly from a SAM/BAM file using these two awk scripts, replacing YOUR_MEAN in the second line with the output of the first line:

head -10000 mappings.sam | awk '{if ($9 > 0) {S+=$9; T+=1}}END{print "Mean: " S/T}' 
head -10000 mappings.sam | awk 'M=YOUR_MEAN{if ($9 > 0) {S+=($9-M)*($9-M); T+=1}}END{print "StdDev: " sqrt(S/T)}'
ADD COMMENT
1
Entering edit mode

Or calculate both stats simultaneously:

awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9*$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((S2-M*M*N)/(N-1))}'

Note that this (and your examples) counts all pairs twice, you probably should check the flags to only count the "left" ends or something.

ADD REPLY
0
Entering edit mode

I know I'm late in the game, but how would you check for left ends only, using awk?

ADD REPLY
0
Entering edit mode

continuing the late-comment theme... For bowtie alignments, the TLEN field ($9 in awk) is negative for the reverse-strand end, so I don't think this will actually count twice with bowtie. This will, however, double count read pairs that align more than one place. You're right that you could do this better by bitmasking the FLAG field for first in pair and primary alignment, but not all versions of awk have this capability.

ADD REPLY
0
Entering edit mode

Can you please explain how this formula works:

sqrt ((S2-M*M*N)/(N-1))
ADD REPLY
0
Entering edit mode

It gives strange results for mean, and I don't know why. For my sam file, for 10000 lines it gives 217 (picard gives 198). This script for all reads gives 1032,79. Overall it seems that the value increases along with the read number.

Edit.

Other, smaller, file. Picard mean = 156
Qualimap and mentioned awk script mean=1262.04

More strangely, the histogram of insert size has values ending at 475, median is 165.

If there are some outlier rejection in Picard? It looks to me that mean and sd are quite objective values.

ADD REPLY
6
Entering edit mode
12.8 years ago
Oliver ▴ 240

I had the same question and helped myself by aligning my reads in several runs with some useful values for this parameter and calculating the insert size distribution afterwards with the script from this thread.

I had to estimate the parameter for a TopHat run and I tried several parameter values for the insert size and I found out that this parameter almost had no impact, at least on the dataset I used.

Hope that helped!

ADD COMMENT
4
Entering edit mode
12.8 years ago
Leszek 4.2k

Usually (if there is no reference genome), I get some crappy assembly with or without PE info (SOAPdenovo/Velvet) and then I align 100k paired reads (BWA or Bowtie). From that you will get quite accurate mean and SD of insert size.

ADD COMMENT
3
Entering edit mode
12.8 years ago
Wen.Huang ★ 1.2k

The only rough way to know those numbers pre-alignment would be to look at the gel image (or bioanalyzer's digital gel). For post-alignment reads, 'picard tools' may have a utility that does what you want.

ADD COMMENT
3
Entering edit mode
12.6 years ago
madk00k ▴ 360

You can use Qualimap tool to calculate insert size distribution from BAM file.

ADD COMMENT
2
Entering edit mode
12.1 years ago
Ketil 4.1k

A bit late to the party, but I wrote a small tool to generate various statistics on insert sizes. It does require you to map to a reference, but the reference doesn't have to be very good as you only need enough matches to get an estimate.

More info here: http://blog.malde.org/posts/bamstats.html

ADD COMMENT
1
Entering edit mode

One note: computing std.dev and so on may be greatly affected by outliers. That is why bwa first excludes outliers and then performs the calculation.

ADD REPLY
1
Entering edit mode

Yes, I know. This was my primary motivation for doing quantiles instead, but I should probably do some outlier removal for the statistics. (Currently, my program streams over the data in linear time and constant space, and outlier removal will probably break that. :-( )

ADD REPLY
0
Entering edit mode

Hey, I tried your program (bamstats) and it seems like it requires a specific version of samtools. I couldn't get it installed, is there any way I can test it out?

Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 1717 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6