Question: Estimate stdev and mean in paired-end
0
Ric280 wrote:

Hi, I found two following approaches how to estimate stdev and mean in paired-end:

First approach:

``````samtools view OC.bam | head -n 10000 | cut -f9 > initial.insertsizes.txt

a.v = a[a[,1]>0,1]
mn = quantile(a.v, seq(0,1,0.05))
mx = quantile(a.v, seq(0,1,0.05))
mean(a.v[a.v >= mn & a.v <= mx])       # mean
 345.9762
sd(a.v[a.v >= mn & a.v <= mx])
 39.57006
``````

Second approach:

``````samtools view OC.bam | head -n 10000 | awk '{ if (\$9 > 0) { N+=1; S+=\$9; S2+=\$9*\$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((S2-M*M*N)/(N-1))}'

n=5081, mean=345.141, stdev=70.5826
``````

In both approaches the mean look the same, but stdev is different.

Which stdev is correct?

Mic

modified 3.3 years ago by i.sudbery6.3k • written 3.3 years ago by Ric280

I think the second way was trying to use Welford’s method
http://www.johndcook.com/blog/standard_deviation/

in such case it should be

sqrt ((N*S2-S*S)/N(N-1))

can you try it?

I changed it to `awk '{ if (\$9 > 0) { N+=1; S+=\$9; S2+=\$9*\$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((N*S2-S*S)/N(N-1))}'`, but I got `awk: (FILENAME=- FNR=10000) fatal: function`N' not defined`.

Just put quotes around it

``````sqrt (((N*S2)-(S*S))/(N(N-1)))
``````

It did not work. Could you please provide the full command.

samtools view OC.bam | head -n 10000 | awk '{ if (\$9 > 0) { N+=1; S+=\$9; S2+=\$9\$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt(((NS2)-(S*S))/(N(N-1)))}'

I still get `awk: (FILENAME=- FNR=200000) fatal: function`N' not defined` with the above command.

Is this RNA or DNA? If its RNA, you are going to be thrown off by splicing - where reads are in different exons you'll get a much bigger insert size.

You often have read pairs with a very large insert size that bias the mean. Thus, you may want to consider using median and median absolute deviation - median() and mad() in R.

You are only sampling 10000 lines when there may be millions in this file.

Take a look at Qualimap. It can provide all sorts of addtional information about your bam files.

If you are able to go back to original R1/R2 files then you can use BBMerge from BBMap to estimate the insert sizes like described here.

BBMerge provides only Avg Insert and not mean and standard deviation is different to `CollectInsertSizeMetrics`.

0
i.sudbery6.3k wrote:

picard's CollectInstertSizeMetrics should calculate these.

`java -jar picard.jar CollectInsertSizeMetrics I=OC.sam O=insert_size_metrics.txt H=insert_size_histogram.pdf` has given me for `STANDARD_DEVIATION = 70.562457` and `MEAN_INSERT_SIZE = 344.804478`