Entering edit mode

5.2 years ago

Ric
▴
360

Hi, I found two following approaches how to estimate stdev and mean in paired-end:

First approach:

```
samtools view OC.bam | head -n 10000 | cut -f9 > initial.insertsizes.txt
a = read.table("initial.insertsizes.txt")
a.v = a[a[,1]>0,1]
mn = quantile(a.v, seq(0,1,0.05))[4]
mx = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx]) # mean
[1] 345.9762
sd(a.v[a.v >= mn & a.v <= mx])
[1] 39.57006
```

Second approach:

```
samtools view OC.bam | head -n 10000 | awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9*$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((S2-M*M*N)/(N-1))}'
n=5081, mean=345.141, stdev=70.5826
```

In both approaches the mean look the same, but stdev is different.

Which stdev is correct?

Thank you in advance.

Mic

I think the second way was trying to use Welfordâ€™s method

http://www.johndcook.com/blog/standard_deviation/

in such case it should be

can you try it?

I changed it to

`awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9*$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((N*S2-S*S)/N(N-1))}'`

, but I got`awk: (FILENAME=- FNR=10000) fatal: function`

N' not defined`.Just put quotes around it

It did not work. Could you please provide the full command.

I still get

`awk: (FILENAME=- FNR=200000) fatal: function`

N' not defined` with the above command.Is this RNA or DNA? If its RNA, you are going to be thrown off by splicing - where reads are in different exons you'll get a much bigger insert size.

You often have read pairs with a very large insert size that bias the mean. Thus, you may want to consider using median and median absolute deviation - median() and mad() in R.

You are only sampling 10000 lines when there may be millions in this file.

Take a look at Qualimap. It can provide all sorts of addtional information about your bam files.

If you are able to go back to original R1/R2 files then you can use BBMerge from BBMap to estimate the insert sizes like described here.

BBMerge provides only Avg Insert and not mean and standard deviation is different to

`CollectInsertSizeMetrics`

.