Question: Estimate stdev and mean in paired-end
0
3.3 years ago by
Ric280
Australia
Ric280 wrote:

Hi, I found two following approaches how to estimate stdev and mean in paired-end:

First approach:

``````samtools view OC.bam | head -n 10000 | cut -f9 > initial.insertsizes.txt

a = read.table("initial.insertsizes.txt")
a.v = a[a[,1]>0,1]
mn = quantile(a.v, seq(0,1,0.05))[4]
mx = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx])       # mean
[1] 345.9762
sd(a.v[a.v >= mn & a.v <= mx])
[1] 39.57006
``````

Second approach:

``````samtools view OC.bam | head -n 10000 | awk '{ if (\$9 > 0) { N+=1; S+=\$9; S2+=\$9*\$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((S2-M*M*N)/(N-1))}'

n=5081, mean=345.141, stdev=70.5826
``````

In both approaches the mean look the same, but stdev is different.

Which stdev is correct?

Thank you in advance.

Mic

ADD COMMENTlink
modified 3.3 years ago by i.sudbery6.3k • written 3.3 years ago by Ric280

I think the second way was trying to use Welfordâ€™s method
http://www.johndcook.com/blog/standard_deviation/

in such case it should be

sqrt ((N*S2-S*S)/N(N-1))

can you try it?

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Medhat8.6k

I changed it to `awk '{ if (\$9 > 0) { N+=1; S+=\$9; S2+=\$9*\$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((N*S2-S*S)/N(N-1))}'`, but I got `awk: (FILENAME=- FNR=10000) fatal: function`N' not defined`.

ADD REPLYlink written 3.3 years ago by Ric280

Just put quotes around it

``````sqrt (((N*S2)-(S*S))/(N(N-1)))
``````
ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Medhat8.6k

It did not work. Could you please provide the full command.

ADD REPLYlink written 3.3 years ago by Ric280

samtools view OC.bam | head -n 10000 | awk '{ if (\$9 > 0) { N+=1; S+=\$9; S2+=\$9\$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt(((NS2)-(S*S))/(N(N-1)))}'

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Medhat8.6k

I still get `awk: (FILENAME=- FNR=200000) fatal: function`N' not defined` with the above command.

ADD REPLYlink written 3.3 years ago by Ric280

Is this RNA or DNA? If its RNA, you are going to be thrown off by splicing - where reads are in different exons you'll get a much bigger insert size.

ADD REPLYlink written 3.3 years ago by i.sudbery6.3k

You often have read pairs with a very large insert size that bias the mean. Thus, you may want to consider using median and median absolute deviation - median() and mad() in R.

ADD REPLYlink written 3.3 years ago by trausch1.4k

You are only sampling 10000 lines when there may be millions in this file.

Take a look at Qualimap. It can provide all sorts of addtional information about your bam files.

If you are able to go back to original R1/R2 files then you can use BBMerge from BBMap to estimate the insert sizes like described here.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by genomax75k

BBMerge provides only Avg Insert and not mean and standard deviation is different to `CollectInsertSizeMetrics`.

ADD REPLYlink written 3.3 years ago by Ric280
0
3.3 years ago by
i.sudbery6.3k
Sheffield, UK
i.sudbery6.3k wrote:

picard's CollectInstertSizeMetrics should calculate these.

ADD COMMENTlink written 3.3 years ago by i.sudbery6.3k

`java -jar picard.jar CollectInsertSizeMetrics I=OC.sam O=insert_size_metrics.txt H=insert_size_histogram.pdf` has given me for `STANDARD_DEVIATION = 70.562457` and `MEAN_INSERT_SIZE = 344.804478`

ADD REPLYlink written 3.3 years ago by Ric280
Please log in to add an answer.

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1202 users visited in the last hour