Question

Estimate stdev and mean in paired-end

0

Entering edit mode

7.7 years ago

Ric ▴ 430

Hi, I found two following approaches how to estimate stdev and mean in paired-end:

First approach:

samtools view OC.bam | head -n 10000 | cut -f9 > initial.insertsizes.txt

a = read.table("initial.insertsizes.txt")
a.v = a[a[,1]>0,1]
mn = quantile(a.v, seq(0,1,0.05))[4]
mx = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx])       # mean
[1] 345.9762
sd(a.v[a.v >= mn & a.v <= mx])
[1] 39.57006

Second approach:

samtools view OC.bam | head -n 10000 | awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9*$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((S2-M*M*N)/(N-1))}' 

n=5081, mean=345.141, stdev=70.5826

In both approaches the mean look the same, but stdev is different.

Which stdev is correct?

Thank you in advance.

Mic

masurca Assembly paired-end sequencing • 2.6k views

ADD COMMENT • link updated 7.7 years ago by i.sudbery 19k • written 7.7 years ago by Ric ▴ 430

0

Entering edit mode

I think the second way was trying to use Welford’s method
http://www.johndcook.com/blog/standard_deviation/

in such case it should be

sqrt ((N*S2-S*S)/N(N-1))

can you try it?

ADD REPLY • link 7.7 years ago by Medhat 9.7k

0

Entering edit mode

I changed it to awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9*$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((N*S2-S*S)/N(N-1))}', but I got awk: (FILENAME=- FNR=10000) fatal: functionN' not defined`.

ADD REPLY • link 7.7 years ago by Ric ▴ 430

0

Entering edit mode

Just put quotes around it

sqrt (((N*S2)-(S*S))/(N(N-1)))

ADD REPLY • link 7.7 years ago by Medhat 9.7k

0

Entering edit mode

It did not work. Could you please provide the full command.

ADD REPLY • link 7.7 years ago by Ric ▴ 430

0

Entering edit mode

samtools view OC.bam | head -n 10000 | awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt(((NS2)-(S*S))/(N(N-1)))}'

ADD REPLY • link 7.7 years ago by Medhat 9.7k

0

Entering edit mode

I still get awk: (FILENAME=- FNR=200000) fatal: functionN' not defined` with the above command.

ADD REPLY • link 7.7 years ago by Ric ▴ 430

0

Entering edit mode

Is this RNA or DNA? If its RNA, you are going to be thrown off by splicing - where reads are in different exons you'll get a much bigger insert size.

ADD REPLY • link 7.7 years ago by i.sudbery 19k

0

Entering edit mode

You often have read pairs with a very large insert size that bias the mean. Thus, you may want to consider using median and median absolute deviation - median() and mad() in R.

ADD REPLY • link 7.7 years ago by trausch ★ 1.9k

0

Entering edit mode

You are only sampling 10000 lines when there may be millions in this file.

Take a look at Qualimap. It can provide all sorts of addtional information about your bam files.

If you are able to go back to original R1/R2 files then you can use BBMerge from BBMap to estimate the insert sizes like described here.

ADD REPLY • link 7.7 years ago by GenoMax 141k

0

Entering edit mode

BBMerge provides only Avg Insert and not mean and standard deviation is different to CollectInsertSizeMetrics.

ADD REPLY • link 7.7 years ago by Ric ▴ 430

score 0 · Answer 1 · 2016-08-19

0

Entering edit mode

7.7 years ago

i.sudbery 19k

picard's CollectInstertSizeMetrics should calculate these.

ADD COMMENT • link 7.7 years ago by i.sudbery 19k

0

Entering edit mode

java -jar picard.jar CollectInsertSizeMetrics I=OC.sam O=insert_size_metrics.txt H=insert_size_histogram.pdf has given me for STANDARD_DEVIATION = 70.562457 and MEAN_INSERT_SIZE = 344.804478

ADD REPLY • link 7.7 years ago by Ric ▴ 430