Estimate stdev and mean in paired-end
1
0
Entering edit mode
5.2 years ago
Ric ▴ 360

Hi, I found two following approaches how to estimate stdev and mean in paired-end:

First approach:

samtools view OC.bam | head -n 10000 | cut -f9 > initial.insertsizes.txt

a.v = a[a[,1]>0,1]
mn = quantile(a.v, seq(0,1,0.05))
mx = quantile(a.v, seq(0,1,0.05))
mean(a.v[a.v >= mn & a.v <= mx])       # mean
 345.9762
sd(a.v[a.v >= mn & a.v <= mx])
 39.57006


Second approach:

samtools view OC.bam | head -n 10000 | awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9*$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((S2-M*M*N)/(N-1))}'

n=5081, mean=345.141, stdev=70.5826


In both approaches the mean look the same, but stdev is different.

Which stdev is correct?

Mic

masurca Assembly paired-end sequencing • 1.7k views
0
Entering edit mode

I think the second way was trying to use Welford’s method
http://www.johndcook.com/blog/standard_deviation/

in such case it should be

sqrt ((N*S2-S*S)/N(N-1))

can you try it?

0
Entering edit mode

I changed it to awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9*$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((N*S2-S*S)/N(N-1))}', but I got awk: (FILENAME=- FNR=10000) fatal: functionN' not defined.

0
Entering edit mode

Just put quotes around it

sqrt (((N*S2)-(S*S))/(N(N-1)))

0
Entering edit mode

It did not work. Could you please provide the full command.

0
Entering edit mode

samtools view OC.bam | head -n 10000 | awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt(((NS2)-(S*S))/(N(N-1)))}'

0
Entering edit mode

I still get awk: (FILENAME=- FNR=200000) fatal: functionN' not defined with the above command.

0
Entering edit mode

Is this RNA or DNA? If its RNA, you are going to be thrown off by splicing - where reads are in different exons you'll get a much bigger insert size.

0
Entering edit mode

You often have read pairs with a very large insert size that bias the mean. Thus, you may want to consider using median and median absolute deviation - median() and mad() in R.

0
Entering edit mode

You are only sampling 10000 lines when there may be millions in this file.

Take a look at Qualimap. It can provide all sorts of addtional information about your bam files.

If you are able to go back to original R1/R2 files then you can use BBMerge from BBMap to estimate the insert sizes like described here.

0
Entering edit mode

BBMerge provides only Avg Insert and not mean and standard deviation is different to CollectInsertSizeMetrics.

0
Entering edit mode
5.2 years ago

picard's CollectInstertSizeMetrics should calculate these.

0
Entering edit mode

java -jar picard.jar CollectInsertSizeMetrics I=OC.sam O=insert_size_metrics.txt H=insert_size_histogram.pdf has given me for STANDARD_DEVIATION = 70.562457 and MEAN_INSERT_SIZE = 344.804478