Estimate stdev and mean in paired-end
1
0
Entering edit mode
5.2 years ago
Ric ▴ 360

Hi, I found two following approaches how to estimate stdev and mean in paired-end:

First approach:

samtools view OC.bam | head -n 10000 | cut -f9 > initial.insertsizes.txt

a = read.table("initial.insertsizes.txt")
a.v = a[a[,1]>0,1]
mn = quantile(a.v, seq(0,1,0.05))[4]
mx = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx])       # mean
[1] 345.9762
sd(a.v[a.v >= mn & a.v <= mx])
[1] 39.57006

Second approach:

samtools view OC.bam | head -n 10000 | awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9*$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((S2-M*M*N)/(N-1))}' 

n=5081, mean=345.141, stdev=70.5826

In both approaches the mean look the same, but stdev is different.

Which stdev is correct?

Thank you in advance.

Mic

masurca Assembly paired-end sequencing • 1.7k views
ADD COMMENT
0
Entering edit mode

I think the second way was trying to use Welford’s method
http://www.johndcook.com/blog/standard_deviation/

in such case it should be

sqrt ((N*S2-S*S)/N(N-1))

can you try it?

ADD REPLY
0
Entering edit mode

I changed it to awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9*$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((N*S2-S*S)/N(N-1))}', but I got awk: (FILENAME=- FNR=10000) fatal: functionN' not defined`.

ADD REPLY
0
Entering edit mode

Just put quotes around it

sqrt (((N*S2)-(S*S))/(N(N-1)))
ADD REPLY
0
Entering edit mode

It did not work. Could you please provide the full command.

ADD REPLY
0
Entering edit mode

samtools view OC.bam | head -n 10000 | awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt(((NS2)-(S*S))/(N(N-1)))}'

ADD REPLY
0
Entering edit mode

I still get awk: (FILENAME=- FNR=200000) fatal: functionN' not defined` with the above command.

ADD REPLY
0
Entering edit mode

Is this RNA or DNA? If its RNA, you are going to be thrown off by splicing - where reads are in different exons you'll get a much bigger insert size.

ADD REPLY
0
Entering edit mode

You often have read pairs with a very large insert size that bias the mean. Thus, you may want to consider using median and median absolute deviation - median() and mad() in R.

ADD REPLY
0
Entering edit mode

You are only sampling 10000 lines when there may be millions in this file.

Take a look at Qualimap. It can provide all sorts of addtional information about your bam files.

If you are able to go back to original R1/R2 files then you can use BBMerge from BBMap to estimate the insert sizes like described here.

ADD REPLY
0
Entering edit mode

BBMerge provides only Avg Insert and not mean and standard deviation is different to CollectInsertSizeMetrics.

ADD REPLY
0
Entering edit mode
5.2 years ago

picard's CollectInstertSizeMetrics should calculate these.

ADD COMMENT
0
Entering edit mode

java -jar picard.jar CollectInsertSizeMetrics I=OC.sam O=insert_size_metrics.txt H=insert_size_histogram.pdf has given me for STANDARD_DEVIATION = 70.562457 and MEAN_INSERT_SIZE = 344.804478

ADD REPLY

Login before adding your answer.

Traffic: 2812 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6