4.8 years ago by
Estimating insert size the easy way:
1) Align the paired-end data in a combined way against the reference genome. By combined, I mean you should provide both the fastq files to the aligner.
2) Extract information from the ninth column of the SAM file (TLEN). To be more accurate, you can only select the number in the ninth column of those paired-end read pairs that are uniquely aligned against the genome.
3) Calculate the mean of the distribution of TLEN numbers generated in step 2.
4) Subtract length of a read (For example 75 bp or 100 bp) from the mean to get insert size.
For aligners like BWA, you need not to give the insert size. It does it for you automatically. See below paragraph from BWA manual to know how it does it:
Estimating Insert Size Distribution
BWA estimates the insert size distribution per 256*1024 read pairs. It first collects pairs of reads with both ends mapped with a single-end quality 20 or higher and then calculates median (Q2), lower and higher quartile (Q1 and Q3). It estimates the mean and the variance of the insert size distribution from pairs whose insert sizes are within interval [Q1-2(Q3-Q1), Q3+2(Q3-Q1)]. The maximum distance x for a pair considered to be properly paired (SAM flag 0x2) is calculated by solving equation Phi((x-mu)/sigma)=x/L*p0, where mu is the mean, sigma is the standard error of the insert size distribution, L is the length of the genome, p0 is prior of anomalous pair and Phi() is the standard cumulative distribution function. For mapping Illumina short-insert reads to the human genome, x is about 6-7 sigma away from the mean. Quartiles, mean, variance and x will be printed to the standard error output.
Estimating insert size your way or hard way:
1) Sort both the SAM files using queryname.
2) Remove secondary alignments for a read so that read order and read indexes become same in both the SAM files.
3) Now for every read, calculate absolute difference in their mapping position (Ignore reads that are mapped to different chromosomes
4) Take the mean of the distribution and subtract the read length.