insert size in bam file do not make sense
Entering edit mode
12 months ago
ccagg ▴ 60

I have paired end 150bp high throughput methylation sequencing data. I have aligned the reads using bsbolt (which uses a modified version of BWA) and now have sorted bam files.

The data is from plasma cell-free DNA and it is of interest to calculate the size of the cfDNA being captured by the read pairs (like in this paper) in order to estimate the distribution of cfDNA fragment lengths (where the average cfDNA molecule is ~160bp). I've been trying to calculate this using the insert size statistic in the bam file, but the numbers do not make sense with definitions I've seen in other places on biostars. For example, this post says I must subtract the length of the read from insert size length I get from the bam file. However, the values I get from the command I am running are all less than 150bp, which is the size of my read, so these numbers do not seem to be consistent with previous definitions of insert size.

The exact command I have been running is

samtools view -f 67 -F 2304 test.bam | cut -f9 > test_insert_sizes.txt

The -f 67 parameter was to get the first in pair of a properly paired read and the -F 2304 is to eliminate reads that are secondary or supplementary alignments. I am using samtools 1.13

I have been reading the samtools documentation for sam files (2022 edition) and it seems like the definition of TLEN may have changed (and the change seems especially relevant in my case where I might expect overlapping of R1 and R2 due to the short fragment size of cell-free DNA).

My question is, then, what is the appropriate way to calculate the length of the DNA fragment that is being captured by a read pair? I would prefer to have a number for every read pair, and not summary statistics.


samtools cfDNA • 947 views
Entering edit mode
12 months ago

The insert size could be smaller than your fragment size if there is considerable adapter sequence in your reads (depends on your lib prep + everything else).

I'm not a huge fan of using the samtools flags for doing this kind of analysis. I think what you're doing is correct but that the TLEN is the insert length so I don't think subtraction is necessary. Just make sure you are only using properly paired/oriented reads.

tlen samtools

In this scenario, I would crunch the data manual (reduce the paired reads in R/python and just calculate the full alignment distance) as a sanity check.

Otherwise try a tried another trusted method that's more straightforward:


Login before adding your answer.

Traffic: 2990 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6