Question

Difference between samtools stats insert size and TLEN in SAM/BAM file.

0

Entering edit mode

14 months ago

krzysm20009042 • 0

Hi, I have stumbled upon certain discrepancy between insert sizes obtained by two different ways from my data. To keep the terminology consistent I will support myself with a image found in other thread.

fragment length and insert size:

image: fragment length and insert size

To give some more perspective I am working with cfDNA sequencing data obtained through gene panels. My goal is to do insert size distribution plot. To do so I have obtained insert sizes by two different approaches:

Manual - due to lack of better naming I call it manual, since it is more hands-on than other.
Samtools stats

Manual approach is based insert sizes being a sum of clipped bases and TLEN values. So insert size (IS) here supposedly is sum of values next to 'S' and 'H' in CIGAR sequence and TLEN. (Eg. CIGAR -> 8S19M2S10=3M; TLEN -> -159 ==> IS -> 169). Going into details my bash script for extraction contains:

samtools view -f 67 ${input_directory}/${id}.bam | cut -f6,9 > ${output_directory}/${id}.extract.txt

I am using flag67 because I am dealing with paired-end sequencing data and there is no need for a second read since both mates ought to have same absolute value of TLEN.

In a separate script I am adding up amount of soft clipped pairs to the TLEN value (although I've now realised that I should also add soft clipped pairs from both mates).

In the samtools stats approach I have taken advantage of "Insert Size" section, one doesn't need further explanation if it looks at samtools documentation.

I have plotted both approaches and visibly there is unexpected spike of amount of insert sizes for manual approach in vicinity of 145 bp. More than that both data sets differ in amounts of specific insert sizes.

Insert size distribution:

image: Insert size distribution

fragment lengths:

image: fragment lengths

How Insert size is defined in samtools stats? Why is there a spike around 145 bp in manual approach? Does it have something to do with read lengths? Is my manual approach even so slightly correct?

samtools insert-size SAM • 794 views

ADD COMMENT • link updated 14 months ago by Ram 45k • written 14 months ago by krzysm20009042 • 0

0

Entering edit mode

there is unexpected spike of amount of insert sizes for manual approach in vicinity of 145 bp

Is this data scanned and trimmed to remove adapter sequences. That sounds suspiciously about the size of primer dimers.

ADD REPLY • link 14 months ago by GenoMax 153k