Difference between samtools stats insert size and TLEN in SAM/BAM file.
0
0
Entering edit mode
14 months ago

Hi, I have stumbled upon certain discrepancy between insert sizes obtained by two different ways from my data. To keep the terminology consistent I will support myself with a image found in other thread.

fragment length and insert size:

image: fragment length and insert size

To give some more perspective I am working with cfDNA sequencing data obtained through gene panels. My goal is to do insert size distribution plot. To do so I have obtained insert sizes by two different approaches:

  1. Manual - due to lack of better naming I call it manual, since it is more hands-on than other.
  2. Samtools stats

Manual approach is based insert sizes being a sum of clipped bases and TLEN values. So insert size (IS) here supposedly is sum of values next to 'S' and 'H' in CIGAR sequence and TLEN. (Eg. CIGAR -> 8S19M2S10=3M; TLEN -> -159 ==> IS -> 169). Going into details my bash script for extraction contains:

samtools view -f 67 ${input_directory}/${id}.bam | cut -f6,9 > ${output_directory}/${id}.extract.txt

I am using flag67 because I am dealing with paired-end sequencing data and there is no need for a second read since both mates ought to have same absolute value of TLEN.

In a separate script I am adding up amount of soft clipped pairs to the TLEN value (although I've now realised that I should also add soft clipped pairs from both mates).

In the samtools stats approach I have taken advantage of "Insert Size" section, one doesn't need further explanation if it looks at samtools documentation.

I have plotted both approaches and visibly there is unexpected spike of amount of insert sizes for manual approach in vicinity of 145 bp. More than that both data sets differ in amounts of specific insert sizes.

Insert size distribution:

image: Insert size distribution

fragment lengths:

image: fragment lengths

How Insert size is defined in samtools stats? Why is there a spike around 145 bp in manual approach? Does it have something to do with read lengths? Is my manual approach even so slightly correct?

samtools insert-size SAM • 794 views
ADD COMMENT
0
Entering edit mode

there is unexpected spike of amount of insert sizes for manual approach in vicinity of 145 bp

Is this data scanned and trimmed to remove adapter sequences. That sounds suspiciously about the size of primer dimers.

ADD REPLY

Login before adding your answer.

Traffic: 3792 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6