Hi, I have stumbled upon certain discrepancy between insert sizes obtained by two different ways from my data. To keep the terminology consistent I will support myself with a image found in other thread.
fragment length and insert size:
To give some more perspective I am working with cfDNA sequencing data obtained through gene panels. My goal is to do insert size distribution plot. To do so I have obtained insert sizes by two different approaches:
- Manual - due to lack of better naming I call it manual, since it is more hands-on than other.
- Samtools stats
Manual approach is based insert sizes being a sum of clipped bases and TLEN values. So insert size (IS) here supposedly is sum of values next to 'S' and 'H' in CIGAR sequence and TLEN. (Eg. CIGAR -> 8S19M2S10=3M; TLEN -> -159 ==> IS -> 169). Going into details my bash script for extraction contains:
samtools view -f 67 ${input_directory}/${id}.bam | cut -f6,9 > ${output_directory}/${id}.extract.txt
I am using flag67 because I am dealing with paired-end sequencing data and there is no need for a second read since both mates ought to have same absolute value of TLEN.
In a separate script I am adding up amount of soft clipped pairs to the TLEN value (although I've now realised that I should also add soft clipped pairs from both mates).
In the samtools stats approach I have taken advantage of "Insert Size" section, one doesn't need further explanation if it looks at samtools documentation.
I have plotted both approaches and visibly there is unexpected spike of amount of insert sizes for manual approach in vicinity of 145 bp. More than that both data sets differ in amounts of specific insert sizes.
Insert size distribution:
fragment lengths:
How Insert size is defined in samtools stats? Why is there a spike around 145 bp in manual approach? Does it have something to do with read lengths? Is my manual approach even so slightly correct?
Is this data scanned and trimmed to remove adapter sequences. That sounds suspiciously about the size of primer dimers.