I'm new to the community and will start with my first post right away! :)
So, where I work we have a NGS workflow using Illumina sequencers and because of the kits we use we are looking at read lengths of 150-151 . Recently, when my reference-based variant calling pipeline was about to insert into SQL database, it threw a truncation error. After some digging, I found out that when my script was trying to upload the softclipping records (used for troubleshooting sometimes), it was exceeding the SQL field length which was set at 100. Then I looked at the SAM file in IGV and saw a read at the beginning of the genome and the cigar string was: 122S29M. So, 151 read length, 122 softclipped and a matching part of 29 nucleotides. I was really doubtful if this read should have stayed in the mapping as it has 122/151 of its nucleotides trimmed away.
I run bowtie2 version 2.2.2 as the mapper in the pipeline and I use --local setting. Now I want to know why this read is still in my mapping and I looked at the Bowtie2 manual.
The Bowtie2 manual says this on reads that pass score threshold:
Valid alignments meet or exceed the minimum score threshold For an alignment to be considered “valid” (i.e. “good enough”) by Bowtie 2, it must have an alignment score no less than the minimum score threshold. The threshold is configurable and is expressed as a function of the read length. In end-to-end alignment mode, the default minimum score threshold is -0.6 + -0.6 * L, where L is the read length. In local alignment mode, the default minimum score threshold is 20 + 8.0 * ln(L), where L is the read length. This can be configured with the --score-min option. For details on how to set options like --score-min that correspond to functions, see the section on setting function options.
With the example above, the read length of 151 (I checked fastqc) results in the following minimum score threshold: 20 + 8.0 * ln(151) = 60.14 score threshold
Bowtie2 --local setting:
--local In this mode, Bowtie 2 does not require that the entire read align from one end to the other. Rather, some characters may be omitted (“soft clipped”) from the ends in order to achieve the greatest possible alignment score. The match bonus --ma is used in this mode, and the best possible alignment score is equal to the match bonus (--ma) times the length of the read. Specifying --local and one of the presets (e.g. --local --very-fast) is equivalent to specifying the local version of the preset (--very-fast-local). This is mutually exclusive with --end-to-end. --end-to-end is the default mode.
So, lets look at match bonus –ma:
--ma <int> Sets the match bonus. In --local mode <int> is added to the alignment score for each position where a read character aligns to a reference character and the characters match. Not used in --end-to-end mode. Default: 2.
I don't specify the int, so I use default 2. The part of the cigar that says 29M then results in score 29*2 = 58! This is below 60.14..
Does anyone know what is going on here or did I make a wrong assumption along the way?
Thanks for any thoughts and help,