Hi, I'm doing variant calling from my RNAseq data on a cell line (Paired End 2X100b, Illumina) I'm following GATK workflows and everything is working almost fine since I can find all the expected mutations already characterized in my cell line. The problem is that I see some additional mutations that are clearly the result of an alignment error (e.g. NM_001364837:exon8:c.727+1G>A, rs781065280) I'm using the latest release of STAR aligner 2.7.6a to perform the alignment with the following options
STAR --runThreadN 8 --genomeDir $GENOME_DIR \ --readFilesIn ... --readFilesCommand gunzip -c \ --outSAMattrRGline ... \ --outSAMtype BAM SortedByCoordinate \ --twopassMode Basic \ --outSAMmapqUnique 60 \ --outFilterType BySJout --alignIntronMin 20 \ --alignIntronMax 1000000 --alignMatesGapMax 1000000 \ --alignSJoverhangMin 8 --alignSJDBoverhangMin 3 --scoreGenomicLengthLog2scale 0 \ --outFileNamePrefix $OUTPUT_DIR/...
The problem resides in STAR mapping some reads with a mismatching overhang of 4 bp at beginning of the intron instead of picking a perfect match at the beginning of the next exon (see picture attached).
As you can see,
--alignSJDBoverhangMin is set to 3 and the length of the overhang is 4bp.
Also the splice junction should be canonical so a spliced mapping should carry no penalty.
Anybody has an explanation for that?