I am attempting to analyze RNA-seq data starting from fastq files. I have used HISAT2 to align the raw fastq files with default settings. No hard clipping or other modifications. The reference genome is mus musculus GRCm38. I then used samtools to convert SAM files to BAM files and name-sort them. Finally, I ran QoRTs for quality control on these files, which is where I came across my actual problem: There is a worrisome rate of deletions of exactly 6 bp in most of my 16 samples. For example:
OP LEN CT D 1 50475 D 2 19588 D 3 2965 D 4 1047 D 5 1562 D 6 6769 D 7 94 D 8 169 H 1 0 I 1 63991 I 2 5913 I 3 1668 I 4 579 I 5 300 I 6 153 I 7 59 I 8 19 M 1 75615 M 2 97208 M 3 112564
, where D in the OP column is a deletion, LEN the length of the deletion and CT is count per million reads. Notice the spike at a length of 6 bp.
Alteratively, you can see the full plot here.
I thought this might be related to the adaptor sequence, but I don't understand why this would show up as a deletion.
Does anyone know what this is? And do I need to worry about it? I would appreciate any feedback.