Hi Everyone,
I am attempting to analyze RNA-seq data starting from fastq files. I have used HISAT2 to align the raw fastq files with default settings. No hard clipping or other modifications. The reference genome is mus musculus GRCm38. I then used samtools to convert SAM files to BAM files and name-sort them. Finally, I ran QoRTs for quality control on these files, which is where I came across my actual problem: There is a worrisome rate of deletions of exactly 6 bp in most of my 16 samples. For example:
OP LEN CT
D 1 50475
D 2 19588
D 3 2965
D 4 1047
D 5 1562
D 6 6769
D 7 94
D 8 169
H 1 0
I 1 63991
I 2 5913
I 3 1668
I 4 579
I 5 300
I 6 153
I 7 59
I 8 19
M 1 75615
M 2 97208
M 3 112564
, where D in the OP column is a deletion, LEN the length of the deletion and CT is count per million reads. Notice the spike at a length of 6 bp.
Alteratively, you can see the full plot here.
I thought this might be related to the adaptor sequence, but I don't understand why this would show up as a deletion.
Does anyone know what this is? And do I need to worry about it? I would appreciate any feedback.
Thank you!
Thomas
Have you considered the possibility that your specific strain could be different in those locations compared to the reference.
I have thought about that possibility, but 1) that would have to be an incredibly highly expressed gene, to make up such a huge proportion of total reads, and 2) the company that performed the original sequencing used hard trimming of the fastq files before alignment with TopHat. In their BAM files, this spike does not seem to be present.
This makes me think it might be a more systemic problem, but not sure...
Can you capture the deletion containing reads and visualize them in a a genome browser, to see the location? If its adapter related, it should be at the ends. In which case you can safely ignore. Would be really curious to know what you find :-)
Hm, I like that idea! Do you know how I can do this?