Question: 6-bp deletions in RNA-seq data
gravatar for Thomas
11 days ago by
Columbia University
Thomas20 wrote:

Hi Everyone,

I am attempting to analyze RNA-seq data starting from fastq files. I have used HISAT2 to align the raw fastq files with default settings. No hard clipping or other modifications. The reference genome is mus musculus GRCm38. I then used samtools to convert SAM files to BAM files and name-sort them. Finally, I ran QoRTs for quality control on these files, which is where I came across my actual problem: There is a worrisome rate of deletions of exactly 6 bp in most of my 16 samples. For example:

D   1   50475
D   2   19588
D   3   2965
D   4   1047
D   5   1562
D   6   6769
D   7   94
D   8   169
H   1   0
I   1   63991
I   2   5913
I   3   1668
I   4   579
I   5   300
I   6   153
I   7   59
I   8   19
M   1   75615
M   2   97208
M   3   112564

, where D in the OP column is a deletion, LEN the length of the deletion and CT is count per million reads. Notice the spike at a length of 6 bp.

Alteratively, you can see the full plot here.

I thought this might be related to the adaptor sequence, but I don't understand why this would show up as a deletion.

Does anyone know what this is? And do I need to worry about it? I would appreciate any feedback.

Thank you!


ADD COMMENTlink modified 6 days ago • written 11 days ago by Thomas20

Have you considered the possibility that your specific strain could be different in those locations compared to the reference.

ADD REPLYlink written 11 days ago by genomax46k

I have thought about that possibility, but 1) that would have to be an incredibly highly expressed gene, to make up such a huge proportion of total reads, and 2) the company that performed the original sequencing used hard trimming of the fastq files before alignment with TopHat. In their BAM files, this spike does not seem to be present.

This makes me think it might be a more systemic problem, but not sure...

ADD REPLYlink written 11 days ago by Thomas20

Can you capture the deletion containing reads and visualize them in a a genome browser, to see the location? If its adapter related, it should be at the ends. In which case you can safely ignore. Would be really curious to know what you find :-)

ADD REPLYlink written 11 days ago by nancy30

Hm, I like that idea! Do you know how I can do this?

ADD REPLYlink written 10 days ago by Thomas20
gravatar for nancy
8 days ago by
nancy30 wrote:

Hi Thomas, I don't exactly know how, but can point you in the direction:

Once you align, you can extract deletion containing reads by parsing the CIGAR string in the bam file (

Not sure you can specifically get at 6 bp long deletions, or just all reads with deletions.

The new bam file (containing only deletions) can be visualized using IGV (

Sorry my input is rather sketchy and you will need to fill in the blanks.


ADD COMMENTlink written 8 days ago by nancy30

Hi Nancy,

This is an excellent starting point. I'll see if I can figure it out.

Thank you!


ADD REPLYlink written 7 days ago by Thomas20

"In short, the peak in 6-bp deletions is due to repetitive elements in the genome, and a single gene accounts for most of it. I am no longer worried. Thank you all for the input!!"

Wonderful. Thanks so much for updating and sharing your code!

p.s.: As an aside, it might be worth doing an overall check for rRNA contamination in your data.

ADD REPLYlink modified 5 days ago • written 5 days ago by nancy30
gravatar for Thomas
6 days ago by
Columbia University
Thomas20 wrote:

Alright, I have followed Nancy's advice and extracted the read sequences that contained the 6-bp deletions. It turns out that the vast majority of them (ca. 80% in samples with lower number of 6-bp deletions, 99% in the sample with the highest number of 6-bp deletions) mapped to a specific region of a single gene encoding ribosomal RNA on chromosome 17, named Rn45s. The region is flagged by RepeatMasker. The remaining fragments with 6-bp deletions had long stretches of repetitive sequence.

In short, the peak in 6-bp deletions is due to repetitive elements in the genome, and a single gene accounts for most of it. I am no longer worried. Thank you all for the input!!

In case anyone has a similar problem, this is how you can extract the information from a BAM file and print it into a tab-delimited text file (almost entirely based on this script):

#install Rsamtools

#load library

#read in entire BAM file
bam_data <- scanBam("file.bam")

#test: names of the BAM fields  
#should return this: [1] "qname"  "flag"   "rname"  "strand" "pos"    "qwidth" "mapq"   "cigar"
#[9] "mrnm"   "mpos"   "isize"  "seq"    "qual"

#create function for collapsing the list of lists into a single list
#as per the Rsamtools vignette
.unlist <- function(x){
      ##, ...) coerces factor to integer, which is undesired
      x1 <- x[[1L]]
          structure(unlist(x), class = "factor", levels = levels(x1))
      } else {, x)

#note that function .unlist will not appear in Functions list by default, as beginning with .

#store names of BAM fields
bam_field <- names(bam_data[[1]])

#go through each BAM field and unlist
list <- lapply(bam_field, function(y) .unlist(lapply(bam_data, "[[", y)))

#store as data frame (df)
bam_data_df <-"DataFrame", list)
names(bam_data_df) <- bam_field

#test: display dimensions of dataframe

#To identify all entries that contain "6D" in cigar column and export those to tab-delimited file:
write.table(bam_data_df[grep("6D", bam_data_df$cigar), ], "bam_6Donly.tsv", sep="\t")
ADD COMMENTlink modified 6 days ago • written 6 days ago by Thomas20

You can accept your own and @nancy's answer as correct (green check mark) to provide closure to this thread.

ADD REPLYlink written 6 days ago by genomax46k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1734 users visited in the last hour