Question

6-bp deletions in RNA-seq data

0

Entering edit mode

6.0 years ago

Thomas ▴ 30

Hi Everyone,

I am attempting to analyze RNA-seq data starting from fastq files. I have used HISAT2 to align the raw fastq files with default settings. No hard clipping or other modifications. The reference genome is mus musculus GRCm38. I then used samtools to convert SAM files to BAM files and name-sort them. Finally, I ran QoRTs for quality control on these files, which is where I came across my actual problem: There is a worrisome rate of deletions of exactly 6 bp in most of my 16 samples. For example:

OP  LEN CT
D   1   50475
D   2   19588
D   3   2965
D   4   1047
D   5   1562
D   6   6769
D   7   94
D   8   169
H   1   0
I   1   63991
I   2   5913
I   3   1668
I   4   579
I   5   300
I   6   153
I   7   59
I   8   19
M   1   75615
M   2   97208
M   3   112564

, where D in the OP column is a deletion, LEN the length of the deletion and CT is count per million reads. Notice the spike at a length of 6 bp.

Alteratively, you can see the full plot here.

I thought this might be related to the adaptor sequence, but I don't understand why this would show up as a deletion.

Does anyone know what this is? And do I need to worry about it? I would appreciate any feedback.

Thank you!

Thomas

RNA-Seq genome alignment sequence • 1.4k views

ADD COMMENT • link 6.0 years ago by Thomas ▴ 30

0

Entering edit mode

Have you considered the possibility that your specific strain could be different in those locations compared to the reference.

ADD REPLY • link 6.0 years ago by GenoMax 141k

0

Entering edit mode

I have thought about that possibility, but 1) that would have to be an incredibly highly expressed gene, to make up such a huge proportion of total reads, and 2) the company that performed the original sequencing used hard trimming of the fastq files before alignment with TopHat. In their BAM files, this spike does not seem to be present.

This makes me think it might be a more systemic problem, but not sure...

ADD REPLY • link 6.0 years ago by Thomas ▴ 30

0

Entering edit mode

Can you capture the deletion containing reads and visualize them in a a genome browser, to see the location? If its adapter related, it should be at the ends. In which case you can safely ignore. Would be really curious to know what you find :-)

ADD REPLY • link 6.0 years ago by nancy ▴ 90

0

Entering edit mode

Hm, I like that idea! Do you know how I can do this?

ADD REPLY • link 6.0 years ago by Thomas ▴ 30

score 2 · Accepted Answer · 2018-04-16

2

Entering edit mode

6.0 years ago

nancy ▴ 90

Hi Thomas, I don't exactly know how, but can point you in the direction:

Once you align, you can extract deletion containing reads by parsing the CIGAR string in the bam file (http://felixfan.github.io/bam-sam/)

Not sure you can specifically get at 6 bp long deletions, or just all reads with deletions.

The new bam file (containing only deletions) can be visualized using IGV (http://software.broadinstitute.org/software/igv/).

Sorry my input is rather sketchy and you will need to fill in the blanks.

Best

ADD COMMENT • link 6.0 years ago by nancy ▴ 90

0

Entering edit mode

Hi Nancy,

This is an excellent starting point. I'll see if I can figure it out.

Thank you!

Thomas

ADD REPLY • link 6.0 years ago by Thomas ▴ 30

0

Entering edit mode

"In short, the peak in 6-bp deletions is due to repetitive elements in the genome, and a single gene accounts for most of it. I am no longer worried. Thank you all for the input!!"

Wonderful. Thanks so much for updating and sharing your code!

p.s.: As an aside, it might be worth doing an overall check for rRNA contamination in your data.

ADD REPLY • link 6.0 years ago by nancy ▴ 90

score 2 · Accepted Answer · 2018-04-17

Alright, I have followed Nancy's advice and extracted the read sequences that contained the 6-bp deletions. It turns out that the vast majority of them (ca. 80% in samples with lower number of 6-bp deletions, 99% in the sample with the highest number of 6-bp deletions) mapped to a specific region of a single gene encoding ribosomal RNA on chromosome 17, named Rn45s. The region is flagged by RepeatMasker. The remaining fragments with 6-bp deletions had long stretches of repetitive sequence.

In short, the peak in 6-bp deletions is due to repetitive elements in the genome, and a single gene accounts for most of it. I am no longer worried. Thank you all for the input!!

In case anyone has a similar problem, this is how you can extract the information from a BAM file and print it into a tab-delimited text file (almost entirely based on this script):

#install Rsamtools
source("http://bioconductor.org/biocLite.R")
biocLite("Rsamtools")

#load library
library(Rsamtools)

#read in entire BAM file
bam_data <- scanBam("file.bam")

#test: names of the BAM fields  
names(bam_data[[1]])
#should return this: [1] "qname"  "flag"   "rname"  "strand" "pos"    "qwidth" "mapq"   "cigar"
#[9] "mrnm"   "mpos"   "isize"  "seq"    "qual"

#create function for collapsing the list of lists into a single list
#as per the Rsamtools vignette
.unlist <- function(x){
      ## do.call(c, ...) coerces factor to integer, which is undesired
      x1 <- x[[1L]]
      if(is.factor(x1)){
          structure(unlist(x), class = "factor", levels = levels(x1))
      } else {
         do.call(c, x)
      }
    }

#note that function .unlist will not appear in Functions list by default, as beginning with .

#store names of BAM fields
bam_field <- names(bam_data[[1]])

#go through each BAM field and unlist
list <- lapply(bam_field, function(y) .unlist(lapply(bam_data, "[[", y)))

#store as data frame (df)
bam_data_df <- do.call("DataFrame", list)
names(bam_data_df) <- bam_field

#test: display dimensions of dataframe
dim(bam_data_df)

#To identify all entries that contain "6D" in cigar column and export those to tab-delimited file:
write.table(bam_data_df[grep("6D", bam_data_df$cigar), ], "bam_6Donly.tsv", sep="\t")