Question: Issues with strandedness, flags, and generating a count matrix using R. Please help!
1
gravatar for elizabeth.eckhardt91
3.4 years ago by
elizabeth.eckhardt910 wrote:

Hi all,

I'm new to RNA-seq analysis, and I'm struggling to solve a problem.

I have 8 samples: 2 treatments with 4 reps each, so I'm just trying to do a simple differential gene expression comparison. I have strand-specifc, 100bp paired end read libraries for all these, and I'm using R to generate a count matrix with the function summarizeOverlaps and using the following code: 

for(i in 1:length(my.bam.files)){
  for(j in 1:13){
    print(j)
    chr.param<-ScanBamParam(which=itag2.4.gr[j], flag=my.flag, what=c("mapq", "flag"))
    bam.ga <- readGAlignments(my.bam.files[i], param=chr.param)
    good.reads <- which(mcols(bam.ga)$flag %in% good.flags & mcols(bam.ga)$mapq >= mapq.cutoff)
    bam.ga <- bam.ga[good.reads]
    my.count <- summarizeOverlaps(ebg, bam.ga, mode="Union", ignore.strand=F)
    count.mat[, i] <- count.mat[, i] + assays(my.count)$counts
    rm(bam.ga, my.count)
    gc()
  }

Now in my understanding, a stringent analysis would just keep 1 read from each aligned pair. I tried to do this by keeping either reads with flags 83 and 99 (proper, mapped pairs, R1 reads) or reads with flags 147, 163 (proper mapped pairs, R2 reads). As I understood it, I should generate the exact same count matrix regardless of which pair I keep, as R1 and R2 reads should map to the same gene. However, this is not the case, as keeping R2 reads generates a matrix with a count matrix that looks fairly normal (about a third with no expression, and then a mix of low, medium, and high count numbers). When I keep R1 reads, the count numbers are very low and very few genes have any counts at all. Am I doing something wrong? Is my understanding of what each read corresponds to incorrect?

I've heard from other researchers that it's ok to just keep both reads and do the analysis on that, but I'm not confident now that that will give me a robust answer since I'm confused about whether or not the count matrix generated is actually accurate.

Any help or suggestions appreciated!

 

 

rna-seq • 986 views
ADD COMMENTlink written 3.4 years ago by elizabeth.eckhardt910
the best thing you could do is to use software that was developed for this purpose. see htseq count for example. it will deal with strandedness and read pairs in the proper way.
ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by Martombo2.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1840 users visited in the last hour