Question: How to extract sense and antisense read counts from mapping file (.bam) for RNA-seq data set
0
gravatar for M K
4.8 years ago by
M K460
United States
M K460 wrote:

Hi All,

I mapped my specific strand RNA-seq data to human genome hg19, and I want to extract sense and antisense read counts from mapping file (.bam) . Can any one help me to do that.

rna-seq next-gen R • 3.9k views
ADD COMMENTlink modified 4.8 years ago by Devon Ryan91k • written 4.8 years ago by M K460
2
gravatar for Devon Ryan
4.8 years ago by
Devon Ryan91k
Freiburg, Germany
Devon Ryan91k wrote:

Use either featureCounts or htseq-count. Running them twice on the same dataset with the strand setting reversed (e.g., -s yes and then -s reverse) will produce what you want.

ADD COMMENTlink written 4.8 years ago by Devon Ryan91k

so how can I compare the results with the annotation file GTF

ADD REPLYlink written 4.8 years ago by M K460

Huh? Why would you need to and what would that even mean?

ADD REPLYlink written 4.8 years ago by Devon Ryan91k

I want to detect antisense using  R package called NASTI-seq and they using modified GTF file to compare the results with the mapped file .bam to count reads 

ADD REPLYlink written 4.8 years ago by M K460
1

Always mention in the beginning when you intend to use a particular package, especially when it's an obscure one.

If you look at the NASTIseq documentation, you'll see that the input is a list with the following 4 items:

  1. smat: A matrix of sense counts (these counts can be produced via my answer above).
  2. asmat: A matrix of antisense counts (these counts can be produced via my answer above).
  3. genepos: Your GTF file, presumably as a GRanges object or maybe a data.frame, but they don't specify this. It appears that they just need gene bounds (this is actually poorly thought out, but doable with GenomicFeatures).
  4. pospairs: A 2-column data.frame with known sense-antisense pairs. The values should match row.names from genepos.
ADD REPLYlink written 4.8 years ago by Devon Ryan91k

Thanks Ryan for helping me, and I am working on featureCounts to get sense and antisense matrix, but I didn't understand how they get genepos file so do you have any idea about that. 

ADD REPLYlink written 4.8 years ago by M K460

The documentation for that package is absolutely terrible (please feel encouraged to contact the author and complain...when a paper over an R/other software package is in review, we really need to include the documentation/tutorial in the review process), so I'm surprised you're stumped. It turns out that you have to look through the source code to see what's actually used from genepos and what things need to be named as.

Assuming you have a reasonably well formatted GTF for your organism (i.e., not from UCSC), then something like the following R code should work to create genepos.

library(rtracklayer)
gtf <- import.gff("Mus_musculus.GRCm38.71.gtf", format="gtf")

stupidNASTIseqFunction <- function(gtf) {
    #compact each gene to a single entry
    grl <- split(gtf, gtf$gene_id)
    #This won't work on the weird GTF files from UCSC
    res <- lapply(grl, function(x) {
        start(x)[1] = min(start(x))
        end(x)[1] = max(end(x))
        x[1,]
    })
    #format the output data frame
    df <- data.frame(
        seqname = factor(sapply(res, function(x) as.character(seqnames(x)))),
        start = sapply(res, start),
        end = sapply(res, end),
        strand = factor(sapply(res, function(x) as.character(strand(x)))),
        attributes=sapply(res, function(x) x$gene_id))
    #reorder things, since grl is ordered be gene_id
    o <- match(unique(gtf$gene_id), df$attribute)
    df[o,]
}

genepos <- stupidNASTIseqFunction(gtf)

You'll need to change the import.gff(...) part to match your own GTF file. For what it's worth, this method is rather slow with large GTF files, but you'll only need to do it once.

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by Devon Ryan91k

Dear Devon, I am trying to extract sense and antisense read percentage for globin genes. I used htseq-count to obtain stats. I have posted a question on comparing strand setting as yes, no and reverse How to interpret the difference among these three options in strandedness from HTSeq-count. I need some help on interpreting the results can you help.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by nalandaatmi60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 577 users visited in the last hour