Question

How to extract sense and antisense read counts from mapping file (.bam) for RNA-seq data set

0

Entering edit mode

10.7 years ago

M K ▴ 660

Hi All,

I mapped my specific strand RNA-seq data to human genome hg19, and I want to extract sense and antisense read counts from mapping file (.bam) . Can any one help me to do that.

next-gen RNA-Seq R • 7.1k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by M K ▴ 660

Ram · Answer 1 · 2014-11-05

2

Entering edit mode

10.7 years ago

Devon Ryan 105k

Use either featureCounts or htseq-count. Running them twice on the same dataset with the strand setting reversed (e.g., -s yes and then -s reverse) will produce what you want.

ADD COMMENT • link 10.7 years ago by Devon Ryan 105k

0

Entering edit mode

so how can I compare the results with the annotation file GTF

ADD REPLY • link 10.7 years ago by M K ▴ 660

0

Entering edit mode

Huh? Why would you need to and what would that even mean?

ADD REPLY • link 10.7 years ago by Devon Ryan 105k

0

Entering edit mode

I want to detect antisense using R package called NASTI-seq and they using modified GTF file to compare the results with the mapped file .bam to count reads

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by M K ▴ 660

1

Entering edit mode

Always mention in the beginning when you intend to use a particular package, especially when it's an obscure one.

If you look at the NASTIseq documentation, you'll see that the input is a list with the following 4 items:

smat: A matrix of sense counts (these counts can be produced via my answer above).
asmat: A matrix of antisense counts (these counts can be produced via my answer above).
genepos: Your GTF file, presumably as a GRanges object or maybe a data.frame, but they don't specify this. It appears that they just need gene bounds (this is actually poorly thought out, but doable with GenomicFeatures).
pospairs: A 2-column data.frame with known sense-antisense pairs. The values should match row.names from genepos.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by Devon Ryan 105k

0

Entering edit mode

Thanks Ryan for helping me, and I am working on featureCounts to get sense and antisense matrix, but I didn't understand how they get genepos file so do you have any idea about that.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by M K ▴ 660

0

Entering edit mode

The documentation for that package is absolutely terrible (please feel encouraged to contact the author and complain...when a paper over an R/other software package is in review, we really need to include the documentation/tutorial in the review process), so I'm surprised you're stumped. It turns out that you have to look through the source code to see what's actually used from genepos and what things need to be named as.

Assuming you have a reasonably well formatted GTF for your organism (i.e., not from UCSC), then something like the following R code should work to create genepos.

library(rtracklayer)
gtf <- import.gff("Mus_musculus.GRCm38.71.gtf", format="gtf")

stupidNASTIseqFunction <- function(gtf) {
    #compact each gene to a single entry
    grl <- split(gtf, gtf$gene_id)
    #This won't work on the weird GTF files from UCSC
    res <- lapply(grl, function(x) {
        start(x)[1] = min(start(x))
        end(x)[1] = max(end(x))
        x[1,]
    })
    #format the output data frame
    df <- data.frame(
        seqname = factor(sapply(res, function(x) as.character(seqnames(x)))),
        start = sapply(res, start),
        end = sapply(res, end),
        strand = factor(sapply(res, function(x) as.character(strand(x)))),
        attributes=sapply(res, function(x) x$gene_id))
    #reorder things, since grl is ordered be gene_id
    o <- match(unique(gtf$gene_id), df$attribute)
    df[o,]
}

genepos <- stupidNASTIseqFunction(gtf)

You'll need to change the import.gff(...) part to match your own GTF file. For what it's worth, this method is rather slow with large GTF files, but you'll only need to do it once.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by Devon Ryan 105k

0

Entering edit mode

Dear Devon, I am trying to extract sense and antisense read percentage for globin genes. I used htseq-count to obtain stats. I have posted a question on comparing strand setting as yes, no and reverse How to interpret the difference among these three options in strandedness from HTSeq-count. I need some help on interpreting the results can you help.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 9.8 years ago by nalandaatmi ▴ 110