Calculation of read counts in RNA-seq
1
1
Entering edit mode
5.6 years ago
statfa ▴ 650

Hi everybody,

I'm reading the steps of generation of read counts in an RNA-seq study which are as follows:

"We used Casava 1.8.2 (Illumina) to demultiplex and process the raw reads into fastq files.

RNA-seq reads in FASTQ files were aligned to hg19 using Tophat 2.0.0

Reads were assigned to annotated genes and counted using the "summarizeOverlaps" function from the GenomicRanges Bioconductor package in "IntersectionNotEmpty" mode (TODO cite GenomicRanges: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003118 ).

Overlaps were computed to features of type "exon" with a Parent feature has type "mRNA" (excluding "ncRNA" in particular), grouped by gene ID (the Parent property of the mRNA feature to which each exon belongs)."

I can understand the first three steps, but I have a hard time understanding why overlaps were computed to features of type "exon"...

What is the last step?

Aren't the first three steps enough to calculate read counts?

Thanks

read counts RNA-seq • 2.5k views
2
Entering edit mode

I think the last step is just explaining in more detail the kind of summarization done with summarizeOverlaps(). Take a look at the summarizeOverlap vignette in the GenomicAlignments package.

2
Entering edit mode

Indeed, the last paragraph seems to discuss which features were selected to perform counting on: overlaps on exons from mRNA but not ncRNA and grouped by gene ID (gene name)

0
Entering edit mode

So does it mean the first three steps are enough to count reads?

2
Entering edit mode

Yes. But in order to replicate their results you need to consider the last sentence, as it tells you how exactly the counting was done. Different counting strategies may give you different counts.

1
Entering edit mode

WikiPedia (Alternative_splicing) to the rescue.

1
Entering edit mode
5.6 years ago
benformatics ★ 2.8k

Given that they did this in R using GenomicRanges/GenomicFeatures, my assumption would be that the last section refers to the exact ranges that were used in the summarizeOverlaps() call. In this case they probably used their alignments (BAM files) and overlapped them with a GRangesList that had one entry per gene each containing all of the potential annotated exons.

If you were to try to extract this information from a TranscriptDb (txdb) the following function would likely have been used (assuming txdb is a working transcript db object):

gene.exons <- transcriptsBy(txdb,'gene')