Question

What Is A "Coverage Island" In The Context Of Tophat?

3


11.9 years ago

Dan D 7.4k

I'm reading the TopHat manual, and I'd like clarification on what exactly a "coverage island" is. According to the manual:

TopHat generates its database of possible splice junctions from three sources of evidence. The first source is pairings of "coverage islands", which are distinct regions of piled up reads in the initial mapping. Neighboring islands are often spliced together in the transcriptome, so TopHat looks for ways to join these with an intron.

I've done plenty of googling on this terminology, but I don't quite have a solid understanding yet. A visual would be huge. My current understanding is that the reads, being reverse-transcribed from RNA, will sometimes map to non-contiguous sites in the genome as a result of transcript assembly.

Here is a simple drawing of that understanding. What I think represents a coverage island is within the blue circle. Can someone verify or correct my understanding?

[Image: Coverage Island Conception]

tophat next-gen • 3.3k views

updated 11.9 years ago by Ryan Thompson ★ 3.6k • written 11.9 years ago by Dan D 7.4k

score 4 · Answer 1 · 2012-06-09

I think TopHat's explanation is quite clear, and in fact you've already described more or less what they are in your question. The phrase "distinct regions of piled up reads in the initial mapping" simply means that when you look at the BAM alignments from your RNA-seq experiment, you never get continuous coverage; you get regions with coverage and regions without it. The regions with coverage are what TopHat calls "coverage islands", in the metaphor of islands in an ocean of no coverage. Since TopHat tries to find splice junctions (among other things), that paragraph just tells you that one of the first things the algorithm does is examine consecutive or nearby regions of coverage that are separated by regions without coverage (either introns or untranscribed exons), to work out whether those regions are spliced together in the same gene or not.

Graphically, regarding your drawing: you're not showing all the reads of the experiment, only a few at the boundaries of two exons. To fix this, I would draw all the reads sequenced from the orange exon to the left of that bubble, plus all the reads sequenced from the orange exon to the right of it; you would then have two coverage islands represented (one per exon). I hope that makes it clear.
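The pairing idea described in this answer can be sketched in a few lines of Python. This is only an illustration of "look at neighboring islands separated by a gap", not TopHat's actual implementation; the function name `candidate_island_pairs` and the `max_gap` cutoff are hypothetical, and islands are assumed to be half-open `(start, end)` intervals on a single chromosome.

```python
def candidate_island_pairs(islands, max_gap=20000):
    """Pair each island with its immediate right-hand neighbor.

    islands: list of (start, end) half-open intervals on one chromosome.
    max_gap: hypothetical cutoff on the uncovered distance between islands;
             gaps larger than this are not treated as candidate introns.
    """
    ordered = sorted(islands)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        gap = right[0] - left[1]  # uncovered bases between the two islands
        if 0 < gap <= max_gap:
            pairs.append((left, right))
    return pairs


islands = [(100, 200), (500, 580), (100000, 100100)]
pairs = candidate_island_pairs(islands)
print(pairs)
```

Here only the first two islands are paired, because the gap to the third island exceeds the assumed cutoff. A real aligner would then test whether reads can actually be aligned across each candidate gap.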

score 2 · Answer 2 · 2012-06-11

I don't think the circle you have drawn is the coverage island. The concept of a "coverage island" is the computational proxy for the biological concept of an exon. Obviously you cannot say with certainty where the exons lie from computational data alone, so to avoid implying that the algorithm has knowledge of things that it cannot possibly know (i.e. where the exons truly are), the TopHat authors use the term "coverage island" to indicate a "region that is likely to be an exon based on sequencing data". Note that the Cufflinks documentation similarly uses the term "locus" as the computational proxy term for what biologists generally refer to as a gene.

So to be precise, a coverage island would simply be a contiguous region of the genome with nonzero read coverage (or, if you like, coverage above a set threshold depth). And the reads that you have circled in the diagram are the "bridge" that TopHat tries to build between neighboring islands.
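As a minimal sketch of that definition (not how TopHat itself computes islands), the following Python finds contiguous regions where read depth meets a threshold. The function name `coverage_islands`, the `min_depth` parameter, and the toy read coordinates are all made up for illustration; reads are assumed to be half-open `(start, end)` intervals on one chromosome.

```python
from collections import defaultdict


def coverage_islands(reads, min_depth=1):
    """Return half-open (start, end) regions where depth >= min_depth.

    reads: iterable of (start, end) half-open alignment intervals.
    min_depth: hypothetical threshold; 1 means any nonzero coverage.
    """
    # Sweep line: depth rises by 1 at each read start, falls by 1 at each end.
    deltas = defaultdict(int)
    for start, end in reads:
        deltas[start] += 1
        deltas[end] -= 1

    islands = []
    depth = 0
    island_start = None
    for pos in sorted(deltas):
        depth += deltas[pos]
        if depth >= min_depth and island_start is None:
            island_start = pos  # depth just reached the threshold
        elif depth < min_depth and island_start is not None:
            islands.append((island_start, pos))  # depth just fell below it
            island_start = None
    return islands


# Toy reads piling up over two separate regions (coordinates are invented).
reads = [(100, 150), (120, 170), (160, 200), (500, 550), (530, 580)]
islands = coverage_islands(reads)
print(islands)
```

With these toy reads, the overlapping alignments collapse into two islands, one per covered region, with an uncovered stretch between them.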