Question

nb spanning reads confusion in Tophat Fusion output

1

8.9 years ago

guillaume.rbt ★ 1.0k

Hi all,

I'm currently trying to find gene fusions in paired-end RNA-seq data using TopHat-Fusion.

I have difficulty understanding some fields of the output.

The TopHat manual says:

"nbSpanningReads" is the number of reads that span the fusion, "nbSpanningPairs" is the number of mate pairs that support the fusion, "nbSpanningPairsInFusion" is the number of mate pairs that support the fusion and whose one end spans the fusion.

Does this mean that the reads counted in nbSpanningReads are not paired (given that nbSpanningPairsInFusion already counts pairs in which one read spans the fusion)?

Are nbSpanningPairs and nbSpanningPairsInFusion independent?

Thank you in advance for your ideas.

RNA-Seq • 4.0k views

ADD COMMENT • link updated 15 months ago by Ram 43k • written 8.9 years ago by guillaume.rbt ★ 1.0k

0

Have you used TopHat-Fusion on mouse data?

I always get empty results after running tophat-fusion-post. Also, does TopHat-Fusion only work with paired-end data?

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by syxbestmayer ▴ 20

0

Hi,

I'm currently working on human data.

I found that tophat-fusion-post is very stringent, so you may need to run tophat-fusion alone and filter the results yourself to find anything.

TopHat-Fusion can use both paired-end and single-end data.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by guillaume.rbt ★ 1.0k

0

Filter the results myself? That is too difficult for me.

I get results for my human data, but nothing for my mouse data.

Can you give me some advice?

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by syxbestmayer ▴ 20

0

Either try tophat-fusion-post again with the least stringent parameters (the documentation is here: http://ccb.jhu.edu/software/tophat/fusion_manual.html),

or filter the results from tophat-fusion yourself with a command such as:

awk '{if($5 > X) print}' fusions.out | sed 's/@\t/\n/g'

replacing X with the minimum number of spanning reads that must support a fusion (column 5 of fusions.out is the number of spanning reads).

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by guillaume.rbt ★ 1.0k
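For readers who prefer a script to the awk one-liner, here is a minimal Python sketch of the same filter. It assumes the layout visible in the example line elsewhere in this thread: whitespace-separated summary columns before the first `@`, with columns 5 to 7 holding the spanning reads, spanning pairs, and spanning pairs in fusion. The names `FusionCandidate`, `parse_fusion_line`, and `filter_fusions` are hypothetical helpers, not part of TopHat.

```python
from typing import Iterable, Iterator, NamedTuple


class FusionCandidate(NamedTuple):
    """Summary columns of one fusions.out line (hypothetical helper type)."""
    chroms: str
    left_pos: int
    right_pos: int
    orientation: str
    spanning_reads: int
    spanning_pairs: int
    spanning_pairs_in_fusion: int


def parse_fusion_line(line: str) -> FusionCandidate:
    # Assumption: everything before the first '@' holds the summary columns.
    summary = line.split("@", 1)[0].split()
    return FusionCandidate(
        summary[0], int(summary[1]), int(summary[2]), summary[3],
        int(summary[4]), int(summary[5]), int(summary[6]),
    )


def filter_fusions(lines: Iterable[str], min_spanning_reads: int) -> Iterator[FusionCandidate]:
    """Yield candidates with more than min_spanning_reads spanning reads (like `$5 > X`)."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        candidate = parse_fusion_line(line)
        if candidate.spanning_reads > min_spanning_reads:
            yield candidate
```

Typical use would be `with open("fusions.out") as fh: kept = list(filter_fusions(fh, 5))`, where the threshold 5 is only an illustrative value.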

0

Thank you very much, but I can't get the gene names.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by syxbestmayer ▴ 20

0

To get the gene names, you have to compare your fusion positions with the genome annotation that matches your reference assembly.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by guillaume.rbt ★ 1.0k
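As a rough illustration of that comparison, the sketch below looks up which annotated genes contain each side of a fusion. Everything here is an assumption for illustration: the `Gene` records, the gene names and coordinates in the usage note, the inclusive interval test, and the idea that the chromosome pair can be split on the first hyphen (which would fail for contig names that themselves contain a hyphen). In practice the gene intervals would come from a GTF or refFlat file for the same assembly used in the alignment.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    """One gene interval from an annotation (hypothetical record type)."""
    name: str
    chrom: str
    start: int
    end: int


def genes_at(chrom: str, pos: int, genes: List[Gene]) -> List[str]:
    # Linear scan; fine for a sketch, an interval index would be faster.
    return [g.name for g in genes if g.chrom == chrom and g.start <= pos <= g.end]


def annotate_fusion(chroms: str, left_pos: int, right_pos: int,
                    genes: List[Gene]) -> Tuple[List[str], List[str]]:
    """Return the gene names overlapping the left and right breakpoints."""
    # Assumption: the first column looks like 'chrA-chrB'.
    left_chrom, right_chrom = chroms.split("-", 1)
    return (genes_at(left_chrom, left_pos, genes),
            genes_at(right_chrom, right_pos, genes))
```

With made-up records such as `Gene("GENE_A", "chr1", 100, 200)` and `Gene("GENE_B", "chr2", 500, 900)`, calling `annotate_fusion("chr1-chr2", 150, 600, genes)` returns the two names, one per side. An empty list on either side means the breakpoint falls outside every annotated gene.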

Ram · Answer 1 · 2015-05-20

2

8.9 years ago

ethan.kaufman ▴ 380

My understanding is that both "SpanningReads" and "SpanningPairsInFusion" originate as reads that are initially unmapped after tophat alignment, so they are by definition not part of a proper pair. (They are aligned in a subsequent phase where they are divided into segments which are aligned separately). The difference between the two is that the former does not use the other mate as evidence for the fusion event, while the latter does. In "SpanningPairs", both reads are aligned by tophat in the initial phase, but their pairing is discordant (i.e. different chromosomes, different orientations, etc). So yes, all three categories are disjoint.

I encourage you to read the paper where the algorithm is described; hopefully it will clarify things for you. I agree the documentation is lacking.

Update: I thought this diagram might help. This is how I think of it, anyway. (X is the breakpoint; lines starting with # indicate sequence aligning in a discordant location.)

-----------X-----------  SpanningReads 
#           ^_________^
-----------------------  ----X------------------  SpanningPairsInFusion
#                             ^________________^
----------------------- X ----------------------  SpanningPairs
#                         ^____________________^

ADD COMMENT • link updated 15 months ago by Ram 43k • written 8.9 years ago by ethan.kaufman ▴ 380
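The three categories described in the answer above can be written down as a small decision rule. This is only a sketch of that reading, not TopHat-Fusion's actual implementation: the two boolean inputs (`read_crosses_breakpoint`, `pairing_supports_fusion`) and the names `Support`, `classify`, and `tally` are assumptions introduced for illustration.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Tuple


class Support(Enum):
    SPANNING_READ = "SpanningReads"
    SPANNING_PAIR_IN_FUSION = "SpanningPairsInFusion"
    SPANNING_PAIR = "SpanningPairs"
    NONE = "none"


def classify(read_crosses_breakpoint: bool, pairing_supports_fusion: bool) -> Support:
    """Assign one fragment to exactly one category, so the categories stay disjoint."""
    if read_crosses_breakpoint:
        # A read is split across the breakpoint; the mate may or may not count as evidence.
        if pairing_supports_fusion:
            return Support.SPANNING_PAIR_IN_FUSION
        return Support.SPANNING_READ
    # No read crosses the breakpoint; only a discordant pairing can support the fusion.
    return Support.SPANNING_PAIR if pairing_supports_fusion else Support.NONE


def tally(fragments: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count supporting fragments per category, ignoring non-supporting ones."""
    kinds = (classify(crosses, pairing) for crosses, pairing in fragments)
    return Counter(kind.value for kind in kinds if kind is not Support.NONE)
```

Because each fragment lands in exactly one branch, the three counts never overlap, which matches the claim that the categories are disjoint.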

1

Thank you very much for your response, it really helps me.

I would like to compute a "quality score" for each fusion based on the number of spanning reads and mate pairs. Do you think it would be meaningful to simply add the three fields?

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by guillaume.rbt ★ 1.0k

1

I think that is probably too simplistic. You're probably aware that tophat-fusion reports its own quality score with each candidate, and that this is only loosely correlated with the total number of supporting reads. It also matters how the supporting reads are distributed around the breakpoint, what the mapping quality of those reads is, and how many reads refute the event (i.e. map normally across the breakpoint), among other things. I find this score to generally be a good indicator that a fusion will look real when I visualize the alignment in IGV.

It's also worth noting that just because a fusion (or any transcript) is weakly expressed, it doesn't mean it's not biologically meaningful. With fusions it's important to perform some functional classification (based, for example, on which domains are retained in the fusion) to assess what kind of function the fusion might have.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by ethan.kaufman ▴ 380

0

Thank you for your useful advice.

I wasn't aware that tophat-fusion has built-in quality scores. I would be very interested in using them, but I haven't found them in my output, and they aren't described in the documentation I use (http://ccb.jhu.edu/software/tophat/fusion_manual.html). Maybe I'm not using the right parameters to compute the quality score.

Here is the command I use:

tophat2 \
  -p 8 \
  -G ./ucsc_hg19_refflat.gtf \
  --library-type fr-firststrand \
  --mate-inner-dist 175 \
  --mate-std-dev 75 \
  -o ./test_tophat2 \
  --fusion-search ./hg19 ./test.R1.fastq ./test.R2.fastq

Here is an example of my output:

chr1-chr1    565326    238105483    fr    11    0    0    6027    54    46    0.347107    @    3 5 7 10 14     @    ACCAATACCACCAATCAATACTCATCATTAATAATCATAATGGCTATAGC AATAAAACTAGGAATAGCCCCCTTTCACTTCTGAGTCCCAGAGGTTACCC    @    ATAAATACTATTAATCAATTTTCATCCTTAATAATAATAATGGTTCTAGT AATAAAACTAGGAATAGCCACCTTTCACTTCTGAGTCCCAGAGGTAACCC    @    11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 10 10 10 10 9 8 6 5 3 3 2 1 1 1 1 1 1 1 1 1 @    11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 10 10 10 10 10 10 10 10 10 10 10 10 10 9 8 8 6 5 3 2 1 1 1 1 0 0 0 0     @

Could you tell me how to obtain this quality score, or whether it is present somewhere in this output?

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by guillaume.rbt ★ 1.0k

0

Ah, that's only the intermediate output. You need to run tophat-fusion-post to filter these (that's where the scores are reported). If you still don't see a score, then it's probably a version issue. I am running 2.0.11.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by ethan.kaufman ▴ 380