I have a question that seems simple, but I can't find an answer anywhere: is there a way to make TopHat, when aligning reads to a transcriptome and then the genome, report the reads that align to the transcriptome in transcript coordinates (in addition to, or instead of, genomic coordinates)?
I see no options in the TopHat manual that allow you to force it to report alignments to transcripts in terms of transcript coordinates, or even just name which transcript it's aligning to in a tag or something. The option below implies that it does an alignment to the transcripts-- i.e. it does exactly what I want, it just then takes the additional step of converting back to genomic coordinates and doesn't tell me what transcript the read is coming from.
-T/--transcriptome-only Only align the reads to the transcriptome and report only those mappings as genomic mappings.
This would make several subsequent processing steps easier with my data. I have short reads (<=30 or so bp) being reported as genomic alignments, and have to go back and try to figure out which transcript they're coming from; it's fine when there's only one annotation in a region, but when there are overlapping annotations it becomes much more difficult. The transcripts ENST00000264933 and ENST00000316418 from Ensembl (GRCh37.p13) partially overlap, and I have a read which TopHat correctly splices in favor of ENST00000264933.4's pattern based on 2 nucleotides at the end of the read; however, because the read is reported as a genomic alignment and the first 18+ nucleotides come from a genomic region that is annotated as part of both transcripts, it is difficult for me to build a parser to decide which transcript it should be assigned to. TopHat's splicing pattern tells me that it already made the decision, it just didn't report it in a way that I can easily use. Is there a way to have TopHat report the coordinates (or even just the ID) of the transcript it is aligning the read to, either in place of or in addition to genomic coordinates? (Note that I cannot use just the transcriptome in place of a genome, because some of the reads in the sample are from unspliced mRNA and need the genome sequence in order to be aligned properly, unless I am misunderstanding something.)
Thanks very much.
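For reference, the assignment logic described in the question (use the splice junctions of the genomic alignment to choose between overlapping transcripts, then convert to transcript coordinates) can be sketched in Python. This is a minimal sketch and not something TopHat provides: the function names, the `toy` transcript table and its coordinates are all made up for illustration, and exon coordinates are assumed to be 1-based and inclusive, as in a GTF file.

```python
import re

def read_blocks(pos, cigar):
    """Split a SAM alignment (1-based POS plus CIGAR) into genomic blocks.

    Blocks are separated by N (skipped region) operations and returned as
    (start, end) tuples, 1-based inclusive.
    """
    blocks, start, cur = [], pos, pos
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op in "M=XD":          # operations that consume the reference
            cur += n
        elif op == "N":           # intron: close this block, open the next
            blocks.append((start, cur - 1))
            cur += n
            start = cur
        # I, S, H and P do not consume the reference
    blocks.append((start, cur - 1))
    return blocks

def transcript_start(blocks, strand, exons):
    """Return the 1-based transcript coordinate of the read's 5'-most
    transcript base, or None if the read's splice pattern does not fit."""
    exons = sorted(exons)
    first = next((i for i, (s, e) in enumerate(exons)
                  if s <= blocks[0][0] and blocks[0][1] <= e), None)
    if first is None or first + len(blocks) > len(exons):
        return None
    for j, (bs, be) in enumerate(blocks):
        es, ee = exons[first + j]
        if not (es <= bs and be <= ee):
            return None
        if j > 0 and bs != es:                 # must start at the acceptor
            return None
        if j < len(blocks) - 1 and be != ee:   # must end at the donor
            return None
    upstream = sum(e - s + 1 for s, e in exons[:first])
    plus_start = upstream + blocks[0][0] - exons[first][0] + 1
    if strand == "+":
        return plus_start
    aligned = sum(e - s + 1 for s, e in blocks)
    total = sum(e - s + 1 for s, e in exons)
    return total - (plus_start + aligned - 1) + 1

def assign(pos, cigar, transcripts):
    """Map transcript name -> transcript coordinate for compatible transcripts."""
    blocks = read_blocks(pos, cigar)
    hits = {}
    for name, (strand, exons) in transcripts.items():
        start = transcript_start(blocks, strand, exons)
        if start is not None:
            hits[name] = start
    return hits

# Hypothetical annotation: two transcripts sharing sequence up to position 117.
toy = {
    "tx_A": ("+", [(90, 117), (618, 700)]),
    "tx_B": ("+", [(50, 117), (300, 350)]),
}
print(assign(100, "18M500N2M", toy))  # {'tx_A': 11}
```

With the same toy table, an unspliced read such as `assign(100, "18M", toy)` comes back with both transcripts, which is the ambiguity described above: without a junction, the genomic alignment alone cannot separate them.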
If you want alignments to the transcriptome, then just align to the transcriptome using bowtie or similar.
As mentioned, some of the reads in the sample are coming from unspliced mRNA containing introns, which won't align to just the transcriptome. I suppose it could perhaps be achieved by doing two separate alignments, first against the transcriptome, taking the unmapped reads from that and aligning them to the genome, but that seems burdensome for something that is seemingly only a single step from what TopHat already does.
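The two-pass workaround described above might look roughly like this. It is only a sketch: it assumes bowtie2 for the transcriptome pass (any aligner that can write out its unaligned reads would do), the index and file names are placeholders, and the function only builds the command lines rather than running them.

```python
def two_pass_commands(reads_fq, transcriptome_index, genome_index, prefix):
    """Build command lines for a transcriptome-then-genome alignment.

    All paths and index names are placeholders supplied by the caller.
    """
    unmapped_fq = f"{prefix}.not_in_transcriptome.fq"
    # Pass 1: align to an index built from transcript sequences and
    # write the reads that fail to align to a separate FASTQ file.
    pass1 = [
        "bowtie2",
        "-x", transcriptome_index,
        "-U", reads_fq,
        "--no-unal",               # keep unaligned reads out of the SAM
        "--un", unmapped_fq,       # ...and write them here instead
        "-S", f"{prefix}.transcriptome.sam",
    ]
    # Pass 2: align only the leftover reads to the genome.
    pass2 = [
        "tophat2",
        "-o", f"{prefix}_genome",
        genome_index,
        unmapped_fq,
    ]
    return pass1, pass2

if __name__ == "__main__":
    for cmd in two_pass_commands("reads.fq", "tx_index", "genome_index", "sample1"):
        print(" ".join(cmd))
```

If the first index is built from a FASTA with one entry per transcript, the reference name and position in the pass 1 SAM file are the transcript ID and transcript coordinate.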
What you describe is exactly what tophat does internally. That's among the reasons it's so painfully slow.
Yes, I know-- what I meant was having to do it in two separate steps, the first time feeding it a "genome" containing each CDS as a separate entry, so that it would report the "genomic coordinates" as transcript alignments, and then take the unmapped.bam and run it through tophat again against the real genome. But even then, I would need to be able to do a similar process of assigning the unspliced reads to features anyway, so that would be of marginal use to me.
TopHat is obsolete, and no longer recommended by its developers. I suggest you use a different aligner.
I was actually advised by the bioinformatics office here, when I asked this same question of them, that using something other than tophat would probably require additional justification because it's still the standard. In any case, HISAT2 seems to have the same functionality and no clearly documented option that would do what I wanted-- are there any in particular you know would be able to do this?
Personally, I recommend BBMap, which I wrote. Other people often recommend STAR. I have never heard anyone recommend TopHat. I mean, ever. It's just not a good program.
HISAT2 is the successor to TopHat from the same developers, so if you want to be very safe, I highly recommend you use it. No conservative person will criticize you for that choice.