Hi,
I'm curious to know if there is a standard tool or method to perform the following function:
Given some set of genomic alignments (e.g. a set of .bam files aligning to hg19), generate alignments to a set of transcripts from this genome represented by e.g. a GTF file.
I've seen people talk about doing the opposite before: going from alignments to the transcriptome and projecting them back to genomic coordinates. I want to go the other way, a sort of "un-projection". In particular (and this is key), alignments to a single genomic origin that correspond to multiple isoforms of a gene should generate multiple output alignments.
Does anyone know of any software that would allow me to perform such processing?
Do you actually want the transcriptome coordinates or do you just want counts of things? The latter is more common since the former tends to not be useful.
Hi Devon,
I actually would like the transcriptome coordinates. Literally, I want to project the genomic alignments onto all annotated transcripts. I realize this makes the problem more burdensome, which is why I came here to see if anyone has attempted something similar.
I'd be surprised if there's not something prewritten to do this, but I'm not personally aware of it. If you've not found anything then you could always write something up. Using Rsamtools and GenomicFeatures should make this an easy enough thing to code (yes, that will be a bit slow).
What is the output: a SAM record, or an otherwise reasonably complete alignment to the transcript?
Yes. The tool I'd imagine would look something like this.
Input: GTF file describing potential target transcripts, BAM/SAM alignment to the genome.
Output: SAM/BAM alignment to the target transcripts identified in the GTF file, where genomic alignments have been "expanded" to all of the transcripts they cover (i.e. a read may be unique in genomic location but map to potentially many transcripts; all of these alignments should be output).
As I said before, I know of tools for going the other way, but not for going from genome -> transcriptome.
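To make the "expansion" step concrete, here is a minimal Python sketch, not an existing tool. The function names `parse_gtf_exons` and `compatible_transcripts` are hypothetical, and the sketch assumes a simple GTF with tab-separated fields and `transcript_id "..."` attributes. It also simplifies compatibility to "every aligned block lies inside some exon of the transcript"; it does not check that splice junctions match exon boundaries, which a real tool would probably need to do.

```python
from collections import defaultdict


def parse_gtf_exons(gtf_lines):
    """Collect exon intervals per transcript from GTF lines.

    Returns {transcript_id: (chrom, strand, [(start, end), ...])} using
    1-based inclusive coordinates, with exons sorted by genomic start.
    """
    exons = defaultdict(list)
    meta = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        # Attributes look like: gene_id "G1"; transcript_id "T1";
        attrs = {}
        for item in fields[8].split(";"):
            item = item.strip()
            if item:
                key, _, value = item.partition(" ")
                attrs[key] = value.strip('"')
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        exons[tid].append((start, end))
        meta[tid] = (chrom, strand)
    return {tid: (meta[tid][0], meta[tid][1], sorted(iv))
            for tid, iv in exons.items()}


def compatible_transcripts(transcripts, chrom, blocks):
    """Return sorted ids of transcripts whose exons contain every block.

    `blocks` are the read's aligned genomic segments (1-based inclusive),
    i.e. the pieces separated by N operations in the CIGAR string.
    """
    hits = []
    for tid, (t_chrom, _strand, exons) in transcripts.items():
        if t_chrom != chrom:
            continue
        if all(any(s <= b_start and b_end <= e for s, e in exons)
               for b_start, b_end in blocks):
            hits.append(tid)
    return sorted(hits)
```

Each transcript id returned for a read would then get its own output record, which is the one-to-many behaviour described above.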
Interesting concept. I don't know of a tool that does this, but it feels quite useful and possibly not that complicated (though I might not fully understand all the implications).
Wouldn't it just be a matter of shifting coordinates by a translation? For the POS field: alignment POS - each transcript's leftmost POS -> new POS. The CIGAR is already relative to the alignment.
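A plain shift like this is enough for a single-exon transcript on the forward strand. For a multi-exon transcript, the intron lengths before the read also have to be subtracted, and any N (skipped region) operations in the CIGAR have to go, since the transcript sequence contains no introns. The Python sketch below illustrates both points under stated assumptions: the helper names `genome_to_transcript_pos` and `strip_introns` are hypothetical, coordinates are 1-based inclusive as in SAM and GTF, and exons are given sorted by genomic start.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def genome_to_transcript_pos(pos, exons, strand="+"):
    """Map a 1-based genomic position to a 1-based transcript position.

    `exons` is a list of (start, end) intervals, 1-based inclusive and
    sorted by genomic start. Returns None if `pos` is not in any exon.
    """
    offset = 0  # exonic bases that lie to the left of the current exon
    for start, end in exons:
        if start <= pos <= end:
            t_pos = offset + (pos - start) + 1
            if strand == "-":
                # Minus-strand transcripts are numbered from the right end.
                total = sum(e - s + 1 for s, e in exons)
                t_pos = total - t_pos + 1
            return t_pos
        offset += end - start + 1
    return None


def strip_introns(cigar):
    """Drop N operations and merge neighbouring operations of the same type."""
    merged = []
    for length, op in CIGAR_RE.findall(cigar):
        if op == "N":
            continue
        if merged and merged[-1][1] == op:
            merged[-1][0] += int(length)
        else:
            merged.append([int(length), op])
    return "".join(f"{n}{op}" for n, op in merged)
```

For a minus-strand transcript a complete record would also need the read sequence reverse-complemented, the qualities and CIGAR reversed, the strand flag flipped, and POS taken from the read's rightmost genomic base; the sketch leaves all of that out.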
Well, I agree that it's not that complicated, conceptually (though I see it taking a little time to round out all the rough edges). The motivation (mine at least) would be to be able to use existing alignments to a genome with RNA-seq quantification tools like RSEM, eXpress and (my new tool) Salmon, which work from alignments relative to a transcriptome.
Aha, now I get the rationale. Not having to realign the sequences would indeed make it a whole lot easier to evaluate another transcript-based methodology, and would head off the criticism of not using the whole genome.
I think just the conversion tool on its own would be quite a helpful tool in our arsenal!