Does anyone have any suggestions for split-read mapping to circular reference genomes, or know of a way to modify the Hisat2-Stringtie-Ballgown pipeline so it works a little better in this scenario?
I’ve been analysing RNA-seq data from human cell lines transfected with circular plasmids, with a view to looking at alternative splicing (and isoform discovery). After analysing the data I’ve realised that I am getting both sense and antisense transcripts which span either position 1 (antisense transcription running from the 5’ end of the linear plasmid sequence to the 3’ end) or the end of the plasmid sequence (due to read-through of a polyA site at the 3’ end of the plasmid sequence into the 5’ end of the linear sequence). I have been using Hisat2-Stringtie-Ballgown, but these ‘read-through’ transcripts are confusing both the mapping (read pairs are marked as discordant when they aren’t really) and the transcript assembly: exons are being called which start at position 1 or end at the last nucleotide of the reference sequence, and these are then incorrectly assembled into the transcripts.
I thought about stitching several copies of the linear sequence together, but this would obviously result in non-uniquely mapped reads, which would impact the transcript assembly. I guess I could also modify the bit-wise flags for a subset of the reads to make them concordant when they span the first and last bp, but I am not sure what effect, if any, this would have on Stringtie and the assembly of the transcripts.
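To make those two ideas concrete, here is a minimal Python sketch of what I have in mind. The function names are hypothetical (they are not part of Hisat2, Stringtie or any library), positions are assumed to be 1-based as in SAM, and the only SAM detail relied on is that flag bit 0x2 marks a pair as properly aligned. It says nothing about whether Stringtie would actually behave better with the modified flags.

```python
# Hypothetical helpers for illustration only; not part of Hisat2 or Stringtie.

PROPER_PAIR = 0x2  # SAM flag bit: each segment properly aligned according to the aligner


def double_reference(seq: str) -> str:
    """Stitch two copies of the linearised plasmid together so that reads
    crossing the first/last bp junction can align contiguously."""
    return seq + seq


def fold_position(pos: int, plasmid_len: int) -> int:
    """Map a 1-based position on the stitched reference back onto the
    single-copy plasmid coordinates."""
    return (pos - 1) % plasmid_len + 1


def mark_concordant(flag: int) -> int:
    """Set the proper-pair bit on a SAM flag, leaving all other bits unchanged."""
    return flag | PROPER_PAIR


if __name__ == "__main__":
    plasmid = "ACGTACGGTA"             # toy 10 bp sequence, not real data
    stitched = double_reference(plasmid)
    print(len(stitched))               # length of the two-copy reference
    print(fold_position(11, len(plasmid)))  # first base of the second copy
    print(mark_concordant(65))         # paired + first-in-pair, now with 0x2 set
```

With the stitched reference, any read lying entirely within one copy still has an equally good hit in the other copy, which is the non-unique mapping problem mentioned above; folding positions back afterwards does not remove that ambiguity at the alignment step.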
Any suggestions from anyone who has done anything similar would be welcome!