7 weeks ago
martta95
When mapping RNA-Seq results to the reference genome using StringTie, I find many new transcripts that match the reference genome perfectly but extend further at the 5' end. How should such transcripts be annotated correctly? Should I use BLAST to find the most closely related species in this case? Are there any alternatives to StringTie that simultaneously map to the reference genome and assemble new transcripts with less restriction than a 1 bp mismatch?
As in the sequence does not match the reference at all?
The sequence matches the reference transcripts 100%, but the transcripts identified in the study are longer than the reference ones. Moreover, I find a lot of transcripts that match the reference.
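To make "same transcript, just longer at the 5' end" concrete, here is a minimal sketch that compares exon coordinates of a reference transcript and an assembled one. It assumes 1-based inclusive coordinates as in GTF; the function names `intron_chain` and `five_prime_extension` are made up for illustration and do not come from StringTie or any other tool.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # 1-based inclusive (start, end), as in GTF


def intron_chain(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return introns as (end of upstream exon, start of downstream exon) pairs."""
    ordered = sorted(exons)
    return [(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)]


def five_prime_extension(ref_exons: List[Exon], new_exons: List[Exon],
                         strand: str) -> Optional[int]:
    """Return the number of extra 5' bases in the assembled transcript.

    Returns None when the intron chains differ, i.e. the difference is not
    a simple end extension. Single-exon transcripts have empty intron chains,
    so overlap should be checked separately for them (not done here).
    """
    if intron_chain(ref_exons) != intron_chain(new_exons):
        return None
    ref_sorted, new_sorted = sorted(ref_exons), sorted(new_exons)
    if strand == "+":
        # On the plus strand the 5' end is the lowest coordinate.
        return max(0, ref_sorted[0][0] - new_sorted[0][0])
    # On the minus strand the 5' end is the highest coordinate.
    return max(0, new_sorted[-1][1] - ref_sorted[-1][1])


# Hypothetical coordinates, for illustration only.
ref = [(1000, 1200), (1500, 1800)]
longer = [(900, 1200), (1500, 1800)]
print(five_prime_extension(ref, longer, "+"))  # 100
```

A transcript whose splice junctions all agree with the reference and differs only in this way is usually treated as the same isoform with a different start, not as a new gene model.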
So are you working with reference genome or reference transcriptome sequences?
If you are working with a transcriptome (and if the reference you are working with is close to your species), then it is possible that StringTie is incorporating additional sequence at the 5' end that may or may not be real. Follow the other two comments and the suggestions therein. Ultimately an experiment may need to be done to prove those extensions are real.
I am not an assembly guy at all, so I always take the easy route and ask: "Do you really need to know about new transcripts? Does your analysis care, given that new transcripts are uncharacterised, unvalidated and not functionally annotated?" Or do you simply need expression counts for downstream analysis? If the latter, then use STAR or salmon (or alternatives), map the data against the genome/transcriptome annotations, get your count matrix and call it a day.
Which reference genome are you working on? If it is for a non-model organism, you can talk this over with the curator (if there is one). If you are the curator or there is no one, you can deal with it however you want, but it is tricky. There are many contrasting annotation approaches.
A few options
I work with Hordeum vulgare and use the Morex V3 reference genome. In the first stage, I used StringTie for mapping. As a result, I obtained sequences mapped to the reference genome and a group of sequences recognized as new, most of which had changes in the last exon or were longer than the reference transcripts. I am interested in determining the function and potential proteins that will be produced, as the identified differences may be related to varietal variability.
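One quick way to see whether a longer transcript could change the predicted protein is to compare the longest open reading frame in the reference and the extended sequence. The sketch below is a simplified illustration, not a replacement for a proper ORF predictor: it assumes the transcript sequence is already in sense orientation, scans only the three forward frames, and uses the standard stop codons. The function name `longest_orf` and the toy sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> str:
    """Return the longest ATG-to-stop ORF (stop codon included) in the
    three forward frames, or an empty string if none is found."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # look for the next ORF in this frame
    return best


# Toy sequences, for illustration only: the "extended" transcript adds
# four bases upstream that contain an in-frame ATG.
reference = "CCATGAAATTTTAGGG"
extended = "ATGG" + reference

print(longest_orf(reference))  # ATGAAATTTTAG
print(longest_orf(extended))   # ATGGCCATGAAATTTTAG
```

If the longest ORF is identical in both sequences, the extension only affects the 5' UTR; if it grows, as in the toy example, the extension introduces an upstream in-frame start codon and may lengthen the protein at its N-terminus.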