I'm looking for a bit of advice regarding transcript assembly and novel transcript identification.
I have about 90 BAM files from human mRNA-seq (all from the same tissue, ranging from 50M to 90M reads). They don't fall into separate groups (e.g. treatment vs. control), so they can be thought of as biological replicates. The BAM files are the result of STAR two-pass mapping, aligned to the human reference genome hg38.
My goal is to identify and assemble all mRNA isoforms for which there is evidence of expression, quantify that evidence and then compare the list of isoforms for each gene with those already existing in the reference annotation, to identify any novel transcript isoforms assembled. I'd also like to visualise the results (e.g. show all transcript isoforms and identify by colour those which are novel).
Thus far I've:
- Run StringTie on all BAM files with the reference annotation provided;
- Merged all resultant GTF files with
- Used GffCompare to compare the merged GTF with the original reference annotation.
However, I'm having a hard time interpreting the output from GffCompare, and am still at a loss as to how to identify novel transcripts, how to quantify their expression levels in my samples, and how to visualise them.
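As a starting point for the interpretation step, here is a minimal sketch of how the class codes might be read. It assumes the tab-separated `.tmap` table that GffCompare writes for each query file, with `class_code` and `qry_id` columns, and it assumes that codes `j`, `u`, `i` and `x` are the ones worth treating as "novel" (that choice is mine and may well be too broad or too narrow). The function names, gene and transcript IDs below are hypothetical, and the small in-memory table stands in for a real file.

```python
import csv
import io
from collections import Counter

# ASSUMPTION: which class codes count as "novel" is a judgment call.
# "=" is an exact intron-chain match to the reference; here "j" (shares a
# junction with a reference transcript), "u" (intergenic), "i" (intronic)
# and "x" (exonic overlap on the opposite strand) are treated as novel.
NOVEL_CODES = {"j", "u", "i", "x"}

# ASSUMPTION: column layout of a GffCompare .tmap file.
HEADER = ["ref_gene_id", "ref_id", "class_code", "qry_gene_id", "qry_id",
          "num_exons", "FPKM", "TPM", "cov", "len",
          "major_iso_id", "ref_match_len"]

# Hypothetical rows, for illustration only (not real results).
ROWS = [
    ["GENE_A", "TX_A1", "=", "MSTRG.1", "MSTRG.1.1", "5", "0", "0", "0", "2000", "MSTRG.1.1", "2000"],
    ["GENE_A", "TX_A1", "j", "MSTRG.1", "MSTRG.1.2", "6", "0", "0", "0", "2300", "MSTRG.1.1", "2000"],
    ["-", "-", "u", "MSTRG.2", "MSTRG.2.1", "2", "0", "0", "0", "900", "MSTRG.2.1", "-"],
]
EXAMPLE_TMAP = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"


def read_tmap(handle):
    """Read a tab-separated .tmap table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def novel_transcripts(rows, novel_codes=NOVEL_CODES):
    """Keep only the rows whose class code is in the 'novel' set."""
    return [row for row in rows if row["class_code"] in novel_codes]


def class_code_counts(rows):
    """Tally how many assembled transcripts fall under each class code."""
    return Counter(row["class_code"] for row in rows)


if __name__ == "__main__":
    rows = read_tmap(io.StringIO(EXAMPLE_TMAP))
    print(class_code_counts(rows))
    for row in novel_transcripts(rows):
        print(row["qry_id"], row["class_code"], row["ref_gene_id"])
```

For a real run, `io.StringIO(EXAMPLE_TMAP)` would be replaced by an open handle on the actual `.tmap` file. The resulting list of transcript IDs could then be used to pick out novel isoforms in per-sample abundance tables or to colour them in a genome browser track.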
I would greatly appreciate it if someone could point me in the right direction. Thanks!