Hi, I have been trying to run an RNA-seq analysis on some paired-end data. I aligned with HISAT2, then ran StringTie, StringTie merge, and then StringTie again. For the analysis I am using:
grch38_tran.tar.gz - https://ccb.jhu.edu/software/hisat2/index.shtml
Homo_sapiens.GRCh38.84.gtf - ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
My issue is that, despite running StringTie again after the merge to remove some of the MSTRGs, I am still getting a large number of them in my data set. More alarmingly, the MSTRGs that do exist account for the highest counts in my sample.HISAT2-2.1.0.aligned.sorted.StringTie.1.3.3.gene_count_matrix.
Number of each: 24801 mstrg / 33970 ensg
Fraction of total: .42199 mstrg / .57800 ensg
Sum of each counts: 78615368 mstrg / 778402 ensg
Fraction of counts: .99019 mstrg / .00980 ensg
So while MSTRG IDs make up only ~42% of the gene IDs, they account for ~99% of what has been counted. I have minimum coverage set to 5, and have -G set, as well as -e to restrict to the reference given.
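For anyone wanting to reproduce this kind of tally, here is a minimal Python sketch. It assumes the gene count matrix is a CSV with a header row, the gene ID in the first column, and integer counts in the remaining columns; the function names `summarize_by_prefix` and `fractions` are hypothetical helpers, not part of StringTie.

```python
import csv
import io


def summarize_by_prefix(csv_text):
    """Tally gene IDs and summed counts per ID class (mstrg / ensg / other).

    Assumes: first row is a header, first column is the gene ID,
    all remaining columns hold integer counts.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    n_genes = {}
    n_counts = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        gene_id = row[0]
        if gene_id.startswith("MSTRG"):
            key = "mstrg"
        elif gene_id.startswith("ENSG"):
            key = "ensg"
        else:
            key = "other"
        n_genes[key] = n_genes.get(key, 0) + 1
        n_counts[key] = n_counts.get(key, 0) + sum(int(x) for x in row[1:])
    return n_genes, n_counts


def fractions(tally):
    """Convert a {class: value} tally into {class: fraction of total}."""
    total = sum(tally.values())
    return {k: v / total for k, v in tally.items()} if total else {}


# Re-deriving the fractions from the totals quoted in this post.
print(fractions({"mstrg": 24801, "ensg": 33970}))
print(fractions({"mstrg": 78615368, "ensg": 778402}))
```

To run it on a real matrix, read the file into a string and pass it to `summarize_by_prefix`, then feed either returned dictionary to `fractions`.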
Is there any way to further optimize this? Have I missed an important step?
Do you need to run stringtie? Do you expect new transcripts, and does your project require dealing with them? Why don't you quantify against the reference transcriptome/GTF with tools like featureCounts, or use transcript quantifiers like salmon or kallisto?
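As a rough illustration of the featureCounts route, the sketch below only builds the command line in Python and prints it; it does not run anything. The BAM and output file names are placeholders, and `featurecounts_command` is a hypothetical helper. The flags used (`-p` for paired-end input, `-T` threads, `-t`/`-g` feature type and grouping attribute, `-a` annotation, `-o` output) should be checked against the featureCounts documentation for your installed version, since paired-end handling has changed between releases.

```python
import shlex


def featurecounts_command(bam_files, gtf, out_path, threads=4):
    """Build (but do not run) a featureCounts argv for paired-end gene counts."""
    cmd = [
        "featureCounts",
        "-p",               # input is paired-end
        "-T", str(threads), # number of threads
        "-t", "exon",       # count reads overlapping exon features
        "-g", "gene_id",    # summarize exons per gene_id attribute
        "-a", gtf,          # reference annotation (GTF)
        "-o", out_path,     # output count table
    ]
    cmd.extend(bam_files)   # one or more aligned BAM files
    return cmd


# Placeholder file names; substitute your own sorted HISAT2 alignments.
cmd = featurecounts_command(
    ["sample.sorted.bam"], "Homo_sapiens.GRCh38.84.gtf", "gene_counts.txt"
)
print(shlex.join(cmd))
```

Because counting is done only against the supplied GTF, the resulting table contains reference gene IDs and no MSTRG entries.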