Question: MSTRGs are plaguing my RNA-Seq analysis
gravatar for shintzen
17 months ago by
shintzen20 wrote:

Hi, I have been trying to run RNA seq analysis on some paired end data. I have aligned on HISAT2, and run Stringtie, Stringtie Merge and then Stringtie again. To do the analysis I am using: grch38_tran.tar.gz - Homo_sapiens.GRCh38.84.gtf -

My issue is that despite running stringtie again after merge to remove some of the MSTRGs, I am getting a large number of them in my data set. More alarmingly the MSTRGs that do exist represent the highest counts in my sample.HISAT2-2.1.0.aligned.sorted.StringTie.1.3.3.gene_count_matrix.

Number of each: 24801 mstrg / 33970 ensg

Fraction of total: .42199 mstrg / .57800 ensg

Sum of each counts: 78615368 mstrg / 778402 ensg

Fraction of counts: .99019 mstrg / .00980 ensg

So while the MSTRG only makes up ~42% of the gene ids, it is 99% of what has been counted. I have minimum coverage set to 5, and have -G set, as well as -e to restrict to the reference given.

Is there anyway to further optimize this? Have I missed out on an important step?

rna-seq alignment • 824 views
ADD COMMENTlink modified 17 months ago by andrew.j.skelton735.9k • written 17 months ago by shintzen20

Do you need to run stringtie? Do you expect new transcripts and does your project requires dealing with them? Why don't you quantify against the reference transcriptome/GTF with tools like featureCounts or use transcript quantifiers like salmon or kallisto?

ADD REPLYlink written 17 months ago by ATpoint35k
gravatar for andrew.j.skelton73
17 months ago by
andrew.j.skelton735.9k wrote:

This has always been an issue as far as novel transcript discovery goes, you can see a lot of hits. Keep in mind that the vast majority of these are very slight changes to known transcripts and splice events, which are generally meaningless. When performing this kind of analysis I generally get rid of any MSTRG ID that falls within a known annotation, and then look for protein coding potential of transcripts identified, finally prioritising on abundance. I'll then go through a short list of these transcripts and visualise them in IGV to see if they're convincing.

A lot of this prioritisation I've been able to do with awk, and drastically reducing noise with TACO, as a replacement for stringtie-merge. TACO also includes a utility to compare your merged GTF against a reference GTF, which is handy for subsetting.

ADD COMMENTlink written 17 months ago by andrew.j.skelton735.9k

From what I read I thought the same as you described where it is an expected issue, and just needed some external confirmation that there was nothing super obvious that I was missing as far as analysis and settings. Thanks

ADD REPLYlink written 17 months ago by shintzen20

Andrew, what is the best way to annotate my taco transcripts using my human reference .gtf? Like I need a gene ID and symbol for each one. I could use the co-ordinates and write a script, but what do you do?

ADD REPLYlink written 14 months ago by chris86330

Take a look at the taco_refcomp binary bundled with TACO, it's also in the manual on the website. That's how I typically do annotation of output, and then some awk to filter it to whatever I'm interested in. Hope that helps.

./taco_refcomp -o <output_directory> -r <reference_gtf> -t <test_gtf> --cpat (optional flag to run coding potential prediction)
ADD REPLYlink written 14 months ago by andrew.j.skelton735.9k

OK thanks, seems like a nice alternative to cuffmerge etc.

ADD REPLYlink written 14 months ago by chris86330
gravatar for patelk26
17 months ago by
patelk26120 wrote:

Have you tried option -c? This flag will output a file with all transcripts in the provided reference file that are fully covered by the reads. This flag will require Reference annotation file (-G) to be provided.

ADD COMMENTlink written 17 months ago by patelk26120
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1267 users visited in the last hour