Question: MSTRGs are plaguing my RNA-Seq analysis
7 months ago, shintzen wrote:

Hi, I have been trying to run an RNA-seq analysis on some paired-end data. I aligned with HISAT2, then ran StringTie, StringTie merge, and StringTie again. For the analysis I am using:

grch38_tran.tar.gz - https://ccb.jhu.edu/software/hisat2/index.shtml
Homo_sapiens.GRCh38.84.gtf - ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz

My issue is that, despite running StringTie again after the merge to remove some of the MSTRGs, I am still getting a large number of them in my data set. More alarmingly, the MSTRGs that do exist account for the highest counts in my sample.HISAT2-2.1.0.aligned.sorted.StringTie.1.3.3.gene_count_matrix.

Number of each: 24801 mstrg / 33970 ensg

Fraction of total: .42199 mstrg / .57800 ensg

Sum of each counts: 78615368 mstrg / 778402 ensg

Fraction of counts: .99019 mstrg / .00980 ensg

So while the MSTRG only makes up ~42% of the gene ids, it is 99% of what has been counted. I have minimum coverage set to 5, and have -G set, as well as -e to restrict to the reference given.
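
For anyone wanting to reproduce this kind of tally, here is a minimal Python sketch. It assumes the count matrix is a CSV whose first column is the gene ID, followed by one column of integer counts per sample, with a single header row, and that novel loci carry IDs starting with MSTRG. The function names `tally_by_prefix` and `tally_file` are made up for illustration and are not part of StringTie.

```python
import csv
from collections import defaultdict


def tally_by_prefix(rows):
    """Count gene IDs and sum read counts, split into MSTRG vs. everything else.

    `rows` is an iterable of lists: [gene_id, count_sample1, count_sample2, ...].
    """
    n_genes = defaultdict(int)
    total = defaultdict(int)
    for gene_id, *counts in rows:
        # Assumption: novel loci are named MSTRG.<n>, possibly with a suffix.
        label = "mstrg" if gene_id.startswith("MSTRG") else "other"
        n_genes[label] += 1
        total[label] += sum(int(c) for c in counts)
    all_genes = sum(n_genes.values())
    all_counts = sum(total.values())
    return {
        label: {
            "genes": n_genes[label],
            "gene_fraction": n_genes[label] / all_genes,
            "counts": total[label],
            "count_fraction": total[label] / all_counts if all_counts else 0.0,
        }
        for label in n_genes
    }


def tally_file(path):
    """Read a gene count matrix CSV (header row first) and tally it."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header line
        return tally_by_prefix(reader)
```

Calling `tally_file` on the count matrix returns the number of IDs, the summed counts, and both fractions for each group, which makes it easy to re-check the split after changing any StringTie settings.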

Is there any way to further optimize this? Have I missed an important step?

rna-seq alignment • 458 views

Do you need to run StringTie at all? Do you expect new transcripts, and does your project require dealing with them? Why not quantify against the reference transcriptome/GTF with a tool like featureCounts, or use a transcript quantifier like salmon or kallisto?

written 7 months ago by ATpoint
7 months ago, andrew.j.skelton73 (London) wrote:

This has always been an issue with novel transcript discovery: you can see a lot of hits. Keep in mind that the vast majority of these are very slight variations on known transcripts and splice events, which are generally meaningless. When performing this kind of analysis, I generally get rid of any MSTRG ID that falls within a known annotation, then look at the protein-coding potential of the remaining transcripts, and finally prioritise on abundance. I'll then go through a short list of these transcripts and visualise them in IGV to see if they're convincing.

I've been able to do a lot of this prioritisation with awk, and to drastically reduce noise by using TACO as a replacement for stringtie-merge. TACO also includes a utility to compare your merged GTF against a reference GTF, which is handy for subsetting.
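
A rough Python sketch of the first filtering step (dropping MSTRG transcripts that fall within a known annotation) might look like the following. This is not the awk pipeline mentioned above. It assumes standard 9-column GTF lines, that the reference GTF contains `gene` features, and it treats any same-strand overlap with a reference gene span as "within a known annotation". The function names `gene_spans` and `novel_transcripts` are hypothetical.

```python
import re
from collections import defaultdict

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gene_spans(gtf_lines):
    """Collect (start, end) spans per (chrom, strand) from reference 'gene' features."""
    spans = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "gene":
            continue
        spans[(f[0], f[6])].append((int(f[3]), int(f[4])))
    return spans


def novel_transcripts(merged_lines, spans, prefix="MSTRG"):
    """Yield transcript lines with a novel gene_id that overlap no reference gene.

    Limitation: transcripts with strand "." never match a reference strand,
    so they are always kept. The linear scan is fine for a sketch but slow
    on a full annotation; an interval tree would be the usual upgrade.
    """
    for line in merged_lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "transcript":
            continue
        match = GENE_ID.search(f[8])
        if not match or not match.group(1).startswith(prefix):
            continue
        start, end = int(f[3]), int(f[4])
        known = spans.get((f[0], f[6]), [])
        # Two closed intervals overlap when each starts before the other ends.
        if not any(s <= end and start <= e for s, e in known):
            yield line
```

The surviving lines are only candidates; they would still need the coding-potential check, abundance ranking and IGV inspection described above.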


From what I had read, I thought the same as you describe: that this is an expected issue. I just needed some external confirmation that there was nothing obvious I was missing in the analysis and settings. Thanks

written 7 months ago by shintzen

Andrew, what is the best way to annotate my TACO transcripts using my human reference .gtf? I need a gene ID and symbol for each one. I could use the coordinates and write a script, but what do you do?

written 4 months ago by chris86290

Take a look at the taco_refcomp binary bundled with TACO; it's also in the manual on the website. That's how I typically annotate the output, followed by some awk to filter it down to whatever I'm interested in. Hope that helps.

./taco_refcomp -o <output_directory> -r <reference_gtf> -t <test_gtf> --cpat
(--cpat is an optional flag to run coding potential prediction)
written 4 months ago by andrew.j.skelton73

OK thanks, seems like a nice alternative to cuffmerge etc.

written 4 months ago by chris86290
7 months ago, patelk26 wrote:

Have you tried option -C (capital C; lowercase -c sets the minimum read coverage)? This flag outputs a file with all transcripts in the provided reference file that are fully covered by reads. It requires a reference annotation file (-G) to be provided.
