Question: MSTRGs are plaguing my RNA-Seq analysis
1
gravatar for shintzen
10 days ago by
shintzen10
shintzen10 wrote:

Hi, I have been trying to run an RNA-seq analysis on some paired-end data. I aligned the reads with HISAT2, then ran StringTie, StringTie merge, and StringTie again. For the analysis I am using:

grch38_tran.tar.gz - https://ccb.jhu.edu/software/hisat2/index.shtml
Homo_sapiens.GRCh38.84.gtf - ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz

My issue is that, despite running StringTie again after the merge to remove some of the MSTRGs, I still get a large number of them in my data set. More alarmingly, the MSTRGs that remain have the highest counts in my sample.HISAT2-2.1.0.aligned.sorted.StringTie.1.3.3.gene_count_matrix.

Number of each: 24801 mstrg / 33970 ensg

Fraction of total: .42199 mstrg / .57800 ensg

Sum of each counts: 78615368 mstrg / 778402 ensg

Fraction of counts: .99019 mstrg / .00980 ensg

So while MSTRG IDs make up only ~42% of the gene IDs, they account for 99% of the counts. I have minimum coverage set to 5, and I pass -G together with -e to restrict the output to the given reference.
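The breakdown above can be reproduced with a short script. This is only a minimal sketch, assuming the matrix is a CSV of the kind written by StringTie's prepDE.py (a gene ID column followed by one count column per sample). The function name `summarise_by_prefix` and the inline demo data are made up for illustration and are not taken from the actual file.

```python
import csv
import io
from collections import defaultdict


def summarise_by_prefix(handle):
    """Tally gene IDs and summed counts per ID prefix (MSTRG, ENSG, other).

    Assumes a CSV with a header row, gene IDs in the first column and
    integer counts in every remaining column.
    """
    n_genes = defaultdict(int)
    n_counts = defaultdict(int)
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    for row in reader:
        if not row:
            continue
        gene_id = row[0]
        if gene_id.startswith("MSTRG"):
            group = "mstrg"
        elif gene_id.startswith("ENSG"):
            group = "ensg"
        else:
            group = "other"
        n_genes[group] += 1
        n_counts[group] += sum(int(x) for x in row[1:])

    total_genes = sum(n_genes.values())
    total_counts = sum(n_counts.values())
    return {
        group: {
            "genes": n_genes[group],
            "gene_fraction": n_genes[group] / total_genes if total_genes else 0.0,
            "counts": n_counts[group],
            "count_fraction": n_counts[group] / total_counts if total_counts else 0.0,
        }
        for group in sorted(n_genes)
    }


# Hypothetical three-gene matrix, only to show the call.
demo = (
    "gene_id,sampleA,sampleB\n"
    "MSTRG.1,900,80\n"
    "ENSG00000000003,10,5\n"
    "ENSG00000000005,3,2\n"
)
summary = summarise_by_prefix(io.StringIO(demo))
for group, stats in summary.items():
    print(group, stats)
```

For a real matrix, pass `open(path, newline="")` instead of the `io.StringIO` demo handle.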

Is there any way to optimize this further? Have I missed an important step?

rna-seq alignment • 144 views

Do you need to run StringTie at all? Do you expect new transcripts, and does your project require dealing with them? Why not quantify against the reference transcriptome/GTF with a tool like featureCounts, or use a transcript quantifier such as salmon or kallisto?
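As a rough illustration of the reference-only route, the sketch below assembles a featureCounts call for paired-end data. The helper name `featurecounts_command` and the file names `sample.bam` and `gene_counts.txt` are placeholders. The flags used (`-p`, `-T`, `-t`, `-g`, `-a`, `-o`) are standard featureCounts options, but check the manual for your installed version, since the handling of `-p` has changed between Subread releases.

```python
import shlex


def featurecounts_command(bam_files, annotation_gtf, out_path, threads=4):
    """Build (but do not run) a featureCounts command for gene-level counts."""
    cmd = [
        "featureCounts",
        "-p",                  # input is paired-end
        "-T", str(threads),    # number of threads
        "-t", "exon",          # feature type to count reads over
        "-g", "gene_id",       # GTF attribute used to group exons into genes
        "-a", annotation_gtf,  # reference annotation (GTF)
        "-o", out_path,        # output count table
    ]
    cmd.extend(bam_files)
    return cmd


# Placeholder file names; only the GTF name comes from the question.
cmd = featurecounts_command(
    ["sample.bam"], "Homo_sapiens.GRCh38.84.gtf", "gene_counts.txt"
)
print(shlex.join(cmd))
```

Because only the Ensembl GTF is supplied, every row in the resulting table carries a reference gene ID and no MSTRG IDs can appear. The list can be passed to `subprocess.run(cmd, check=True)` once the paths are real.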

written 10 days ago by ATpoint
andrew.j.skelton73 (London) wrote (9 days ago):

This has always been an issue with novel transcript discovery: you tend to see a lot of hits. Keep in mind that the vast majority of them are very slight variations on known transcripts and splice events, which are generally not meaningful. When performing this kind of analysis, I generally discard any MSTRG ID that falls within a known annotation, then assess the protein-coding potential of the remaining transcripts, and finally prioritise by abundance. I then go through a short list of these transcripts and visualise them in IGV to see whether they are convincing.
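The first and last filtering steps described above can be sketched in a few lines of Python. This is a hedged illustration, not the workflow actually used in this answer: "falls within a known annotation" is interpreted here as any same-strand overlap with a known gene, the record layout and the name `shortlist_novel` are invented for the example, and the coding-potential and IGV steps are not covered.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def shortlist_novel(transcripts, known_genes):
    """Keep MSTRG transcripts outside every known gene, most abundant first.

    transcripts: iterable of dicts with id, chrom, strand, start, end, tpm
    known_genes: iterable of (chrom, strand, start, end) tuples
    """
    # Index known gene spans by (chromosome, strand) for quick lookup.
    by_locus = {}
    for chrom, strand, start, end in known_genes:
        by_locus.setdefault((chrom, strand), []).append((start, end))

    kept = []
    for tx in transcripts:
        if not tx["id"].startswith("MSTRG"):
            continue  # reference transcripts are not candidates
        spans = by_locus.get((tx["chrom"], tx["strand"]), [])
        if any(overlaps(tx["start"], tx["end"], s, e) for s, e in spans):
            continue  # overlaps a known annotation: treat as noise
        kept.append(tx)

    # Prioritise by abundance, highest TPM first.
    return sorted(kept, key=lambda tx: tx["tpm"], reverse=True)


# Hypothetical records, only to show the call.
known = [("1", "+", 1000, 5000)]
candidates = [
    {"id": "MSTRG.1.1", "chrom": "1", "strand": "+", "start": 1200, "end": 4800, "tpm": 250.0},
    {"id": "MSTRG.2.1", "chrom": "1", "strand": "+", "start": 9000, "end": 12000, "tpm": 3.5},
    {"id": "MSTRG.3.1", "chrom": "2", "strand": "-", "start": 500, "end": 900, "tpm": 40.0},
    {"id": "ENST00000000001", "chrom": "1", "strand": "+", "start": 1000, "end": 5000, "tpm": 10.0},
]
shortlist = shortlist_novel(candidates, known)
print([tx["id"] for tx in shortlist])
```

The overlap check is a linear scan per transcript, which is fine for a sketch; at genome scale an interval tree or a dedicated tool would be the more usual choice.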

I've been able to do a lot of this prioritisation with awk, and to drastically reduce noise by using TACO as a replacement for StringTie merge. TACO also includes a utility to compare your merged GTF against a reference GTF, which is handy for subsetting.


From what I had read, I also understood this to be an expected issue, as you describe; I just needed some external confirmation that I wasn't missing anything obvious in the analysis or settings. Thanks.

written 9 days ago by shintzen
patelk26 wrote (10 days ago):

Have you tried the -C option (upper case; lower-case -c sets the minimum coverage)? It writes a file containing all transcripts in the provided reference file that are fully covered by reads. It requires a reference annotation file to be supplied with -G.
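A minimal sketch of such a call, written as a Python argument list so it can be handed to `subprocess.run`. In the StringTie manual the option that writes the fully covered reference transcripts takes an output file name and is the upper-case `-C`. The helper name and all file names except the Ensembl GTF are placeholders.

```python
import shlex


def stringtie_covered_refs_command(bam, reference_gtf, out_gtf, covered_gtf):
    """Build (but do not run) a StringTie call that also reports covered references."""
    return [
        "stringtie", bam,
        "-G", reference_gtf,  # reference annotation, required by -C
        "-o", out_gtf,        # assembled transcripts
        "-C", covered_gtf,    # reference transcripts fully covered by reads
    ]


# Placeholder file names for illustration.
covered_cmd = stringtie_covered_refs_command(
    "sample.sorted.bam",
    "Homo_sapiens.GRCh38.84.gtf",
    "sample.stringtie.gtf",
    "sample.cov_refs.gtf",
)
print(shlex.join(covered_cmd))
```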
