MSTRGs are plaguing my RNA-Seq analysis
2
3
Entering edit mode
3.0 years ago
shintzen ▴ 30

Hi, I have been trying to run RNA seq analysis on some paired end data. I have aligned on HISAT2, and run Stringtie, Stringtie Merge and then Stringtie again. To do the analysis I am using: grch38_tran.tar.gz - https://ccb.jhu.edu/software/hisat2/index.shtml Homo_sapiens.GRCh38.84.gtf - ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz

My issue is that despite running stringtie again after merge to remove some of the MSTRGs, I am getting a large number of them in my data set. More alarmingly the MSTRGs that do exist represent the highest counts in my sample.HISAT2-2.1.0.aligned.sorted.StringTie.1.3.3.gene_count_matrix.

Number of each: 24801 mstrg / 33970 ensg

Fraction of total: .42199 mstrg / .57800 ensg

Sum of each counts: 78615368 mstrg / 778402 ensg

Fraction of counts: .99019 mstrg / .00980 ensg

So while the MSTRG only makes up ~42% of the gene ids, it is 99% of what has been counted. I have minimum coverage set to 5, and have -G set, as well as -e to restrict to the reference given.

Is there anyway to further optimize this? Have I missed out on an important step?

rna-seq alignment • 1.6k views
0
Entering edit mode

Do you need to run stringtie? Do you expect new transcripts and does your project requires dealing with them? Why don't you quantify against the reference transcriptome/GTF with tools like featureCounts or use transcript quantifiers like salmon or kallisto?

5
Entering edit mode
3.0 years ago

This has always been an issue as far as novel transcript discovery goes, you can see a lot of hits. Keep in mind that the vast majority of these are very slight changes to known transcripts and splice events, which are generally meaningless. When performing this kind of analysis I generally get rid of any MSTRG ID that falls within a known annotation, and then look for protein coding potential of transcripts identified, finally prioritising on abundance. I'll then go through a short list of these transcripts and visualise them in IGV to see if they're convincing.

A lot of this prioritisation I've been able to do with awk, and drastically reducing noise with TACO, as a replacement for stringtie-merge. TACO also includes a utility to compare your merged GTF against a reference GTF, which is handy for subsetting.

0
Entering edit mode

From what I read I thought the same as you described where it is an expected issue, and just needed some external confirmation that there was nothing super obvious that I was missing as far as analysis and settings. Thanks

0
Entering edit mode

Andrew, what is the best way to annotate my taco transcripts using my human reference .gtf? Like I need a gene ID and symbol for each one. I could use the co-ordinates and write a script, but what do you do?

0
Entering edit mode

Take a look at the taco_refcomp binary bundled with TACO, it's also in the manual on the website. That's how I typically do annotation of output, and then some awk to filter it to whatever I'm interested in. Hope that helps.

./taco_refcomp -o <output_directory> -r <reference_gtf> -t <test_gtf> --cpat (optional flag to run coding potential prediction)

0
Entering edit mode

OK thanks, seems like a nice alternative to cuffmerge etc.

2
Entering edit mode
3.0 years ago
patelk26 ▴ 260

Have you tried option -c? This flag will output a file with all transcripts in the provided reference file that are fully covered by the reads. This flag will require Reference annotation file (-G) to be provided.