Question

StringTie with Ensembl annotation: Fuzzy exon identification

0

7.4 years ago

jonasmst ▴ 410

I'm using StringTie with Ensembl annotations (the GTF file downloaded from Ensembl FTP --> Gene sets --> GTF), and I'm having an issue with exon variants whose genomic positions differ only slightly. Some exons have start positions that differ by as little as 1 bp (e.g. one starts at 1001, another at 1002), and the same goes for the end positions. As a result, StringTie reports two separate coverage values, one for each exon. I would like to treat two such exons as one and the same, and I'm wondering how to go about it.

I can't find a suitable option in the StringTie manual, so I'm considering altering the annotation: finding exons with very small differences in start or end positions, keeping only the lowest start position and the highest end position for each such group, and re-running StringTie with the new annotation. Is there something obviously flawed with this approach?

Does anyone know of a way to either:

Make StringTie treat almost-identical exons as one and the same exon, or
Change the annotation to only contain the longest variant of each exon?
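For concreteness, the annotation-collapsing idea could be sketched roughly as below. This is only an illustration: the function name `collapse_exons`, the tolerance `tol`, and the tuple layout are assumptions, and reading exon lines from the GTF (columns 1, 4, 5 and 7) and writing the merged coordinates back into each transcript are left out.

```python
from collections import defaultdict

def collapse_exons(exons, tol=2):
    """Merge exons on the same chromosome and strand whose starts AND ends
    each differ by at most `tol` bp, keeping the lowest start and the
    highest end of every group.

    `exons` is an iterable of (chrom, strand, start, end) tuples.
    """
    by_locus = defaultdict(list)
    for chrom, strand, start, end in exons:
        by_locus[(chrom, strand)].append((start, end))

    merged = []
    for (chrom, strand), intervals in by_locus.items():
        clusters = []  # each cluster is [lowest_start, highest_end]
        for start, end in sorted(set(intervals)):
            for c in clusters:
                if abs(start - c[0]) <= tol and abs(end - c[1]) <= tol:
                    # Near-identical exon: widen the cluster to cover it.
                    c[0] = min(c[0], start)
                    c[1] = max(c[1], end)
                    break
            else:
                clusters.append([start, end])
        merged.extend((chrom, strand, s, e) for s, e in clusters)
    return sorted(merged)

# Hypothetical exons: two differ by 1 bp at the start, one has a distinct end.
exons = [
    ("1", "+", 1001, 1200),
    ("1", "+", 1002, 1200),
    ("1", "+", 1001, 1350),
    ("1", "+", 5000, 5100),
]
print(collapse_exons(exons))
```

Note that the exon ending at 1350 stays separate because its end differs by far more than the tolerance. Because clusters widen as exons are added, a long chain of slightly offset exons could drift further than `tol` from the first member.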

Thanks!

StringTie Ensembl Annotation • 2.3k views

updated 7.4 years ago by Matteo Schiavinato ★ 3.6k • written 7.4 years ago by jonasmst ▴ 410

score 1 · Answer 1 · 2016-12-19

1

7.4 years ago

Matteo Schiavinato ★ 3.6k

Hi @jonasmst,

AFAIK, there is no such option. Moreover, I would consider it dangerous if there were one. Altering an annotation is easy, but it can have serious consequences if done without experience and foresight. I personally wouldn't recommend it.

Another point of discussion is whether two exons whose starts differ by 1 nt should be considered identical at all. If they are coding exons, a 1-nucleotide difference shifts the open reading frame across the exon, so almost all the amino acids differ when that chunk is translated. Are you sure you want to lose this information? Admittedly, sometimes you do, because you only need to quantify the expression of a locus regardless of which protein is encoded.
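To illustrate the reading-frame point, here is a tiny sketch that splits the same sequence into codons with and without a 1 nt offset. The sequence and the helper `codons` are made up for the example; real exons would also depend on the phase inherited from the upstream exon.

```python
def codons(seq, offset=0):
    """Split `seq` into complete triplets, starting at position `offset`."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

seq = "ATGGCCAAGTTGACC"  # hypothetical sequence, for illustration only

in_frame = codons(seq)     # exon read from its first base
shifted = codons(seq, 1)   # same bases, but starting 1 nt later

print(in_frame)  # ['ATG', 'GCC', 'AAG', 'TTG', 'ACC']
print(shifted)   # ['TGG', 'CCA', 'AGT', 'TGA']
```

None of the triplets is shared between the two readings, which is why a 1 nt difference in a coding exon is not necessarily negligible.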

Finally, a consideration: StringTie's strength is in assembling transcripts and producing a new GTF file that you can then use for quantification. You can also run it without the newly discovered transcripts (I think that is the -e option, used together with -G). Did you use it? If not, StringTie may have assembled a new transcript from reads that start mapping 1 nt after the annotated exon start, hence the very small difference between the two transcripts.
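As a rough sketch of such a reference-only run: per the StringTie manual, `-e` restricts estimation to the transcripts in the annotation supplied with `-G`, and `-o` names the output GTF. The helper name and the file names below are placeholders, not anything from the original post, and the command is only built and printed here, not executed.

```python
import shlex

def stringtie_reference_only_cmd(bam, annotation_gtf, out_gtf):
    """Build a StringTie command line that quantifies only the transcripts
    present in `annotation_gtf` (-e together with -G)."""
    return ["stringtie", bam, "-e", "-G", annotation_gtf, "-o", out_gtf]

# Placeholder file names, for illustration only.
cmd = stringtie_reference_only_cmd("sample1.bam", "annotation.gtf", "sample1.gtf")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where StringTie is installed.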

I hope these considerations were useful and implicitly answered your questions!

0

Thanks, @Macspider, this touches upon my main concern about whether this approach would be "safe": am I losing something essential? I'm looking at alternative splicing, and if I understand my supervisor correctly, a difference of a few bases is not really of interest; what matters is whether the exon is expressed in a sample or not. The problem I'm encountering is along the lines of this example: an exon A appears to be expressed in samples 1-10, but not in sample 11 (according to StringTie's cov value). Looking more closely at the gene in sample 11, I find another exon B whose expression is similar to that of A in samples 1-10, except that B begins 1 nt before exon A. So the total expression of all the variants of the exon is very similar in all samples; it's just that another, very slightly different exon variant "gets" all the coverage in sample 11, and this is what I'm trying to address. Did this make sense?

EDIT: Also, so far I've only seen this problem in the first and last exons of a transcript, so I'm wondering if it has to do with differences in transcription start sites or polyadenylation (and I'm not currently interested in distinguishing exons at that level of detail).
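An alternative to editing the annotation would be to pool the coverage of near-identical exon variants after quantification. The sketch below assumes the exon-level `cov` values have already been extracted into (sample, start, end, cov) records for one gene and strand; the function name, the tolerance, and the numbers are hypothetical. Summing average per-base coverage is only an approximation, reasonable when the variants have almost the same length.

```python
from collections import defaultdict

def pool_coverage(records, tol=2):
    """Sum `cov` per sample over exon variants whose starts and ends each
    differ by at most `tol` bp from a cluster's representative exon."""
    reps = []  # representative (start, end) of each cluster
    pooled = defaultdict(lambda: defaultdict(float))
    for sample, start, end, cov in sorted(records, key=lambda r: (r[1], r[2])):
        for rep in reps:
            if abs(start - rep[0]) <= tol and abs(end - rep[1]) <= tol:
                break  # reuse the matching representative
        else:
            rep = (start, end)  # no match: this exon starts a new cluster
            reps.append(rep)
        pooled[rep][sample] += cov
    return {rep: dict(by_sample) for rep, by_sample in pooled.items()}

# Hypothetical values mirroring the example: variant A (1001-1200) carries
# the coverage in sample s1, variant B (1000-1200) carries it in s11.
records = [
    ("s1", 1001, 1200, 30.0),
    ("s1", 1000, 1200, 0.0),
    ("s11", 1001, 1200, 0.0),
    ("s11", 1000, 1200, 29.0),
]
print(pool_coverage(records))
```

With pooling, both samples show comparable coverage for the exon, regardless of which variant received the reads.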

7.4 years ago by jonasmst ▴ 410

0

My guess is that coverage is leaking at the transcript margins, and this biases the output of StringTie, which always tries to reconstruct its own reference. Try Cufflinks with the same annotation and see whether the results are the same!

7.4 years ago by Matteo Schiavinato ★ 3.6k