Question: Expression data from cufflinks merged gtf
2.3 years ago by
ddowlin70 wrote:

Hi all,

I have assembled six transcriptomes with hisat2/stringtie. I merged the resulting gtf files into a single merged gtf.

For many transcripts in the merged gtf a transcript_id starting with 'MSTRG' was created. My understanding is that the MSTRG ids are assigned to transcripts in the merged gtf which may have different ids in the individual unmerged gtfs. This could be because there is no reference id for this transcript.

I am now interested in looking at the expression levels (TPM) of certain transcripts in the individual unmerged samples. However, I am having trouble matching the MSTRG id in the merged gtf to the corresponding ids in the unmerged gtfs.

I attempted to solve this problem using bedtools intersect to get the overlapping coordinates in the merged gtf with one of the unmerged gtfs. This allows me to map the MSTRG id to the unmerged id.
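To make the mapping step concrete, here is a minimal Python sketch of the overlap logic I am relying on. It is not what bedtools does internally, only an illustration of the idea. The coordinates and ids come from my simplified example, and the names `unmerged`, `merged`, `overlaps` and `merged_to_unmerged` are made up for the illustration. It assumes all features are on the same chromosome and strand, which a real comparison would have to check.

```python
from collections import defaultdict

# Simplified intervals as (start, end, id). Chromosome and strand are omitted,
# so this assumes every feature lies on the same sequence and strand.
unmerged = [(22942, 24454, "25"), (19883, 22800, "26"), (24624, 26412, "27")]
merged = [(19883, 26517, "MSTRG.34")]


def overlaps(a_start, a_end, b_start, b_end):
    """Return True if two closed (GTF-style) intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


# Map each merged id to every unmerged id whose interval it overlaps.
merged_to_unmerged = defaultdict(list)
for u_start, u_end, u_id in unmerged:
    for m_start, m_end, m_id in merged:
        if overlaps(u_start, u_end, m_start, m_end):
            merged_to_unmerged[m_id].append(u_id)

print(dict(merged_to_unmerged))
```

With these inputs, one merged id ends up associated with several unmerged ids.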

However, I now have a new problem: in some cases a single MSTRG id is assigned to multiple unmerged ids. See below for a simplified example:

22942    24454    gene_id "25"    TPM "3"    19883    26517    gene_id "MSTRG.34"
19883    22800    gene_id "26"    TPM "5"    19883    26517    gene_id "MSTRG.34"
24624    26412    gene_id "27"    TPM "5"    19883    26517    gene_id "MSTRG.34"

My questions are:

  1. Why has stringtie merged these multiple transcripts into a single transcript in the merged gtf?
  2. Can I treat the TPM values as referring to a single transcript (i.e. the MSTRG id), and if so, what is the best way to do this?
    • Get the mean TPM per gene_id
    • Sum the TPM values per gene_id
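For reference, the two options could be computed roughly as follows. This is only a sketch using the three rows from my simplified example. The variable names (`rows`, `tpm_by_merged`, `summary`) are hypothetical, and it does not say which of the two summaries is the appropriate one.

```python
from collections import defaultdict
from statistics import mean

# (unmerged gene_id, TPM, merged gene_id), taken from the simplified example.
rows = [
    ("25", 3.0, "MSTRG.34"),
    ("26", 5.0, "MSTRG.34"),
    ("27", 5.0, "MSTRG.34"),
]

# Collect the TPM values of all unmerged entries that map to each merged id.
tpm_by_merged = defaultdict(list)
for _unmerged_id, tpm, merged_id in rows:
    tpm_by_merged[merged_id].append(tpm)

# Compute both candidate summaries per merged id.
summary = {
    merged_id: {"sum": sum(values), "mean": mean(values)}
    for merged_id, values in tpm_by_merged.items()
}
print(summary)
```

For MSTRG.34 this gives a sum of 13 and a mean of about 4.33, so the two options differ by a factor equal to the number of unmerged entries.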

Many thanks.

rna-seq stringtie gtf bedtools