Question: Stringtie output files
20 months ago, chipolino wrote:

I am a new user of StringTie, and this question is probably very simple, but I still don't get it. I have my sorted BAM files (HISAT2 output, genome v19), and here is my StringTie command (v1.3.4):

stringtie hisat2_work/hisat2/alignments.sorted.bam -o stringtie_results/transcripts.gtf -G genes.GRCh37.gtf --rf -A stringtie_results/

As a result, I have two output files: a gene abundance file and a transcript annotation file (transcripts.gtf). For example, if I open the gene abundance file, I see this line:

Gene ID Gene Name   Reference   Strand  Start   End Coverage    FPKM    TPM
ENSG00000223972 DDX11L1 1   +   11869   14412   0.180934    0.129907    0.341143

But if I search for the gene name DDX11L1 (or its gene ID) in transcripts.gtf, I don't find it; it is absent. At the same time, I can find other genes from the gene abundance file in transcripts.gtf, for example:

Line in the gene abundance file:

ENSG00000227232 WASH7P  1   -   14363   29806   16.906973   12.345821   32.420803

Corresponding line in transcripts.gtf:

StringTie   transcript  14363   29370   1000    -   .   gene_id "STRG.2"; transcript_id "STRG.2.2"; reference_id "ENST00000423562"; ref_gene_id "ENSG00000227232"; ref_gene_name "WASH7P"; cov "1.478912"; FPKM "1.061831"; TPM "2.788425";

What could be the problem here? Why are some genes from the gene abundance file missing from my transcripts.gtf file?
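To see exactly which genes are affected, the two outputs can be compared programmatically. The following is a minimal sketch, not part of the original post: it assumes the gene abundance file is tab-delimited with the header shown above, and that transcripts.gtf carries a `ref_gene_id` attribute for transcripts that match the reference annotation. The function names are made up for illustration, and the sample data is just the lines quoted in the question.

```python
import csv
import io
import re

# Sample data copied from the question; with real files, read their text from disk instead.
GENE_ABUNDANCE_TSV = (
    "Gene ID\tGene Name\tReference\tStrand\tStart\tEnd\tCoverage\tFPKM\tTPM\n"
    "ENSG00000223972\tDDX11L1\t1\t+\t11869\t14412\t0.180934\t0.129907\t0.341143\n"
    "ENSG00000227232\tWASH7P\t1\t-\t14363\t29806\t16.906973\t12.345821\t32.420803\n"
)
TRANSCRIPTS_GTF = (
    'StringTie\ttranscript\t14363\t29370\t1000\t-\t.\t'
    'gene_id "STRG.2"; transcript_id "STRG.2.2"; reference_id "ENST00000423562"; '
    'ref_gene_id "ENSG00000227232"; ref_gene_name "WASH7P"; '
    'cov "1.478912"; FPKM "1.061831"; TPM "2.788425";\n'
)

# Matches the ref_gene_id attribute anywhere in a GTF line.
REF_GENE_ID = re.compile(r'ref_gene_id "([^"]+)"')


def abundance_genes(tsv_text):
    """Map gene ID -> gene name from the gene abundance table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["Gene ID"]: row["Gene Name"] for row in reader}


def gtf_ref_gene_ids(gtf_text):
    """Collect every ref_gene_id mentioned in the GTF text."""
    ids = set()
    for line in gtf_text.splitlines():
        if line.startswith("#"):  # skip GTF header/comment lines
            continue
        match = REF_GENE_ID.search(line)
        if match:
            ids.add(match.group(1))
    return ids


def genes_missing_from_gtf(tsv_text, gtf_text):
    """Genes listed in the abundance table but absent from the GTF."""
    in_gtf = gtf_ref_gene_ids(gtf_text)
    return sorted(
        (gene_id, name)
        for gene_id, name in abundance_genes(tsv_text).items()
        if gene_id not in in_gtf
    )


missing = genes_missing_from_gtf(GENE_ABUNDANCE_TSV, TRANSCRIPTS_GTF)
print(missing)  # [('ENSG00000223972', 'DDX11L1')]
```

Matching on `ref_gene_id` rather than `gene_id` matters here, because StringTie assigns its own `STRG.*` gene IDs in transcripts.gtf.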

rna-seq stringtie gtf • 2.3k views

Hello and welcome to Biostars,

To show the commands you use and file contents, please use the code button (the one labeled 101 010). This makes your post much more readable.

This time I did it for you.

fin swimmer

written 20 months ago by finswimmer
20 months ago, Kevin Blighe wrote:

The gene that was not included in transcripts.gtf (DDX11L1) has coverage that falls below the minimum coverage threshold. It is virtually not expressed at all.

Modify the -C and -c parameters to StringTie:

-C <cov_refs.gtf> StringTie outputs a file with the given name with all transcripts in the provided reference file that are fully covered by reads (requires -G).

-c <float> Sets the minimum read coverage allowed for the predicted transcripts. A transcript with a lower coverage than this value is not shown in the output. Default: 2.5
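As a hedged illustration of how these two options could be added to the command from the question, the sketch below only assembles and prints the argument list; it does not run StringTie. The value `0.1` for `-c` and the path `stringtie_results/cov_refs.gtf` for `-C` are assumptions chosen for illustration, and the helper name `build_stringtie_command` is made up. The `-A` option is left out because its file name is not given in the post. Whether a lower `-c` value actually brings DDX11L1 into transcripts.gtf would need to be checked on the real data.

```python
import shlex


def build_stringtie_command(bam, out_gtf, ref_gtf, min_cov, cov_refs_gtf):
    """Assemble a StringTie argument list with -c and -C added (nothing is executed)."""
    return [
        "stringtie", bam,
        "-o", out_gtf,
        "-G", ref_gtf,       # reference annotation, required by -C
        "--rf",              # strandedness flag, as in the original command
        "-c", str(min_cov),  # minimum read coverage for predicted transcripts
        "-C", cov_refs_gtf,  # output file for fully covered reference transcripts
    ]


cmd = build_stringtie_command(
    bam="hisat2_work/hisat2/alignments.sorted.bam",
    out_gtf="stringtie_results/transcripts.gtf",
    ref_gtf="genes.GRCh37.gtf",
    min_cov=0.1,                                    # illustrative value, not a recommendation
    cov_refs_gtf="stringtie_results/cov_refs.gtf",  # hypothetical output path
)
print(shlex.join(cmd))
```

With StringTie installed, the same list could be passed to `subprocess.run(cmd, check=True)`.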



I should additionally point out that DDX11L1 is a pseudogene. So, it makes sense that it may have minimal expression if it has no promoter sequence or TSS such that transcription at a meaningful level could occur.

written 20 months ago by Kevin Blighe