Question: StringTie output files
chipolino wrote (15 months ago):

I am a new user of StringTie, and this question is probably very simple, but I still don't get it. I have my sorted BAM files (HISAT2 output, genome v19) and here is my StringTie command (v1.3.4):

stringtie hisat2_work/hisat2/alignments.sorted.bam -o stringtie_results/transcripts.gtf -G genes.GRCh37.gtf --rf -A stringtie_results/gene_abund.tab

This gives me two output files: gene abundances (gene_abund.tab) and a transcript annotation file (transcripts.gtf). For example, gene_abund.tab contains this line:

Gene ID          Gene Name  Reference  Strand  Start  End    Coverage  FPKM      TPM
ENSG00000223972  DDX11L1    1          +       11869  14412  0.180934  0.129907  0.341143

But if I search for the gene name DDX11L1 (or its gene ID, ENSG00000223972) in transcripts.gtf, I don't find it; it is absent. At the same time, I can find other genes from gene_abund.tab in transcripts.gtf, for example:

line in gene_abund.tab:

ENSG00000227232 WASH7P  1   -   14363   29806   16.906973   12.345821   32.420803

corresponding line in transcripts.gtf:

StringTie   transcript  14363   29370   1000    -   .   gene_id "STRG.2"; transcript_id "STRG.2.2"; reference_id "ENST00000423562"; ref_gene_id "ENSG00000227232"; ref_gene_name "WASH7P"; cov "1.478912"; FPKM "1.061831"; TPM "2.788425";

What could be the problem here? Why are some genes from gene_abund.tab missing from my transcripts.gtf file?


Hello and welcome to Biostars,

To show the commands you use and the contents of your files, please use the code button (the one with 101 010). This makes your post much more readable.

This time I did it for you.

fin swimmer

Kevin Blighe wrote (15 months ago):

The gene that is missing from transcripts.gtf (DDX11L1) has coverage that falls below the threshold. It is virtually not expressed at all.
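To see which genes are affected and how low their coverage is, something like the following Python sketch could help. It assumes the gene_abund.tab header shown in the question (tab-separated, with "Gene ID" and "Coverage" columns) and the gene_id / ref_gene_id attributes seen in the transcripts.gtf line above. The function names are made up for illustration and are not part of StringTie.

```python
import csv
import re

# Matches both gene_id "..." and ref_gene_id "..." attributes in a GTF line.
GENE_ID_ATTR = re.compile(r'(?:ref_)?gene_id "([^"]+)"')


def read_gene_coverage(abund_path):
    """Return {gene ID: coverage} from a StringTie -A table.

    Assumes a tab-separated file whose header includes the
    'Gene ID' and 'Coverage' columns, as in the question.
    """
    coverage = {}
    with open(abund_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            coverage[row["Gene ID"]] = float(row["Coverage"])
    return coverage


def read_gtf_gene_ids(gtf_path):
    """Collect every gene_id and ref_gene_id value found in a GTF file."""
    gene_ids = set()
    with open(gtf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header/comment lines
            gene_ids.update(GENE_ID_ATTR.findall(line))
    return gene_ids


def missing_genes(abund_path, gtf_path):
    """Return sorted (gene ID, coverage) pairs that are in the abundance
    table but have no matching gene_id/ref_gene_id in the GTF."""
    coverage = read_gene_coverage(abund_path)
    present = read_gtf_gene_ids(gtf_path)
    return sorted(
        (gene, cov) for gene, cov in coverage.items() if gene not in present
    )
```

Calling missing_genes("stringtie_results/gene_abund.tab", "stringtie_results/transcripts.gtf") would then list each absent gene next to its coverage, so you can check whether the missing ones are all very lowly covered.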

Adjust the -C and -c parameters of StringTie:

-C <cov_refs.gtf> StringTie outputs a file with the given name with all transcripts in the provided reference file that are fully covered by reads (requires -G).

-c <float> Sets the minimum read coverage allowed for the predicted transcripts. A transcript with a lower coverage than this value is not shown in the output. Default: 2.5
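As a rough illustration of how these two options fit into the command from the question, here is a small Python sketch that assembles the argument list. The helper name, the threshold of 0.1 and the cov_refs.gtf path are assumptions chosen for the example, not recommended settings. Only flags already mentioned in this thread are used.

```python
import shlex


def stringtie_command(bam, out_gtf, ref_gtf, gene_abund, min_cov, cov_refs=None):
    """Build a StringTie argument list with an explicit -c threshold.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    cmd = [
        "stringtie", bam,
        "-o", out_gtf,
        "-G", ref_gtf,
        "--rf",
        "-A", gene_abund,
        "-c", str(min_cov),  # minimum read coverage for predicted transcripts
    ]
    if cov_refs is not None:
        # -C writes the reference transcripts fully covered by reads (needs -G)
        cmd += ["-C", cov_refs]
    return cmd


cmd = stringtie_command(
    bam="hisat2_work/hisat2/alignments.sorted.bam",
    out_gtf="stringtie_results/transcripts.gtf",
    ref_gtf="genes.GRCh37.gtf",
    gene_abund="stringtie_results/gene_abund.tab",
    min_cov=0.1,  # illustrative value only
    cov_refs="stringtie_results/cov_refs.gtf",  # hypothetical output path
)
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) if StringTie is on your PATH. Whether a threshold this low is sensible is a separate question, since transcripts with almost no coverage are mostly noise.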

Kevin


I should additionally point out that DDX11L1 is a pseudogene. So it makes sense that it may show only minimal expression if it has no promoter sequence or TSS from which transcription at a meaningful level could occur.
