Question: StringTie final sample gtf
3.9 years ago, zoegward wrote:

Hi, I am trying to do RNA-seq differential expression analysis using StringTie, followed by either Ballgown or DESeq2. I followed the StringTie manual and used gencode.gtf as the reference annotation file:

stringtie -p 4 -G ../gencode.gtf -o ./stringtie/sample1.gtf ./sample1.bam

This produces a GTF file for each sample, like so:

1       StringTie       transcript      131025  134836  1000    +       .       gene_id "STRG.1"; transcript_id "STRG.1.1"; reference_id "ENST00000442987.3"; ref_gene_id "ENSG00000233750.3"; ref_gene_name "CICP27"; cov "0.097128"; FPKM "0.038404"; TPM "0.050769";
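
For anyone who wants to inspect these records programmatically, here is a minimal sketch of splitting one GTF line into its nine columns and an attribute dictionary. The function name `parse_gtf_line` and the regular expression are illustrative assumptions, not part of StringTie; the record is the one quoted above, with tabs assumed between columns.

```python
import re

# The first-pass record quoted above; a real GTF separates its 9 columns with tabs.
LINE = "\t".join([
    "1", "StringTie", "transcript", "131025", "134836", "1000", "+", ".",
    'gene_id "STRG.1"; transcript_id "STRG.1.1"; '
    'reference_id "ENST00000442987.3"; ref_gene_id "ENSG00000233750.3"; '
    'ref_gene_name "CICP27"; cov "0.097128"; FPKM "0.038404"; TPM "0.050769";',
])

# Matches one `key "value";` pair in the attribute column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Split one GTF line into named columns plus a dict of attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    return {
        "seqname": fields[0],
        "source": fields[1],
        "feature": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "attributes": dict(ATTR_RE.findall(fields[8])),
    }


record = parse_gtf_line(LINE)
attrs = record["attributes"]
print(record["source"], attrs["transcript_id"], attrs["reference_id"], attrs["FPKM"])
```

Values such as `cov`, `FPKM` and `TPM` stay as strings here; convert them with `float()` before comparing them numerically.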


I then used the merge option to combine the transcript information from all of the samples into a 'master' GTF file:

stringtie --merge -p 4 -G ../../gencode.gtf -o stringtie_merged.gtf ../mergelist.txt

As I understand it, this file represents all of the feature information (an annotation of the transcriptome for my RNA-seq data), e.g.

1       HAVANA  transcript      131025  134836  .       +       .       gene_id "ENSG00000233750.3"; transcript_id "ENST00000442987.3"; gene_name "CICP27"; ref_gene_id "ENSG00000233750.3";

I then ran StringTie again using the merged GTF file to obtain a final GTF file with FPKM and coverage for each sample, e.g.

stringtie -e -B -p 4 -G ./stringtie_merged.gtf  -o ballgown/sample1/sample1.gtf ./sample1.bam
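
The three steps described so far (per-sample assembly, merge, re-estimation against the merged annotation) can be sketched as a small command builder. This is only an illustration: the helper names are hypothetical, the flags are exactly the ones used in the commands above, and nothing is executed, the commands are only printed.

```python
import shlex


def assemble_cmd(bam, ref_gtf, out_gtf, threads=4):
    """Step 1: assemble transcripts for one sample, guided by the reference."""
    return ["stringtie", "-p", str(threads), "-G", ref_gtf, "-o", out_gtf, bam]


def merge_cmd(merge_list, ref_gtf, out_gtf, threads=4):
    """Step 2: merge the per-sample GTFs listed in merge_list."""
    return ["stringtie", "--merge", "-p", str(threads), "-G", ref_gtf,
            "-o", out_gtf, merge_list]


def quantify_cmd(bam, merged_gtf, out_gtf, threads=4):
    """Step 3: re-run with -e -B, using the merged GTF as the -G annotation."""
    return ["stringtie", "-e", "-B", "-p", str(threads), "-G", merged_gtf,
            "-o", out_gtf, bam]


# Paths copied from the commands in the post.
commands = [
    assemble_cmd("./sample1.bam", "../gencode.gtf", "./stringtie/sample1.gtf"),
    merge_cmd("../mergelist.txt", "../../gencode.gtf", "stringtie_merged.gtf"),
    quantify_cmd("./sample1.bam", "./stringtie_merged.gtf",
                 "ballgown/sample1/sample1.gtf"),
]

for cmd in commands:
    # Print a shell-safe version of each command instead of running it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually run a step, each list could be passed to `subprocess.run(cmd, check=True)`, assuming `stringtie` is on the PATH and the command is run from the directory its relative paths expect.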

This produces a final sample GTF:

1       StringTie       transcript      131025  134836  1000    +       .       gene_id "ENSG00000233750.3"; transcript_id "ENST00000442987.3"; ref_gene_name "CICP27"; cov "0.126465"; FPKM "0.053889"; TPM "0.082197";

Which is all good. However, in this final sample GTF, every entry whose source column is HAVANA or ENSEMBL has FPKM and coverage equal to 0, e.g.

1       HAVANA  exon    129055  129173  .       -       .       gene_id "ENSG00000238009.6"; transcript_id "ENST00000471248.1"; exon_number "3"; ref_gene_name "RP11-34P13.7"; cov "0.0";

1       ENSEMBL transcript      120725  133723  .       -       .       gene_id "ENSG00000238009.6"; transcript_id "ENST00000610542.1"; ref_gene_name "RP11-34P13.7"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
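
One way to check how widespread the pattern is would be to count transcript records by source column and by whether their coverage is zero. The sketch below does this on the three records quoted in this post; the function name `tally_transcripts` is hypothetical, and tabs are assumed between the GTF columns.

```python
import re
from collections import Counter

# Pulls the numeric value out of the `cov "..."` attribute.
COV_RE = re.compile(r'cov "([^"]+)"')


def tally_transcripts(lines):
    """Count transcript records per (source, 'zero' or 'nonzero' coverage)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "transcript":
            continue  # skip exon rows and malformed lines
        match = COV_RE.search(fields[8])
        if match is None:
            continue
        status = "zero" if float(match.group(1)) == 0.0 else "nonzero"
        counts[(fields[1], status)] += 1
    return counts


# The three records from the final sample GTF quoted in this post.
EXAMPLE = [
    "\t".join(["1", "StringTie", "transcript", "131025", "134836", "1000", "+", ".",
               'gene_id "ENSG00000233750.3"; transcript_id "ENST00000442987.3"; '
               'ref_gene_name "CICP27"; cov "0.126465"; FPKM "0.053889"; TPM "0.082197";']),
    "\t".join(["1", "HAVANA", "exon", "129055", "129173", ".", "-", ".",
               'gene_id "ENSG00000238009.6"; transcript_id "ENST00000471248.1"; '
               'exon_number "3"; ref_gene_name "RP11-34P13.7"; cov "0.0";']),
    "\t".join(["1", "ENSEMBL", "transcript", "120725", "133723", ".", "-", ".",
               'gene_id "ENSG00000238009.6"; transcript_id "ENST00000610542.1"; '
               'ref_gene_name "RP11-34P13.7"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";']),
]

counts = tally_transcripts(EXAMPLE)
for (source, status), n in sorted(counts.items()):
    print(source, status, n)
```

On a real file, the same function can be given an open file handle, e.g. `with open("ballgown/sample1/sample1.gtf") as fh: tally_transcripts(fh)`. If every zero-coverage transcript has HAVANA or ENSEMBL as its source and every non-zero one has StringTie, that would suggest the source column simply tracks whether any reads supported the transcript, though that interpretation is an assumption here and not something confirmed in this thread.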

My question is: why are these HAVANA/ENSEMBL entries there? The transcript at position 131025 in the example above originally came from gencode.gtf, but in the final GTF its source is labelled StringTie. I don't particularly care about that label, but why are there so many HAVANA/ENSEMBL entries with values of zero in the final sample GTF? When I first saw this file, my first thought was that it was due to the reference.fasta/BAM files not being numerically ordered, as was suggested by this post (see threads #4 and #6).

rna-seq • 2.8k views

