Question

StringTie final sample gtf

7.1 years ago

samuel ▴ 240

Hi, I am trying to do RNA-Seq differential expression analysis using StringTie and then either Ballgown or DESeq2. I have followed the StringTie manual and used gencode.gtf as the reference annotation file:

stringtie -p 4 -G ../gencode.gtf -o ./stringtie/sample1.gtf ./sample1.bam

This produces a GTF file for each sample, like so:

1       StringTie       transcript      131025  134836  1000    +       .       gene_id "STRG.1"; transcript_id "STRG.1.1"; reference_id "ENST00000442987.3"; ref_gene_id "ENSG00000233750.3"; ref_gene_name "CICP27"; cov "0.097128"; FPKM "0.038404"; TPM "0.050769";
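For anyone wanting to inspect records like the one above programmatically, here is a minimal sketch of splitting a GTF line into its fixed columns and its attribute key/value pairs. The function name parse_gtf_line is hypothetical (not part of StringTie or any library), and the sketch assumes every attribute is written as key "value"; as in the record shown.

```python
import re

def parse_gtf_line(line):
    """Split one GTF record into its 8 fixed columns plus an attribute dict.

    Hypothetical helper for illustration; assumes attributes look like
    key "value"; as in StringTie output.
    """
    # Split on any whitespace (tabs in a real file, spaces in a pasted line),
    # keeping the 9th field (the attribute string) intact.
    fields = line.strip().split(None, 8)
    if len(fields) != 9:
        raise ValueError("expected 9 GTF columns")
    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    attributes = dict(re.findall(r'(\S+) "([^"]*)";', attrs))
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }
```

Calling it on the record above gives "StringTie" as the source and exposes reference_id, cov, FPKM and TPM as string values in the attributes dict.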

etc

I then used the merge option (stringtie --merge -p 4 -G ../../gencode.gtf -o stringtie_merged.gtf ../mergelist.txt) to merge the transcript information from all of the samples into a 'master' GTF file that, as I understand it, represents all of the features (i.e. an annotation file of the transcriptome for my RNA-Seq data), e.g.

1       HAVANA  transcript      131025  134836  .       +       .       gene_id "ENSG00000233750.3"; transcript_id "ENST00000442987.3"; gene_name "CICP27"; ref_gene_id "ENSG00000233750.3";
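The mergelist.txt passed to --merge is a plain text file listing the per-sample GTF paths, one per line. As a rough sketch of how such a file could be generated (the helper name write_mergelist and the directory layout are assumptions for illustration, not something taken from the StringTie manual):

```python
from pathlib import Path

def write_mergelist(gtf_dir, out_path):
    """Write the path of every *.gtf file in gtf_dir to out_path, one per line.

    Hypothetical helper; assumes the per-sample GTFs from the first
    StringTie run all sit directly inside gtf_dir.
    """
    gtf_paths = sorted(Path(gtf_dir).glob("*.gtf"))
    Path(out_path).write_text("".join(f"{p}\n" for p in gtf_paths))
    return gtf_paths
```

With the layout used in the commands above, this would be called as write_mergelist("stringtie", "mergelist.txt"); adjust the paths to wherever the per-sample files actually live.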

I then ran StringTie again with the merged GTF file to obtain a final GTF file with FPKM and coverage for each sample, e.g.

stringtie -e -B -p 4 -G ./stringtie_merged.gtf  -o ballgown/sample1/sample1.gtf ./sample1.bam

This produces a final sample GTF:

1       StringTie       transcript      131025  134836  1000    +       .       gene_id "ENSG00000233750.3"; transcript_id "ENST00000442987.3"; ref_gene_name "CICP27"; cov "0.126465"; FPKM "0.053889"; TPM "0.082197";

Which is all good. However, in this final sample GTF, the FPKM and coverage for all of the HAVANA and ENSEMBL entries are 0, e.g.

1       HAVANA  exon    129055  129173  .       -       .       gene_id "ENSG00000238009.6"; transcript_id "ENST00000471248.1"; exon_number "3"; ref_gene_name "RP11-34P13.7"; cov "0.0";

1       ENSEMBL transcript      120725  133723  .       -       .       gene_id "ENSG00000238009.6"; transcript_id "ENST00000610542.1"; ref_gene_name "RP11-34P13.7"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
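To put numbers on this, one could tally the transcript records in a final sample GTF by their source column (column 2) and count how many in each group have a coverage of zero. The following is only a sketch under stated assumptions: the function name tally_transcripts is hypothetical, and it assumes each transcript record carries a cov "..." attribute as in the lines above.

```python
import re
from collections import Counter

def tally_transcripts(gtf_lines):
    """Count transcript records per source, and how many of them have cov == 0.

    Hypothetical helper; assumes StringTie-style attributes such as cov "0.0".
    Returns (total_per_source, zero_cov_per_source) as two Counters.
    """
    total = Counter()
    zero_cov = Counter()
    for line in gtf_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.strip().split(None, 8)
        # Only transcript-level records carry FPKM/TPM; skip exons and short lines.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        source, attrs = fields[1], fields[8]
        total[source] += 1
        match = re.search(r'cov "([^"]*)"', attrs)
        if match and float(match.group(1)) == 0.0:
            zero_cov[source] += 1
    return total, zero_cov
```

Opening ballgown/sample1/sample1.gtf and passing the file handle to tally_transcripts would show how the zero-coverage entries are distributed between the StringTie, HAVANA and ENSEMBL sources.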

My question is: why are these HAVANA/ENSEMBL entries there? The transcript at position 131025 in the example above originally came from gencode.gtf but is labelled StringTie in the final GTF, which I don't particularly care about. But why are there so many HAVANA/ENSEMBL entries with values of zero in the final sample GTF? When I first saw this file, my first thought was that it was due to the reference.fasta/BAM files not being numerically ordered, as suggested in this post (see threads #4 and #6): http://seqanswers.com/forums/showthread.php?t=8218

RNA-Seq • 3.7k views

updated 7.1 years ago by GouthamAtla 12k • written 7.1 years ago by samuel ▴ 240