Question: StringTie final sample gtf
Asked 2.7 years ago by zoegward:

Hi, I am trying to do RNA-seq differential expression analysis using StringTie and then either Ballgown or DESeq2. I followed the StringTie manual and used gencode.gtf as the reference annotation file:

stringtie -p 4 -G ../gencode.gtf -o ./stringtie/sample1.gtf ./sample1.bam

This produces a GTF file for each sample, like so:

1       StringTie       transcript      131025  134836  1000    +       .       gene_id "STRG.1"; transcript_id "STRG.1.1"; reference_id "ENST00000442987.3"; ref_gene_id "ENSG00000233750.3"; ref_gene_name "CICP27"; cov "0.097128"; FPKM "0.038404"; TPM "0.050769";

etc
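For readers who want to inspect records like the one above programmatically, here is a minimal sketch of parsing one GTF line in Python. It assumes the real file is tab-separated with `key "value";` pairs in the ninth column (the spacing shown in the post is flattened); `parse_gtf_line` and `GTF_COLUMNS` are illustrative names, not part of StringTie.

```python
import re

# GTF columns 1-8 are fixed; column 9 holds `key "value";` attribute pairs.
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame"]
ATTRIBUTE_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Return a dict of the eight fixed columns plus the parsed attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attribute values stay as strings; convert cov/FPKM/TPM with float() as needed.
    record["attributes"] = dict(ATTRIBUTE_RE.findall(fields[8]))
    return record


# The per-sample record quoted in the post, rebuilt with tab separators.
example = "\t".join([
    "1", "StringTie", "transcript", "131025", "134836", "1000", "+", ".",
    'gene_id "STRG.1"; transcript_id "STRG.1.1"; '
    'reference_id "ENST00000442987.3"; ref_gene_id "ENSG00000233750.3"; '
    'ref_gene_name "CICP27"; cov "0.097128"; FPKM "0.038404"; TPM "0.050769";',
])
record = parse_gtf_line(example)
print(record["attributes"]["transcript_id"], record["attributes"]["FPKM"])
```

The regular expression only handles quoted attribute values, which is what the records in this post use; unquoted numeric attributes found in some other GTF flavours would need extra handling.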

I then used the merge option (stringtie --merge -p 4 -G ../../gencode.gtf -o stringtie_merged.gtf ../mergelist.txt) to merge the transcripts from all of the samples into a 'master' GTF file which, as I understand it, represents all of the feature information (an annotation file of the transcriptome for my RNA-seq data), e.g.

1       HAVANA  transcript      131025  134836  .       +       .       gene_id "ENSG00000233750.3"; transcript_id "ENST00000442987.3"; gene_name "CICP27"; ref_gene_id "ENSG00000233750.3";

I then ran StringTie again using the merged GTF file to obtain a final GTF file with FPKM and coverage for each sample, e.g.

stringtie -e -B -p 4 -G ./stringtie_merged.gtf  -o ballgown/sample1/sample1.gtf ./sample1.bam

This produces a final sample GTF:

1       StringTie       transcript      131025  134836  1000    +       .       gene_id "ENSG00000233750.3"; transcript_id "ENST00000442987.3"; ref_gene_name "CICP27"; cov "0.126465"; FPKM "0.053889"; TPM "0.082197";
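The re-estimation step above has to be repeated for every sample. A minimal sketch of generating those commands in Python follows; it reuses only the flags and paths quoted in the post, while the function name `quantify_command` and the second sample name are hypothetical.

```python
import shlex


def quantify_command(sample, merged_gtf="./stringtie_merged.gtf", threads=4):
    """Build the `stringtie -e -B` argument list for one sample.

    The layout (ballgown/<sample>/<sample>.gtf, ./<sample>.bam) mirrors the
    command quoted in the post; adjust it to your own directory structure.
    """
    return [
        "stringtie", "-e", "-B",
        "-p", str(threads),
        "-G", merged_gtf,
        "-o", f"ballgown/{sample}/{sample}.gtf",
        f"./{sample}.bam",
    ]


samples = ["sample1", "sample2"]  # hypothetical sample names
for name in samples:
    # Print rather than run, so the sketch works without StringTie installed.
    # To execute instead: subprocess.run(quantify_command(name), check=True)
    print(shlex.join(quantify_command(name)))
```

Building the command as a list rather than a single string avoids quoting problems if a sample name ever contains spaces.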

Which is all good. However, in this final sample GTF, all of the HAVANA and ENSEMBL entries have FPKM and coverage equal to 0, e.g.

1       HAVANA  exon    129055  129173  .       -       .       gene_id "ENSG00000238009.6"; transcript_id "ENST00000471248.1"; exon_number "3"; ref_gene_name "RP11-34P13.7"; cov "0.0";

1       ENSEMBL transcript      120725  133723  .       -       .       gene_id "ENSG00000238009.6"; transcript_id "ENST00000610542.1"; ref_gene_name "RP11-34P13.7"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
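To see how widespread the zero-valued entries are, one option is to tally transcript records by their source column. The sketch below is a rough illustration, assuming a tab-separated GTF with a `cov "..."` attribute on every transcript row, as in the records quoted in this post; `tally_transcripts` is an illustrative name.

```python
import re
from collections import Counter

COV_RE = re.compile(r'cov "([^"]*)";')


def tally_transcripts(lines):
    """Count transcript rows per source, and how many of them have cov == 0."""
    total, zero = Counter(), Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "transcript":
            continue  # skip exon rows so each transcript is counted once
        match = COV_RE.search(fields[8])
        if match is None:
            continue  # no coverage attribute on this row
        source = fields[1]
        total[source] += 1
        if float(match.group(1)) == 0.0:
            zero[source] += 1
    return total, zero


# The three final-GTF records quoted in the post, rebuilt with tab separators.
records = [
    "\t".join(["1", "StringTie", "transcript", "131025", "134836", "1000", "+", ".",
               'gene_id "ENSG00000233750.3"; transcript_id "ENST00000442987.3"; '
               'ref_gene_name "CICP27"; cov "0.126465"; FPKM "0.053889"; TPM "0.082197";']),
    "\t".join(["1", "HAVANA", "exon", "129055", "129173", ".", "-", ".",
               'gene_id "ENSG00000238009.6"; transcript_id "ENST00000471248.1"; '
               'exon_number "3"; ref_gene_name "RP11-34P13.7"; cov "0.0";']),
    "\t".join(["1", "ENSEMBL", "transcript", "120725", "133723", ".", "-", ".",
               'gene_id "ENSG00000238009.6"; transcript_id "ENST00000610542.1"; '
               'ref_gene_name "RP11-34P13.7"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";']),
]
total, zero = tally_transcripts(records)
for source in sorted(total):
    print(source, total[source], zero[source])
```

On a real file, the same function can be given an open file handle, for example `with open("ballgown/sample1/sample1.gtf") as handle: tally_transcripts(handle)`.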

My question is: why are these HAVANA/ENSEMBL entries there? The transcript at position 131025 in the example above originally came from gencode.gtf but is labelled StringTie in the final GTF, which I don't particularly care about. But why are there lots of HAVANA/ENSEMBL entries with values of zero in the final sample GTF? When I first saw this final sample GTF, my first thought was that it was due to the reference.fasta/BAM files not being numerically ordered, as suggested in this post (see threads #4 and #6): http://seqanswers.com/forums/showthread.php?t=8218
