Hi,
I ran StringTie on data from 38 samples, following the protocol in this paper: https://www.nature.com/articles/nprot.2016.095. I generated the merged GTF file by running stringtie --merge (using the -G option), and then I estimated transcript abundances.
After that, I used the method given here to run tximport on the StringTie output files. However, tximport returned just one large number per sample instead of raw counts for each gene. Looking at the data more carefully, I found that there were no gene_name attributes in the stringtie_merged GTF file, which is probably why tximport could not give separate counts for the genes/transcripts. (In comparison, the merged.gtf file in their paper does have gene_name attributes, at least for some transcripts.)
Then I searched online and found code by the developer here that is supposed to append gene names to the merged.gtf file, but after running it, the output GTF still doesn't have any gene names.
TL;DR: what should I do to ensure that the merged.gtf file has gene names, so that tximport can assign raw counts to the transcripts/genes?
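For context, tximport summarises transcript-level estimates to genes through a transcript-to-gene table (its tx2gene argument), which is keyed on transcript and gene identifiers rather than on gene names. Below is a minimal Python sketch for extracting such a table from a merged GTF. It assumes a standard nine-column GTF whose transcript lines carry transcript_id and gene_id attributes; the function names tx2gene_rows and write_tx2gene and the output column headers are hypothetical, chosen for illustration only. Whether this resolves the single-number output described above has not been verified.

```python
import csv
import re

# Matches GTF attribute pairs such as: gene_id "MSTRG.1";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def tx2gene_rows(gtf_lines):
    """Yield (transcript_id, gene_id, gene_name) once per transcript."""
    seen = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Only 'transcript' features are used; exon lines repeat the same IDs.
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        tx, gene = attrs.get("transcript_id"), attrs.get("gene_id")
        if tx and gene and tx not in seen:
            seen.add(tx)
            # gene_name may be absent, so default to an empty string.
            yield tx, gene, attrs.get("gene_name", "")


def write_tx2gene(gtf_path, out_path):
    """Write a tab-separated transcript-to-gene table (hypothetical helper)."""
    with open(gtf_path) as gtf, open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["TXNAME", "GENEID", "GENENAME"])
        writer.writerows(tx2gene_rows(gtf))
```

Calling write_tx2gene("stringtie_merged.gtf", "tx2gene.tsv") would produce a table whose first two columns could then be loaded in R and passed to tximport as tx2gene. The file names here are placeholders.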
I need to ask you two questions before I can answer: 1) Is it only the gene names that are the problem, or do you also lack gene_ids? 2) When you ran stringtie --merge, did you use the -G option to include a reference?
Hi Kristoffer,
1) I do see gene_ids in the stringtie merged file, BUT there are 2.1 million lines in stringtie_merged.gtf, and for 1.95 million of them the gene_id is of the form MSTRG.x; only the last 0.15 million lines have a gene_id of the form ENSG...
2) Yes, I did use the -G option:
stringtie --merge -p 8 -G /Volumes/bam/DRG/Homo_sapiens.GRCh38.97.gff3
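The split between MSTRG.x and ENSG... gene_ids described above can be tallied with a short script, which also reports how many lines carry a gene_name or ref_gene_id attribute. This is a minimal Python sketch, assuming a standard nine-column GTF; the function name summarize_gene_ids is hypothetical, and ref_gene_id is assumed to be the attribute that links a merged transcript back to its reference gene.

```python
import re
from collections import Counter

# Matches GTF attribute pairs such as: gene_id "MSTRG.1";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def summarize_gene_ids(gtf_lines):
    """Count lines by gene_id prefix and by presence of naming attributes."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        gene_id = attrs.get("gene_id", "")
        if gene_id.startswith("MSTRG."):
            counts["MSTRG"] += 1
        elif gene_id.startswith("ENSG"):
            counts["ENSG"] += 1
        else:
            counts["other"] += 1
        # Track which lines carry a link back to the reference annotation.
        for key in ("gene_name", "ref_gene_id"):
            if key in attrs:
                counts[key] += 1
    return counts


def summarize_file(gtf_path):
    """Convenience wrapper that reads the GTF from disk."""
    with open(gtf_path) as gtf:
        return summarize_gene_ids(gtf)
```

Running summarize_file("stringtie_merged.gtf") returns a Counter. If the gene_name and ref_gene_id counts come out at zero, that would suggest the reference annotation was not matched during the merge, though the script alone cannot tell why.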