I ran the Tuxedo pipeline on Galaxy (TopHat2, Cufflinks, Cuffmerge, Cuffdiff). I expected my Cuffdiff output to contain gene names, so that I could directly identify genes in downstream analyses. However, they seem to be missing, and instead I only have a list of transcript ids (all isoforms) for each gene.
At all steps (Cufflinks, Cuffdiff) I used the reference genome downloaded from UCSC, with Ensembl annotations. Now when I open, for example, the gene FPKM tracking file, the columns tracking_id and gene_id are identical and contain XLOC ids. The gene_short_name column contains a list of Ensembl transcript ids (although it is a gene-level file, it just lists all transcript ids belonging to that gene).
So to me it looks like the columns are not filled appropriately. I wondered if somebody knows what I might have done wrong or has encountered a similar problem.
Otherwise, I have been looking for a way to fish out only the Ensembl ids I need, based on a list of XLOC ids that I am interested in (e.g. a subset of 1000 or 3000). In other words, given a list of e.g. 1000 XLOC ids, I would like to pull only those 1000 rows out of the file containing all of them and assemble them into a new data table; the rows are not consecutive in the original file. Any suggestions are very welcome :)
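
To make the subsetting step concrete, here is a rough sketch in Python of what I have in mind. It assumes the tracking file is tab-delimited with a header row containing a tracking_id column, and that the XLOC ids of interest are in a plain text file, one id per line. The function names and the id-list layout are my own assumptions, not anything Galaxy or Cuffdiff provides.

```python
import csv


def load_ids(ids_path):
    """Read one XLOC id per line into a set, skipping blank lines."""
    with open(ids_path) as handle:
        return {line.strip() for line in handle if line.strip()}


def subset_tracking_file(tracking_path, ids_path, out_path, id_column="tracking_id"):
    """Copy the header plus every row whose id_column value is in the id list.

    Rows are written in the order they appear in the input file.
    Returns the number of data rows written.
    """
    wanted = load_ids(ids_path)
    kept = 0
    with open(tracking_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        header = next(reader)
        id_index = header.index(id_column)  # raises ValueError if the column is absent
        writer.writerow(header)
        for row in reader:
            # Skip empty or truncated lines instead of failing on them.
            if len(row) > id_index and row[id_index] in wanted:
                writer.writerow(row)
                kept += 1
    return kept
```

For example, `subset_tracking_file("genes.fpkm_tracking", "xloc_ids.txt", "subset.tsv")` would write the matching rows to subset.tsv (the last two file names are placeholders). Comparing the returned count with the number of ids in the list is a quick check for ids that were not found.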