Gene_short_name missing in CuffDiff output (or how to retrieve it?)
1
0
Entering edit mode
8.5 years ago
nevev ▴ 110

Dear All,

I ran Tuxedo at Galaxy, using TopHat2-Cufflinks-Cuffmerge-Cuffdiff. I expected my Cuffdiff output to contain gene_name, so that I could directly identify genes in downstream analyses. However, it seems to be missing and I only have a list of transcript ids (all isoforms) for each gene instead.

I used Reference genome at all steps (Cufflinks, Cuffdiff) downloaded from UCSC with Ensembl annotations. Now when I e.g. open a file with gene fpkm tracking, my columns tracking_id and gene_id are the same and contain XLOC ids. The column with gene_short_name contains a list of Ensembl transcript ids (although it's a gene file, it just puts all transcript ids belonging to that gene there).

So to me it looks like the columns are not filled appropriately. I wondered if somebody knows what I might have done wrong or has encountered a similar problem.

Else, I have been looking for a way to "fish-out" only the Ensembl ids I need based on a list of XLOC ids that I am interested in (e.g. a subset of 3000, or a 1000). (So, to based on e.g. a 1000 of XLOC ids subset only that 1000 of rows from the file containing all of them (i.e. to search out these rows and then assemble a data table with these rows only; they are not subsequent rows in the original file)). Any suggestions are very welcome :)

Best regards,

Monika

RNA-Seq Cuffdiff Tuxedo Cufflinks Galaxy • 6.2k views
1
Entering edit mode
8.5 years ago
David Fredman ★ 1.1k

Part 1: I have trouble following this. It might help if you post a short snippet of the files that you believe are incorrectly formatted (and what you expected them to look like)

Part 2: Here's a small Perl script to subset your files by an XLOC id.

0
Entering edit mode

Dear David,

Thank you for your answer! Would you have an idea how to do something similar in R?

Below I paste a piece of a file that I referred to (hopefully it works). What I expect would be a short gene symbol in the gene_short_name (v5) column. Instead there are isoform ids there. This is a gene.fpkm_tracking file from CuffDiff output.

I also thought that gene_id column should probably contain gene ids from ensembl, or at least the isoform ids which are now in gene_short_name column... No idea what could go wrong, but it is annoying for downstream analyses, because all I get are the isoform numbers all the time...

Any suggestions are very welcome :) Monika

tracking_id     class_code     nearest_ref_id     gene_id     gene_short_name
XLOC_000001     -     -     XLOC_000001     ENST00000450305,ENST00000456328,ENST00000515242,ENST00000518655
XLOC_000002     -     -     XLOC_000002     ENST00000469289,ENST00000473358,ENST00000607096
XLOC_000003     -     -     XLOC_000003     ENST00000594647,ENST00000606857
XLOC_000004     -     -     XLOC_000004     ENST00000492842
XLOC_000005     -     -     XLOC_000005     ENST00000335137
XLOC_000006     -     -     XLOC_000006     ENST00000442987
XLOC_000007     -     -     XLOC_000007     ENST00000496488
XLOC_000008     -     -     XLOC_000008     ENST00000419160,ENST00000423728,ENST00000425496,ENST00000431321,ENST00000431812,ENST00000432964,ENST00000440038,ENST00000440163,ENST00000445840,ENST00000453935,ENST00000455207,ENST00000455464,ENST00000514436,ENST00000599771,ENST00000601486,ENST00000601814,ENST00000608420


0
Entering edit mode

UPDATE:

So, I found out a way to retrieve gene names based on XLOC_ids / from a gene set subsetted by other means, as described here: http://seqanswers.com/forums/showthread.php?t=18357 , with featureNames. Basically:

> featureNames(CuffGeneSet)


Still, instead of actual gene_short_name ids I obviously get all the ENST ids (as above), because this is what my data tables contain instead of the gene names...

Still I would be grateful if somebody had an idea why I have wrong data in that column!

1
Entering edit mode

You write that you obtained the gene reference set (Ensembl) from UCSC. Is the gtf/gff file formatted exactly as required by Cufflinks?

In particular, do you have gene_id or gene_name attributes in your gff2 or gff3 file?

Since you work in a common species (human) you could, rather than spend time trying to fix formatting, use the genome and gene reference files from this link. They have all the attributes that cufflinks requires for its various analyses (isoform, promoter etc).

0
Entering edit mode

Hm... I'm not sure, I will check the file, but it is likely that it doesn't have the right attributes then... I will check the cufflinks link you advise, that sounds like the best option! I need it already at the alignment step right?

Do you know how I could upload it directly to Galaxy?

Thank you very much, this helps a lot!

Best regards, Monika

1
Entering edit mode

It may be fine to keep the read mappings you've already done (presumably with tophat), as long as it could parse the exon/intron boundaries from the file you were using to improve read mapping at splice junctions.

If you are not concerned with finding novel transcripts, you could just redo the cuffdiff step with new reference files. Otherwise, you'll need to start over from the cufflinks step.

Not sure if these files are available directly in any of the galaxy instances. Check 'shared data' > 'data libraries' to see if its listed (as igenomes).

0
Entering edit mode

My primary interest is expression indeed, running only CuffDiff would save me a LOT of time.

I found a way to upload it, but it is uploading a zipped file, not sure if it will work... I will check shared data, great idea.

Thank you once more!

P.s. There is a file, but it is only ~100Mb... I will see if it helps anywhere :)

0
Entering edit mode

UPDATE: Running CuffDiff with the small .gtf file from shared data igenomes hg_19 worked, but:

1. I used it in place of merged.gtf (it was the only option, as I couldn't use it in Ref sequence... I think that option needs a fasta file, and so I use Locally cached there... else, I don't know what file to pass there, suggestions are more than welcome...)

2. I now see gene name in all columns: tracking_id, gene_id, gene_short_name... I think now I'm missing all the Cufflinks input... and I have far less (50% less) genes tested!

1
Entering edit mode

Yes, any novel isoforms and transcripts previously identified with cufflinks will not be present in the (Ensembl?) reference .gtf, of course. Cuffdiff will only estimate the (differential) abundance of the transcripts in the gtf provided. If you want the novel transcripts, you have to start from the cufflinks step with this new reference file.

0
Entering edit mode

Yes, I realized that too... I received a piece of advice from Galaxy and used the iGenomes hg_19 available in their Library (same as above I think), but starting from CuffMerge step. It seems to me (just by a quick look at the files) that when I add the iGenomes hg_19 as reference at CuffMerge step, then I preserve the novel transcripts as well (I see a number of genes with no gene_name (denoted as " - "), but I have their chromosomal locations... So it seems to have worked well.

If you think there might be some conflict though and re-running CuffLinks would be better, please let me know!

Thank you a lot for all the advice so far! Much appreciated! :)

Here is my discussion at Galaxy forum (more details): https://biostar.usegalaxy.org/p/8478/#8502

0
Entering edit mode

sounds just fine, as long as cuffmerge will keep track of the reference gtf attributes