Question: Gene_short_name missing in CuffDiff output (or how to retrieve it?)
0
gravatar for nevev
4.7 years ago by
nevev90
Germany
nevev90 wrote:

Dear All,

I ran Tuxedo at Galaxy, using TopHat2-Cufflinks-Cuffmerge-Cuffdiff. I expected my Cuffdiff output to contain gene_name, so that I could directly identify genes in downstream analyses. However, it seems to be missing and I only have a list of transcript ids (all isoforms) for each gene instead.

I used Reference genome at all steps (Cufflinks, Cuffdiff) downloaded from UCSC with Ensembl annotations. Now when I e.g. open a file with gene fpkm tracking, my columns tracking_id and gene_id are the same and contain XLOC ids. The column with gene_short_name contains a list of Ensembl transcript ids (although it's a gene file, it just puts all transcript ids belonging to that gene there).

So to me it looks like the columns are not filled appropriately. I wondered if somebody knows what I might have done wrong or has encountered a similar problem.

Else, I have been looking for a way to "fish-out" only the Ensembl ids I need based on a list of XLOC ids that I am interested in (e.g. a subset of 3000, or a 1000). (So, to based on e.g. a 1000 of XLOC ids subset only that 1000 of rows from the file containing all of them (i.e. to search out these rows and then assemble a data table with these rows only; they are not subsequent rows in the original file)). Any suggestions are very welcome :)

Best regards,

Monika

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by nevev90
1
gravatar for David Fredman
4.7 years ago by
David Fredman980
University of Bergen, Norway
David Fredman980 wrote:

Part 1: I have trouble following this. It might help if you post a short snippet of the files that you believe are incorrectly formatted (and what you expected them to look like)

Part 2: Here's a small Perl script to subset your files by an XLOC id.

 

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by David Fredman980
0
gravatar for nevev
4.7 years ago by
nevev90
Germany
nevev90 wrote:

Dear David,

Thank you for your answer! Would you have an idea how to do something similar in R?

Below I paste a piece of a file that I referred to (hopefully it works). What I expect would be a short gene symbol in the gene_short_name (v5) column. Instead there are isoform ids there. This is a gene.fpkm_tracking file from CuffDiff output.

I also thought that gene_id column should probably contain gene ids from ensembl, or at least the isoform ids which are now in gene_short_name column... No idea what could go wrong, but it is annoying for downstream analyses, because all I get are the isoform numbers all the time...

Any suggestions are very welcome :) Monika

tracking_id class_code nearest_ref_id gene_id gene_short_name
XLOC_000001 - - XLOC_000001 ENST00000450305,ENST00000456328,ENST00000515242,ENST00000518655
XLOC_000002 - - XLOC_000002 ENST00000469289,ENST00000473358,ENST00000607096
XLOC_000003 - - XLOC_000003 ENST00000594647,ENST00000606857
XLOC_000004 - - XLOC_000004 ENST00000492842
XLOC_000005 - - XLOC_000005 ENST00000335137
XLOC_000006 - - XLOC_000006 ENST00000442987
XLOC_000007 - - XLOC_000007 ENST00000496488
XLOC_000008 - - XLOC_000008 ENST00000419160,ENST00000423728,ENST00000425496,ENST00000431321,ENST00000431812,ENST00000432964,ENST00000440038,ENST00000440163,ENST00000445840,ENST00000453935,ENST00000455207,ENST00000455464,ENST00000514436,ENST00000599771,ENST00000601486,ENST00000601814,ENST00000608420

 

 

 

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by nevev90
0
gravatar for nevev
4.7 years ago by
nevev90
Germany
nevev90 wrote:

UPDATE:

 

So, I found out a way to retrieve gene names based on XLOC_ids / from a gene set subsetted by other means, as described here: http://seqanswers.com/forums/showthread.php?t=18357 , with featureNames. Basically:

> featureNames(CuffGeneSet)

Still, instead of actual gene_short_name ids I obviously get all the ENST ids (as above), because this is what my data tables contain instead of the gene names...

Still I would be grateful if somebody had an idea why I have wrong data in that column!

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by nevev90
1

You write that you obtained the gene reference set (Ensembl) from UCSC. Is the gtf/gff file formatted exactly as required by Cufflinks http://cufflinks.cbcb.umd.edu/gff.html ?

In particular, do you have gene_id or gene_name attributes in your gff2 or gff3 file?

Since you work in a common species (human) you could, rather than spend time trying to fix formatting, use the genome and gene reference files from http://cufflinks.cbcb.umd.edu/igenomes.html . They have all the attributes that cufflinks requires for its various analyses (isoform, promoter etc).

ADD REPLYlink written 4.7 years ago by David Fredman980

Hm... I'm not sure, I will check the file, but it is likely that it doesn't have the right attributes then... I will check the cufflinks link you advise, that sounds like the best option! I need it already at the alignment step right?

Do you know how I could upload it directly to Galaxy?

Thank you very much, this helps a lot! 

Best regards, Monika

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by nevev90
1

It may be fine to keep the read mappings you've already done (presumably with tophat), as long as it could parse the exon/intron boundaries from the file you were using to improve read mapping at splice junctions. 

If you are not concerned with finding novel transcripts, you could just redo the cuffdiff step with new reference files. Otherwise, you'll need to start over from the cufflinks step.
Not sure if these files are available directly in any of the galaxy instances. Check 'shared data' > 'data libraries' to see if its listed (as igenomes).

ADD REPLYlink written 4.7 years ago by David Fredman980

My primary interest is expression indeed, running only CuffDiff would save me a LOT of time.

I found a way to upload it, but it is uploading a zipped file, not sure if it will work... I will check shared data, great idea.

Thank you once more!

P.s. There is a file, but it is only ~100Mb... I will see if it helps anywhere :)

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by nevev90

UPDATE: Running CuffDiff with the small .gtf file from shared data igenomes hg_19 worked, but:

1. I used it in place of merged.gtf (it was the only option, as I couldn't use it in Ref sequence... I think that option needs a fasta file, and so I use Locally cached there... else, I don't know what file to pass there, suggestions are more than welcome...)

2. I now see gene name in all columns: tracking_id, gene_id, gene_short_name...  I think now I'm missing all the Cufflinks input... and I have far less (50% less) genes tested!

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by nevev90
1

Yes, any novel isoforms and transcripts previously identified with cufflinks will not be present in the (Ensembl?) reference .gtf, of course. Cuffdiff will only estimate the (differential) abundance of the transcripts in the gtf provided. If you want the novel transcripts, you have to start from the cufflinks step with this new reference file.

ADD REPLYlink written 4.6 years ago by David Fredman980

Yes, I realized that too... I received a piece of advice from Galaxy and used the iGenomes hg_19 available in their Library (same as above I think), but starting from CuffMerge step. It seems to me (just by a quick look at the files) that when I add the iGenomes hg_19 as reference at CuffMerge step, then I preserve the novel transcripts as well (I see a number of genes with no gene_name (denoted as " - "), but I have their chromosomal locations... So it seems to have worked well.

If you think there might be some conflict though and re-running CuffLinks would be better, please let me know!

Thank you a lot for all the advice so far! Much appreciated! :)

Here is my discussion at Galaxy forum (more details): https://biostar.usegalaxy.org/p/8478/#8502 .

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by nevev90

sounds just fine, as long as cuffmerge will keep track of the reference gtf attributes

ADD REPLYlink written 4.6 years ago by David Fredman980
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1866 users visited in the last hour