If I were to use Cufflinks in de novo mode to find transcripts or genes in my data that did not align to known transcripts from UCSC or Ensembl, I wouldn't know what to do downstream. How would one go about confirming that these are indeed novel? What validation steps would one take (computational or non-computational), what in-depth information can I look for, and which databases would have useful information for me? Thanks!
How to decide if a transcript predicted by Cufflinks is 'novel'
A 'novel' transcript could simply be defined as one that has not been observed before. If the transcript structure you observe is not currently represented in RefSeq, Ensembl, or UCSC there is a good chance that it might be novel. You can view transcripts from each of these sources in IGV or the UCSC browser, etc.
If the putative novel transcript is an alternative isoform of a known gene, examine its structure. Is there a particular feature of the transcript that is distinctive (e.g. a novel exon, an exon skipping event, intron retention)? To check whether that feature has been seen before, you can examine the complete corpus of mRNAs and ESTs from GenBank for your species. You can view these in the UCSC browser, or download them in FASTA format here: est.fa.gz and mrna.fa.gz (again using human as an example). If your putative novel transcript is not in a known gene region at all, does it share any similarity to known transcripts?
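As a rough illustration of what "examine the structure" can mean computationally, here is a minimal Python sketch that compares the intron chain of a candidate transcript against known transcripts of the same gene. The function names, the coordinate convention (exons as `(start, end)` tuples on one chromosome and strand) and the example coordinates are all made up for illustration. Cuffcompare, which ships with Cufflinks, does this kind of comparison properly, so treat this only as a way to see the idea.

```python
def intron_chain(exons):
    """Return the introns implied by a list of (start, end) exons.

    Each intron is recorded as (end of upstream exon, start of downstream exon).
    """
    exons = sorted(exons)
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def classify(candidate_exons, known_transcripts):
    """Compare a candidate's intron chain with known transcripts.

    known_transcripts: dict of transcript name -> list of (start, end) exons.
    Returns (label, novel_introns). The labels are illustrative, not standard.
    """
    cand = intron_chain(candidate_exons)
    if not cand:
        # A single-exon candidate has no junctions to compare.
        return ("single_exon", [])

    known_chains = [intron_chain(ex) for ex in known_transcripts.values()]
    if cand in known_chains:
        return ("known", [])

    known_introns = {intron for chain in known_chains for intron in chain}
    novel = [intron for intron in cand if intron not in known_introns]
    if novel:
        # At least one splice junction is absent from every known transcript.
        return ("novel_junction", novel)
    # Every junction is known, but not in this combination.
    return ("novel_combination", [])


# Hypothetical three-exon gene and a candidate that skips the middle exon.
known = {"T1": [(100, 200), (300, 400), (500, 600)]}
print(classify([(100, 200), (500, 600)], known))
```

In the example the candidate skips the middle exon, so the junction from 200 to 500 is reported as novel. That junction is then the specific feature to look for in the mRNA/EST data and to target with primers.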
Validation of novel transcripts
This will commonly involve some combination of RT-PCR, qPCR, cloning and Sanger sequencing. Does the predicted transcript sequence contain an ORF? Try feeding it into ORF finder, for example. If not, does it have features of any known type of RNA gene? You could try folding it; many RNA-folding tools already exist.
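If you want a quick first look before using a dedicated tool, the ORF question can be sketched in a few lines of Python. This is only a naive scan for the longest ATG-to-stop stretch on either strand, assuming the standard stop codons and a plain DNA string; the function names and the toy sequence are made up for illustration, and this is not how ORF finder itself works.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard stop codons (assumed genetic code)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string; non-ACGT characters are left as-is."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq):
    """Yield (start, end) of every ATG..stop ORF in the three frames of seq."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                yield (start, i + 3)  # end is exclusive and includes the stop
                start = None


def longest_orf(seq):
    """Return (orf_sequence, strand) for the longest complete ORF, or ("", "+")."""
    seq = seq.upper()
    best = ("", "+")
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for start, end in orfs_in_strand(s):
            if end - start > len(best[0]):
                best = (s[start:end], strand)
    return best


# Toy sequence, not real data.
print(longest_orf("CCATGAAATTTGGGTAACC"))  # ('ATGAAATTTGGGTAA', '+')
```

A long ORF relative to the transcript length is a hint that the transcript may be coding, but short ORFs turn up by chance in any sequence, so the result is only a starting point for the homology searches mentioned in this thread.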
Functional validation will depend on what you find above...
Firstly, see my comment above regarding my personal opinion of how useful this would be for your situation. But, should you disagree (and when you do so and get a Nature paper because of it, rest assured that I will eat a sufficient amount of crow):
For a non-automated step, I would BLAST any hits to see if someone has picked up something similar (some projects have found a HUGE number of random transcripts). Having said that, even if it's been seen before, that doesn't mean anyone has followed up on it. So, even if something isn't novel (strictly speaking), that doesn't mean it's not interesting for follow-up.
Look to see how conserved this region is in related species. If a region is unconserved, the odds are good that it's just noise (i.e., it may be transcribed, but it probably does nothing). I would strongly encourage you to place a good bit of emphasis on this in ranking candidates for follow-up. If a region is conserved, people will MUCH more readily believe that what you found is meaningful.
Does the transcript look like it might encode a protein (look for an open reading frame, etc.)? If so, does it have homology to anything?
Does any of the ENCODE data suggest that this might be a gene (PolII binding, histone modifications, etc.)? If the ENCODE data suggests that there might be transcription there but the region isn't conserved, I would follow my recommendations above for non-conserved regions.
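To make the conservation-based ranking above concrete, here is a small Python sketch. It assumes you have already extracted per-base conservation scores between 0 and 1 for each candidate (for example phastCons-style scores from a conservation track); the candidate names and numbers below are invented, and using the mean as the summary is just one simple choice.

```python
from statistics import mean


def rank_by_conservation(scores_by_candidate):
    """Return (name, mean score) pairs sorted from most to least conserved.

    scores_by_candidate: dict of candidate name -> list of per-base scores.
    Candidates with no scores are skipped rather than treated as zero.
    """
    ranked = [
        (name, mean(scores))
        for name, scores in scores_by_candidate.items()
        if scores
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Invented example values, for illustration only.
candidates = {
    "cand_A": [0.9, 0.8, 1.0],
    "cand_B": [0.1, 0.0, 0.2],
    "cand_C": [],
}
for name, score in rank_by_conservation(candidates):
    print(name, round(score, 2))
```

Candidates near the top of such a list would be the ones to prioritise for the wet-bench work; the ones near the bottom fall under the caveats about non-conserved regions.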
Those are some initial non-wet bench things to do to get you started. Among the wet-bench follow-ups:
Northern blot/qPCR/whatever to look at tissue distribution (in your case, I guess also to look in non-xenograft samples).
RACE or some other method to try to assess the full length of the transcript.
Generate an antibody against it to see if it actually makes a protein (there are other ways of doing this, of course).
There are a number of other things one could do, mostly dependent upon whether the transcript is coding or not.