I see a lot of references on the web to the fact that a genome may have been sequenced and the order of its genes completely determined, yet the orientation of the genes (the DNA strand they reside on) may not be available for all of them.
Can you help me understand how that is possible? I would understand if the orientation were undetermined for ALL genes, since there are two possible starting points for reading a sequence, but how could we be sure of the orientation of only a few of them?
For example, classical genetics may have informed us of the orientation of a given gene, and when that gene is on a contig with other genes, the orientation of the entire contig's gene content can be correctly inferred. If there is a contig, however, with no such "anchor," then gene orientation remains unknown even though we have sequence and gene order in hand.
Remember that genomes rarely come complete in the strict sense of the word.
It depends on how the gene structure was inferred. If there is any evidence of homology to previously known genes, then we can try to guess the orientation based on the known homologous gene.
But if it was assembled purely from sequencing reads where strandedness was lost, then we really have no way of knowing what orientation it should be in.
Most standard 454 and Illumina RNA-seq libraries are generated in such a way that the strand information of the mRNA is lost. Basically, it means that a read can come from either strand of the double-stranded cDNA made from the mRNA, so the read alone does not tell you which strand was transcribed.
"...an orientation for ALL genes..." I think you should ask yourself, how do we know the orientation of ANY gene? DK is on the right track by saying that it depends on how the gene structure was inferred. If someone hands you a piece of DNA and says this is a gene sequence, but the strand is unknown, what would you do? (This happened to me recently, but it was with 19,000 sequences.) There are a variety of things you can try to deduce orientation. You can try homology to known genes, but then you have to decide on cutoffs. You can try translating the sequence from each strand and looking for stop codons or an initiating methionine; this will help you pick the most likely coding strand for many sequences, but not all. You can examine the codon bias of each strand. You can examine biases in the base composition at each codon position for each strand (a Polish group applied this technique to yeast in the late 1990s, trying to decipher which of the 6000 or so predicted genes in the newly sequenced yeast genome were real). There are a variety of informatic consistency checks you can use to infer strand (orientation). You could (and should) even take it one step further and ask, how do we know any given gene really is a gene?
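To make the "translate each strand and look for stop codons" idea concrete, here is a minimal sketch in Python. It only compares the longest ATG-to-stop open reading frame on each strand. The function names, the standard stop codon set, and the `min_margin` threshold are my own assumptions for illustration, not an established tool, and a real pipeline would combine this with homology and codon bias.

```python
# Illustrative sketch only: names and the min_margin threshold are assumptions.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq):
    """Length (in codons, stop excluded) of the longest ATG..stop ORF
    found in the three reading frames of the given strand."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon
            elif start is not None and codon in STOPS:
                best = max(best, (i - start) // 3)
                start = None  # look for the next ORF in this frame
    return best


def guess_strand(seq, min_margin=10):
    """Return '+', '-', or '?' depending on which strand carries a
    clearly longer ORF. '?' means the evidence is too weak to call."""
    fwd = longest_orf(seq)
    rev = longest_orf(reverse_complement(seq))
    if abs(fwd - rev) < min_margin:
        return "?"
    return "+" if fwd > rev else "-"


if __name__ == "__main__":
    toy_gene = "ATG" + "GCT" * 30 + "TAA"
    print(guess_strand(toy_gene))
    print(guess_strand(reverse_complement(toy_gene)))
```

Note that the function deliberately returns "?" when the two strands look similar, which is exactly the "many, but not all" situation described above.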
I think you won't really know the orientation of genes in a novel organism until you have several layers of evidence gathered from empirical observations (experiments). Strand-specific sequencing protocols (as mentioned by DK) let you determine, for any given RNA transcript produced from a locus, which strand it came from. Orientation-specific chromatin signatures (H3K4me3 at the 5' end of a gene) are another layer. Perhaps even proteomic data can tell you what peptides are produced by a given locus, and thus which orientation produced the protein (if it's coding). To answer your question: to be sure of the orientation of a few, you need evidence. You can infer something about ALL from a few, but certainty for any has to be built on evidence.
Given the genomic sequence of a gene, the strandedness is usually very easy to determine. For intronless genes (e.g. bacterial) the open reading frame is a strong giveaway. For genes with introns, the exon structure (and splice signals) usually suffice, and there are certain compositional biases (e.g. G+T skew) that give further evidence throughout the introns.
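As a rough illustration of how splice signals give away the strand, the sketch below counts canonical GT...AG introns against their reverse-complement form, CT...AC, as seen on the reference strand. It assumes you already have candidate intron coordinates (0-based, end-exclusive), for example from a spliced alignment. The function name and the simple majority vote are my own assumptions, and non-canonical introns are simply ignored.

```python
# Illustrative sketch only: function name and majority vote are assumptions.
def strand_from_introns(genome, introns):
    """Infer gene strand from intron boundary dinucleotides.

    genome  : reference-strand DNA string
    introns : list of (start, end) pairs, 0-based, end-exclusive
    Returns '+', '-', or '?' when the votes are tied or absent.
    """
    plus = minus = 0
    for start, end in introns:
        left = genome[start:start + 2].upper()   # first two intron bases
        right = genome[end - 2:end].upper()      # last two intron bases
        if left == "GT" and right == "AG":
            plus += 1    # canonical intron read on the reference strand
        elif left == "CT" and right == "AC":
            minus += 1   # canonical intron on the opposite strand
    if plus == minus:
        return "?"
    return "+" if plus > minus else "-"


if __name__ == "__main__":
    toy = "AAAA" + "GT" + "CCCC" + "AG" + "TTTT"
    print(strand_from_introns(toy, [(4, 12)]))
```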
It would be informative to see a couple of such references from the literature to see what they precisely mean when stating that not all genes can be assigned an orientation. My guess is that the genomes in question were not fully "finished", but rather produced as a draft with many gaps in the sequence. Given a short enough contig (bounded by gaps), there may not be enough information to deduce the absolute orientation of the contig relative to the rest of the chromosome. (Similarly, the relative orientation of several contigs in a scaffold might be known, but the absolute orientation of the scaffold relative to the chromosome might not be.) Thus the absolute orientation of a gene might remain unknown, even when the strandedness within its contig is clear.
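The distinction between strand within a contig and strand relative to the chromosome can be sketched as a small coordinate transform. This is a minimal illustration under my own assumptions: coordinates are 0-based and end-exclusive within the contig, `contig_orientation` is "+", "-", or "?" for an unplaced contig, and the function name is hypothetical.

```python
# Illustrative sketch only: names and coordinate conventions are assumptions.
def gene_on_chromosome(gene_start, gene_end, gene_strand,
                       contig_len, contig_orientation):
    """Map a gene from contig coordinates to the contig as placed on the
    chromosome. Returns (start, end, strand); all fields are unknown
    (None, None, '?') when the contig's own orientation is unknown."""
    if contig_orientation == "+":
        return gene_start, gene_end, gene_strand
    if contig_orientation == "-":
        # Reversing the contig mirrors the coordinates and flips the strand.
        flipped = "-" if gene_strand == "+" else "+"
        return contig_len - gene_end, contig_len - gene_start, flipped
    # Strand within the contig is known, but not relative to the chromosome.
    return None, None, "?"


if __name__ == "__main__":
    print(gene_on_chromosome(10, 40, "+", 100, "+"))
    print(gene_on_chromosome(10, 40, "+", 100, "-"))
    print(gene_on_chromosome(10, 40, "+", 100, "?"))
```

The last call is the case described above: the gene's strand within its contig is perfectly clear, yet its absolute orientation stays undetermined until the contig itself is oriented.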