I am currently working on the analysis of whole-exome data, and I need to align my reads to a reference genome. Is there a reference exome out there? As I am using the SureSelect Human All Exon 50Mb kit, I know what I am capturing and sequencing. If there is a reference exome, I could save significant computing time on alignment and downstream analysis. If no such reference exome is available, why is it not a good idea? And what would be the pros and cons of generating one from GENCODE / CCDS / RefSeq? Have you tried this before?
Of course a reference exome exists. The probes that you are using for exome capture were designed to be complementary to something. Generally, the target is some combination of RefSeq/CCDS/others.
However, as pointed out in other answers, you should map to the whole genome. In addition to the reasons given there, consider that the capture process is not perfect, so you will effectively have a small sample of whole-genome sequencing data mixed in with your exome reads. If the reference contained only exons, the aligner would have nowhere else to place those "off-target" reads, and some would be forced onto exons they merely resemble. They would likely carry lots of mismatches and cause false-positive SNP calls and other havoc.
Anyway, the point is that you should really think about exome sequencing as whole-genome sequencing data that just happens to have deeper coverage near exons, because despite the capture process, there is a small but non-zero chance of getting a read from anywhere in the genome.
For mapping, always try to map to the complete genome. Mapping to the whole genome is actually faster than image analysis and base calling. If you can afford base calling, you have the capacity to do whole-genome mapping.
For the downstream analyses, you may consider extracting only the alignments that overlap the target regions (plus short flanking regions). This may make things faster and the output more convenient. You can use bedtools or the latest samtools. If you are calling SNPs with samtools, you can provide a BED file to call variants in target regions only.
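To make the "target regions plus short flanking regions" idea concrete, here is a minimal pure-Python sketch. It is only an illustration: in practice bedtools or samtools would do this directly on the alignment file. The function names, the 30 bp flank, and the toy coordinates are all made up for the example, and the intervals are assumed to be 0-based, half-open as in BED.

```python
from typing import Iterable, List, Tuple

# (chrom, start, end), 0-based half-open coordinates as in BED
Interval = Tuple[str, int, int]


def parse_bed(lines: Iterable[str]) -> List[Interval]:
    """Read the first three BED columns, skipping blank and header lines."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def pad_and_merge(intervals: List[Interval], flank: int) -> List[Interval]:
    """Extend every target by `flank` bases on each side, then merge overlaps."""
    padded = sorted((c, max(0, s - flank), e + flank) for c, s, e in intervals)
    merged: List[Interval] = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def overlaps_target(chrom: str, start: int, end: int,
                    targets: List[Interval]) -> bool:
    """True if the alignment [start, end) overlaps any padded target.

    Linear scan for clarity; a real tool would use an index.
    """
    return any(c == chrom and start < e and end > s for c, s, e in targets)


# Hypothetical target file contents and alignments, for illustration only.
bed_lines = ["chr1\t100\t200", "chr1\t250\t300", "chr2\t500\t600"]
targets = pad_and_merge(parse_bed(bed_lines), flank=30)

alignments = [("chr1", 320, 420), ("chr1", 400, 500), ("chr2", 10, 110)]
kept = [a for a in alignments if overlaps_target(*a, targets)]
```

With these toy inputs, the two chr1 targets fuse into one padded region, and only the first alignment is kept because it reaches into the flank of that region.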
One could argue that this loses the data in off-target regions, but I have never seen people analyze those regions.
As to the exact target regions, Agilent must have given you a file describing all the target regions, or at least the positions of the probes?
I am always wary of subsets of the data, and of what they really represent. But I'd be particularly wary about what is or is not an exon. A recent discussion we had about constitutive exons made me question this again.
I know that not all exons are represented in the human gene collections yet. And the recent modENCODE paper on the fly transcriptome only made it more apparent to me that we don't know what an exon is yet. They found 100,000+ exons, half of them new or revised. 23,000 new splice junctions. "Of the new alternative exons, 8,226 were previously annotated as constitutive."
If I were going to make a call on which exons are really expressed, I'd go all the way to the EST data to do that. And I think even that is missing a lot of spatially and temporally restricted exons.
Thanks, Ryan, for your thoughtful comments!
+1 for the last paragraph. Well said.
+1 for the last two paragraphs. Btw, the link to 'neartarget.pdf' is not valid anymore!