Question: Is There Any Reference Exome ?
6
gravatar for Khader Shameer
8.0 years ago by
Manhattan, NY
Khader Shameer18k wrote:

I am currently working on the analysis of whole exome data. I need to align my reads with reference genome. I am wondering if there is any reference exome out there ? As I am using a SureSelect Human All Exon 50Mb Kits, I know what I am currently capturing and sequencing. If there is a reference exome, I could save significant computing time on alignment and further downstream analysis. If no such reference exome available, why it is not a good idea ? or What could be the pros and cons on generating one using GENCODE / CCDS / RefSeq ? Have you tried this before ?

Looking forward for your thoughts.

exome next-gen sequencing • 7.9k views
ADD COMMENTlink modified 8.0 years ago by Ryan Thompson3.4k • written 8.0 years ago by Khader Shameer18k
9
gravatar for Ryan Thompson
8.0 years ago by
Ryan Thompson3.4k
TSRI, La Jolla, CA
Ryan Thompson3.4k wrote:

Of course a reference exome exists. The probes that you are using for exome capture were designed to be complementary to something. Generally, the target is some combination of RefSeq/CCDS/others.

However, as pointed out in other answers, you should map to the whole genome. In addition to the reasons given in other answers, consider that the capture process is not perfect, so you will effectively have a small sample of whole-genome sequencing data mixed in with your exome reads. You wouldn't want those "off-target" reads to be mapped on your exons, since they would likely have lots of mismatches and cause false-positive SNP calls and other havoc.

Also, there is not a well-defined boundary between "on-target" and "off-target" -- coverage tails off gradually as you get away from the probed regions: http://mneme.dyndns.org/~ryan/examples/neartarget.pdf

Anyway, the point is that you should really think about exome sequencing as whole-genome sequencing data that just happens to have deeper coverage near exons, because despite the capture process, there is a small but non-zero chance of getting a read from anywhere in the genome.

ADD COMMENTlink modified 8.0 years ago • written 8.0 years ago by Ryan Thompson3.4k

Thanks Ryan for your thoughtful comments !

ADD REPLYlink written 8.0 years ago by Khader Shameer18k

+1 for the last paragraph. Well said.

ADD REPLYlink written 8.0 years ago by Aaronquinlan10k

+1 for the last two paragraphs. Btw, the link to 'neartarget.pdf' is not valid anymore! 

ADD REPLYlink written 3.8 years ago by Dataman260
5
gravatar for Pierre Lindenbaum
8.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

Khader, I wonder if your idea is not the same (bad) idea that I had a few weeks ago: http://biostar.stackexchange.com/questions/4413 "Exome sequencing: masking the non-genic sequences ?"

ADD COMMENTlink written 8.0 years ago by Pierre Lindenbaum119k
2

agreed. using an adapted reference with the results of a real world experiment will only help to build spurious alignments that may not be real ones. even if the probe set is as perfect as it could, you will always end sequencing parts of the genome that you may not be interested in, and for that reason the reads' origin should not be inferred. I hadn't seen that previous question before, but it reaffirmed my beliefs.

ADD REPLYlink written 8.0 years ago by Jorge Amigo11k
1

no, it's not really the same question but it's the same conclusion.

ADD REPLYlink written 8.0 years ago by Pierre Lindenbaum119k

I think the idea is almost the same. I should have read through all exome questions here before posting this. Do you think we should close this question ?

ADD REPLYlink written 8.0 years ago by Khader Shameer18k

Thanks Pierre and Jorge. I am still not convinced about the aspect of spurious alignments. In protein sequence analysis, I do targeted alignments for the analysis of proteins domains. As I am not interested in unassigned regions, such targeted alignments were more accurate and helpful for my analysis. I think the same could apply in the case of exomes as well.

ADD REPLYlink written 8.0 years ago by Khader Shameer18k

Glad you didn't close this; this is generating a lot of interesting discussion even if the idea is ultimately considered not a good one.

ADD REPLYlink written 8.0 years ago by David Quigley11k

For proteins you care about homology. For exome, you care about SNPs/INDELs. Sequences aligned together must be related in the past, but you would not want to call SNPs. The two tasks are different. For exome sequencing, there are lots of off-target reads. You can never trust captures. If you do analyses, you will know.

ADD REPLYlink written 8.0 years ago by lh331k

For proteins you care about homology. For exome, you care about SNPs/INDELs. Sequences aligned together must be related in the past, but you would not want to call SNPs. The two tasks are different. For exome sequencing, there are lots of off-target reads. You can never trust captures. I am sure you get worse alignment when not mapping to the whole genome. If you do analyses, you will know.

ADD REPLYlink written 8.0 years ago by lh331k

@Heng: I agreed, the protein alignments were driven by homology and exome analysis is focusing on SNP/INDELS. My point is, I have experienced similar scenarios in protein domain alignments. Protein domain alignments are often affected by insertions/deletions. From my experience protein domain sequence vary to great extent within the seed dataset.

ADD REPLYlink written 8.0 years ago by Khader Shameer18k

My point is, they are very different scenarios. Experiences can hardly be transferred.

ADD REPLYlink written 8.0 years ago by lh331k

@Heng: Thanks for sharing your thoughts. I am currently using the whole genome assembly for the alignment.

ADD REPLYlink written 8.0 years ago by Khader Shameer18k

@Heng: Thanks for sharing your thoughts. I am currently using the whole genome assembly for the alignment. But, I disagree with your point on extrapolating protein alignment experience to exome alignments. Indels in exome regions will have a direct impact on the underlying proteome, so I still believe my idea is valid.

ADD REPLYlink written 8.0 years ago by Khader Shameer18k
3
gravatar for lh3
8.0 years ago by
lh331k
United States
lh331k wrote:

For mapping, always try to map to the complete genome. Mapping to the whole genome is actually faster than image analyses and base calling. If you can afford base calling, you must have the capacity to do whole genome mapping.

For the downstream analyses, you may consider to extract alignments overlapping the target regions (plus short flanking regions) only. This may make things faster and the output more convenient. You can use bedtools or the latest samtools. If you are calling SNPs with samtools, you may provide a BED file to call variants in target regions only.

There may be the argument of losing the data in off-target regions, but I have never seen people analyze those regions.

As to the exact target regions, Agilent must have given you a file describing all the target regions, or at least the positions of the probes?

ADD COMMENTlink written 8.0 years ago by lh331k
1

If you have bought the kit, Agilent should have told you the target regions in design. I know the regions for one of their kits, but I am not allowed to give it to others.

ADD REPLYlink written 8.0 years ago by lh331k

Thanks for your detailed reply. I heard from a collegue that probe/target data is available from Agilent - I am searching for a direct link. See: http://seqanswers.com/forums/showthread.php?t=5803 http://biostar.stackexchange.com/questions/5203/human-exome-capture-library-coordinates-download

ADD REPLYlink written 8.0 years ago by Khader Shameer18k

Yeah, I just got the target region from Agilent.

ADD REPLYlink written 8.0 years ago by Khader Shameer18k
2
gravatar for Chris Evelo
8.0 years ago by
Chris Evelo10.0k
Maastricht, The Netherlands
Chris Evelo10.0k wrote:

Well I think it is not necessarily a good idea because an exome is much less stable than a genome. It will be different between different cell types and at different moment.

That being said you could probably just take all exome data currently submitted to repositories and throw on an Amazon grid or so that is big enough and just calculate the average.

ADD COMMENTlink written 8.0 years ago by Chris Evelo10.0k
2

Ahh right.... I though you would not only want to have the exon sequence but also the average amount of expression as a reference. Must be my transcriptomics bias. But this is even nicer. Because you could do what I suggested and then for the sequence just couple it to the reference genome as well. Actually it would be interesting to know whether exome and genome yield the same average sequence.

ADD REPLYlink written 8.0 years ago by Chris Evelo10.0k
1

+1. The idea itself is interesting. But it is capture determined. Also, you will end up with >100Mb (if not much longer) target sequences this way.

ADD REPLYlink written 8.0 years ago by lh331k

Thanks Chris. Irrespective of expression dynamics in various tissues - the whole exome sequence will be a constant entity, so stability won't be an issue here.

ADD REPLYlink written 8.0 years ago by Khader Shameer18k

Thanks Chris. Yes, your second suggestion is interesting. But I would rather get an reference exome using the data used to create the preps. But here the general consensus is that it generating an exome may not be a good idea because it may result in spurious alignments.

ADD REPLYlink written 8.0 years ago by Khader Shameer18k
0
gravatar for Mary
8.0 years ago by
Mary11k
Boston MA area
Mary11k wrote:

I am always wary of subsets of the data, and what they really represent. But I'd be particularly wary with what is or is not an exon. When we had a discussion recently about constitutive exons it made me question this again.

I know that not all exons are represented in the human gene collections yet. And the recent modENCODE paper on the fly transcriptome only made it more apparent to me that we don't know what an exon is yet. They found 100,000+ exons, half of them new or revised. 23,000 new splice junctions. "Of the new alternative exons, 8,226 were previously annotated as constitutive."

If I was going to make a call on really expressed exons, I'd go all the way to the EST data to do that. And I also think that's missing a lot of spatial and temporal exons.

IMHO.

Fly paper: http://www.ncbi.nlm.nih.gov/pubmed/21179090

ADD COMMENTlink written 8.0 years ago by Mary11k
1

Perhaps you are thinking of RNA-seq? For exome sequencing, we use lots of probes to grab genomic regions. The probes are provided by the company and we have little control on this part.

ADD REPLYlink written 8.0 years ago by lh331k

Perhaps you are thinking of RNA-seq?

ADD REPLYlink written 8.0 years ago by lh331k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 808 users visited in the last hour