Question: Decoy In Reference Assembly
Sangwoo Kim (UC San Diego) wrote, 6.5 years ago:

I am using 1000 Genomes data for a new project. While inspecting the reference assembly the project uses, I found that it contains a "decoy" contig.

The 1000 Genomes FAQ says:

For the final round of alignments the sequence data will be mapped to a set of sequences derived from the GRCh37 assembly. This GRCh37-derived alignment set includes chromosomal plus unlocalized and unplaced contigs, the rCRS mitochondrial sequence (AC:NC_012920), Human herpesvirus 4 type 1 (AC:NC_007605) and decoy sequence derived from HuRef, Human Bac and Fosmid clones and NA12878. These files are available in phase2_reference_assembly_sequence on the ftp site. All human variant coordinates reported by the 1000 Genomes project are in GRCh37 coordinates.

I have no idea what the decoy sequence is or why it is included. Is it perhaps there to detect sample contamination?


BTW, for more information, check the 1000g ftp.

(comment written 6.5 years ago by lh3)
zam.iqbal.genome (United Kingdom) wrote, 6.5 years ago:

The reference genome is incomplete, particularly around the centromeres. Reads that truly belong to sequence missing from the reference are therefore often mapped to the wrong place in the genome, because their true match is absent. These mismapped reads cause false-positive calls, which were bothering us in the 1000 Genomes Project. The decoy is a pragmatic solution: it contains known, true human genome sequence that is not in the reference genome, and it will "suck up" reads that would otherwise map with low quality to the reference. The decoy was built by Heng Li at the Broad, working with Richard Durbin (Sanger) and Deanna Church of the Genome Reference Consortium (which maintains the reference genome).
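
To see how many reads the decoy is absorbing in a given alignment, one option is to tally reads per reference name. The following is only a minimal sketch, not part of the 1000 Genomes pipeline: it assumes plain-text SAM input and assumes the decoy contig is named `hs37d5`. The toy records are invented for illustration.

```python
from collections import Counter

# Assumption: the decoy contig is named "hs37d5" in the reference FASTA.
DECOY_CONTIG = "hs37d5"


def count_reads_by_contig(sam_lines):
    """Count alignment records per reference name (SAM column 3, RNAME)."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        counts[fields[2]] += 1
    return counts


def decoy_fraction(counts, decoy=DECOY_CONTIG):
    """Fraction of mapped records that landed on the decoy contig."""
    # Unmapped records carry '*' as their reference name; exclude them.
    mapped = sum(n for name, n in counts.items() if name != "*")
    return counts.get(decoy, 0) / mapped if mapped else 0.0


# Invented toy records: two reads on chromosome 1, one on the decoy,
# and one unmapped read.
toy_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t0\t1\t10000\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\t1\t20000\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\ths37d5\t500\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]

counts = count_reads_by_contig(toy_sam)
print(counts[DECOY_CONTIG], decoy_fraction(counts))
```

Reads counted on the decoy are the ones that would otherwise have been forced onto a chromosome, or left unmapped, if the decoy were absent.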


Thank you for the great answer. So basically, hs37d5 is GRCh37 + decoy. The alignment set also has many small GRCh37 contigs (e.g. GL000xx.1) and a human herpesvirus (NC_007605) sequence. Is my understanding correct that the decoy serves the same purpose as these, except that it was assembled artificially?
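
As a rough illustration of the pieces listed above, here is a small sketch that sorts contig names into those groups. The naming patterns are assumptions based on the names mentioned in this thread (`hs37d5`, `GL000xx.1`, `NC_007605`) plus unprefixed chromosome names and `MT`; `classify_contig` is a hypothetical helper, so check it against the names in your own FASTA index.

```python
import re

# Assumed naming conventions; verify against your own reference index.
UNPLACED_PATTERN = re.compile(r"GL\d+\.\d+")
CHROMOSOME_PATTERN = re.compile(r"(?:[1-9]|1\d|2[0-2]|X|Y)")


def classify_contig(name):
    """Assign a contig name to one component of the alignment set."""
    if name == "hs37d5":
        return "decoy"
    if name == "NC_007605":
        return "herpesvirus"
    if name == "MT":
        return "mitochondrion"
    if UNPLACED_PATTERN.fullmatch(name):
        return "unlocalized_or_unplaced"
    if CHROMOSOME_PATTERN.fullmatch(name):
        return "chromosome"
    return "other"


for contig in ["1", "22", "X", "MT", "GL000191.1", "NC_007605", "hs37d5"]:
    print(contig, classify_contig(contig))
```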

(comment written 6.5 years ago by Sangwoo Kim)

Yes, I guess the decoy has a similar goal to the herpesvirus sequence, except that the reads it captures come from true human sequence, and it is needed only because the reference is incomplete.

(comment written 6.5 years ago by zam.iqbal.genome)

Thank you again!

(comment written 6.5 years ago by Sangwoo Kim)