Question

Defining Your Reference Genome From UCSC For Human NGS Studies

12.9 years ago

Travis ★ 2.8k

Hi,

When creating a reference genome for human NGS studies, do people generally use just the major chromosomal contigs (chr1-22, chrX, chrY, chrM), or do they also include the unplaced (chrUn) and random (chr*_random) contigs?

I had initially assumed I should just go with the main contigs but now have begun to question my original reasoning.

next-gen sequencing genome reference • 5.3k views

Is there a particular reason you're questioning your original reasoning? Usually, I use chromosomal contigs for alignment, but now your question is leaving me wondering if I'm missing something...

Reply by Mitch Bekritsky ★ 1.3k, 12.9 years ago

I question my original decision because it was based on a whim and I noted that there are known polymorphisms associated with the unplaced and random contigs. Since these are sequences that do not map to any of the reference chromosomes, I believe it is probably best to include them in a reference genome whilst excluding the haplotype files.

Reply by Travis ★ 2.8k, 12.9 years ago

score 10 · Answer 1 · 2011-06-02

The "random" contigs contain DNA that we know is in the genome, but that we're having trouble accurately placing into context. For alignment, at least, it's important to use these contigs. Here's why:

If you have reads that originated from a 'random' contig, but the 'random' contigs aren't in your reference sequence, it's quite likely that those reads will be mapped elsewhere in the genome, albeit with lower mapping quality. Some of these mismapped reads will still pass your quality filters, and if enough of them do, they can affect your SNP calling, copy-number assessment, etc.

So yeah, alignment should pretty much always be done against all the sequences.

Ram · Answer 2 · 2011-06-02

I think this might be helpful (the reference genome from 1000 Genomes project)

http://www.1000genomes.org/announcements/release-1000-genomes-main-project-reference-genome-2009-10-12

They say "Create a reference with chrs1-22, X, Y, NC_012920 MT, and include the non-chromosomal supercontigs."

But remember that if you are using BWA for mapping reads, your reference cannot be longer than 4Gb (otherwise BWA will silently fail).
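If you want to check the size before indexing, here is a minimal sketch that totals the bases in a FASTA file. The limit constant just encodes the "4Gb" figure quoted above, read here as 2**32 bases (that interpretation is my assumption), and the function names are hypothetical helpers, not part of BWA.

```python
import io

# Assumption: the "4Gb" figure quoted above, interpreted as 2**32 bases.
ASSUMED_LIMIT_BP = 2**32


def total_reference_length(fasta_handle):
    """Sum the sequence lengths of all records in an open FASTA handle."""
    total = 0
    for line in fasta_handle:
        if line.startswith(">"):
            continue  # header line, carries no sequence
        total += len(line.strip())
    return total


def fits_limit(fasta_handle, limit_bp=ASSUMED_LIMIT_BP):
    """Return True if the combined reference is no longer than limit_bp."""
    return total_reference_length(fasta_handle) <= limit_bp


# Tiny in-memory example; a real file would be opened with open(path).
demo = io.StringIO(">chr1\nACGT\nAC\n>chrUn_gl000211\nGG\n")
print(total_reference_length(demo))
```

For a real reference you would pass `open("your_reference.fa")` instead of the in-memory demo; the file name is only a placeholder.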

score 5 · Answer 3 · 2011-06-02

12.9 years ago

lh3 33k

Here is a longer explanation I have just written. In short, include _random but exclude _alt.
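As a rough illustration of that rule, here is a sketch that filters a UCSC-style FASTA by contig name. It assumes UCSC naming (primary chromosomes as chr1-22, chrX, chrY, chrM; unlocalized contigs ending in `_random`; unplaced contigs starting with `chrUn`; alternate loci ending in `_alt`). Dropping `_hap` contigs as well is my assumption, based on the comment above about excluding the haplotype files. The function names are hypothetical.

```python
import io
import re

# Primary assembled chromosomes under UCSC naming (assumed convention).
PRIMARY = re.compile(r"^chr(\d+|X|Y|M)$")


def keep_contig(name):
    """Keep primary, _random and chrUn contigs; drop _alt and _hap ones."""
    if name.endswith("_alt") or "_hap" in name:
        return False
    if PRIMARY.match(name):
        return True
    return name.endswith("_random") or name.startswith("chrUn")


def filter_fasta(src, dst):
    """Copy only the kept records from src to dst; return the kept names."""
    kept = []
    keep = False
    for line in src:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keep = keep_contig(name)
            if keep:
                kept.append(name)
        if keep:
            dst.write(line)
    return kept


# Small in-memory example with one contig of each kind.
src = io.StringIO(
    ">chr1\nACGT\n>chr1_gl000191_random\nGG\n"
    ">chr6_apd_hap1\nTT\n>chr1_KI270762v1_alt\nCC\n>chrUn_gl000211\nAA\n"
)
dst = io.StringIO()
print(filter_fasta(src, dst))
```

Check the contig names in your own download before relying on the patterns, since naming differs between assemblies and providers.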
