Question: Defining Your Reference Genome From UCSC For Human NGS Studies
7.8 years ago by Travis (2.8k) wrote:


When creating a reference genome for human NGS studies, do people generally just use the major chromosomal contigs (chr1-22, chrX, chrY, chrM), or do they also include the unplaced (chrUn) and random (chr*_random) contigs?

I had initially assumed I should just go with the main contigs but now have begun to question my original reasoning.


Is there a particular reason you're questioning your original reasoning? Usually, I use chromosomal contigs for alignment, but now your question is leaving me wondering if I'm missing something...

Reply written 7.8 years ago by Mitch Bekritsky (1.1k)

I question my original decision because it was made on a whim, and I have since noted that there are known polymorphisms associated with the unplaced and random contigs. Since these are sequences that do not map to any of the reference chromosomes, I believe it is probably best to include them in a reference genome while excluding the haplotype files.

Reply written 7.8 years ago by Travis (2.8k)
7.8 years ago by Chris Miller (20k), Washington University in St. Louis, MO, wrote:

The "random" contigs contain DNA that we know is in the genome, but that we're having trouble accurately placing into context. For alignment, at least, it's important to use these contigs. Here's why:

If you have reads that originated from a 'random' contig, but the 'random' contigs aren't in your reference sequence, it's quite likely that those reads will be mapped elsewhere in the genome, albeit with lower quality. Some of these mismapped reads will still pass your quality filters, and if enough of them do, they can affect your SNP calling, copy-number assessment, etc.

So yeah, alignment should pretty much always be done against all the sequences.

7.8 years ago by Pablo (1.9k) wrote:

I think this might be helpful (the reference genome from the 1000 Genomes Project).

They say "Create a reference with chrs1-22, X, Y, NC_012920 MT, and include the non-chromosomal supercontigs."

But remember that if you are using BWA for mapping reads, your reference cannot be longer than 4Gb (otherwise BWA will silently fail).
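A quick way to check a candidate reference against that size constraint is to sum the sequence lengths in the FASTA before indexing. The sketch below is only an illustration: the function names are made up for this example, and the limit of 4 × 10^9 bases is one reading of "4Gb" from the post, not a value taken from the BWA documentation.

```python
# Assumption: "4Gb" is read here as 4 * 10**9 bases. Adjust if your
# aligner documents a different limit.
ASSUMED_LIMIT = 4_000_000_000


def total_reference_length(lines):
    """Sum the number of sequence characters across all FASTA records."""
    total = 0
    for line in lines:
        if line.startswith(">"):
            continue  # header line, carries no sequence
        total += len(line.strip())
    return total


def fits_limit(lines, limit=ASSUMED_LIMIT):
    """Return True if the total sequence length does not exceed the limit."""
    return total_reference_length(lines) <= limit
```

Both functions accept any iterable of lines, so an open file handle for the reference FASTA can be passed in directly.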

7.8 years ago by lh3 (31k), United States, wrote:

Here is a longer explanation I have just written. In short, include _random but exclude _alt.
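As a rough illustration of that rule, the following sketch filters a UCSC-style FASTA by contig name. It assumes alternate sequences can be recognised by an "_alt" suffix, and it additionally treats names containing "_hap" as haplotype sequences to drop; the "_hap" part is an assumption added here to match the "haplotype files" mentioned in the question, not something stated in this answer. The function names are hypothetical.

```python
def keep_contig(name):
    """Return True for contigs to keep in the alignment reference."""
    lowered = name.lower()
    # Assumption: alternate sequences end in "_alt" and haplotype
    # sequences contain "_hap"; everything else (main chromosomes,
    # chrUn, *_random) is kept.
    if lowered.endswith("_alt") or "_hap" in lowered:
        return False
    return True


def filter_fasta(lines):
    """Yield only the FASTA lines of records that pass keep_contig."""
    keeping = False
    for line in lines:
        if line.startswith(">"):
            # The contig name is the first whitespace-delimited token
            # after ">"; an empty header yields an empty name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keeping = keep_contig(name)
        if keeping:
            yield line
```

Because the filter streams line by line, it can be run over a whole-genome FASTA without loading it into memory. It is worth checking the contig names in your own download first, since naming differs between assemblies.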
