Question: Defining Your Reference Genome From UCSC For Human NGS Studies
When creating a reference genome for human NGS studies, do people generally just use the major chromosomal contigs (chr1-22, chrX, chrY, chrM) or do they also include the unplaced (chrUn_*) and random (chr*_random) contigs?

I had initially assumed I should just go with the main contigs but now have begun to question my original reasoning.

Is there a particular reason you're questioning your original reasoning? Usually, I use chromosomal contigs for alignment, but now your question is leaving me wondering if I'm missing something...

I question my original decision because it was based on a whim and I noted that there are known polymorphisms associated with the unplaced and random contigs. Since these are sequences that do not map to any of the reference chromosomes, I believe it is probably best to include them in a reference genome whilst excluding the haplotype files.

The "random" contigs contain DNA that we know is in the genome, but that we're having trouble accurately placing into context. For alignment, at least, it's important to use these contigs. Here's why:

If you have reads that originated from a 'random' contig, but the 'random' contigs aren't in your reference sequence, it's quite likely that the read will be mapped elsewhere in the genome, albeit at a lower quality. Some of these reads are going to pass your quality filters incorrectly and if enough of them do, it can affect your SNP calling, copy-number assessment, etc.

So yeah, alignment should pretty much always be done against all the sequences.

I think this might be helpful (the reference genome from the 1000 Genomes Project).

They say "Create a reference with chrs1-22, X, Y, NC_012920 MT, and include the non-chromosomal supercontigs."

But remember that if you are using BWA for mapping reads, your reference cannot be longer than 4Gb (otherwise BWA will silently fail).
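One way to check this before building an index is to sum the sequence lengths in the FASTA file. This is only a minimal sketch: the function names are made up for illustration, and the threshold of 4,000,000,000 bases is an assumed reading of the "4Gb" figure above, not a value taken from BWA itself.

```python
from typing import Dict, Iterable

# Assumed reading of the "4Gb" limit mentioned above; adjust for your BWA version.
ASSUMED_MAX_BASES = 4_000_000_000


def contig_lengths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Return {contig name: number of bases} for FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The contig name is the first whitespace-delimited token of the header.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def fits_assumed_limit(fasta_lines: Iterable[str],
                       max_bases: int = ASSUMED_MAX_BASES) -> bool:
    """True if the total reference length is within the assumed limit."""
    return sum(contig_lengths(fasta_lines).values()) <= max_bases
```

For a real file you would pass an open file handle, e.g. `with open("reference.fa") as fh: print(fits_assumed_limit(fh))`, where `reference.fa` is a placeholder name.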

Here is a longer explanation I have just written. In short, include _random but exclude _alt.
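To make the "include _random, exclude _alt" rule concrete, here is a rough sketch of a FASTA filter. It assumes UCSC-style contig names (chr1, chrUn_..., ..._random, ..._hap1, ..._alt); `keep_contig` and `filter_fasta` are hypothetical helper names, and treating `_hap` contigs as excluded follows the haplotype remark earlier in the thread rather than any official recipe.

```python
import re
from typing import Iterable, Iterator

# Primary assembled chromosomes in UCSC naming: chr1-chr22, chrX, chrY, chrM.
_PRIMARY = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y|M)$")


def keep_contig(name: str) -> bool:
    """Keep primary, unplaced (chrUn_*) and *_random contigs; drop _alt and _hap."""
    if name.endswith("_alt") or "_hap" in name:
        return False
    return (
        bool(_PRIMARY.match(name))
        or name.startswith("chrUn_")
        or name.endswith("_random")
    )


def filter_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the header and sequence lines of contigs that pass keep_contig."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Contig name is the first whitespace-delimited token after '>'.
            parts = line[1:].split()
            keep = bool(parts) and keep_contig(parts[0])
        if keep:
            yield line
```

Used on a file, something like `out.writelines(filter_fasta(fh))` would write the reduced reference; check the resulting contig list by hand before indexing, since naming differs between assemblies.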

