Additional Data In Human Genome (hg18 / hg19) Assembly?
10.5 years ago

While indexing hg18 and hg19 (UCSC), I noticed several additional chromosome headers apart from the default ones (chr1-22, M, X, Y). What are they? Should I keep them or remove them from the reference when aligning my whole-exome reads, and what is your reasoning either way?

hg18:

chr1_random chr2_random chr3_random chr4_random chr5_random chr6_random chr7_random chr8_random chr9_random chr10_random chr11_random chr13_random chr15_random chr16_random chr17_random chr18_random chr19_random chr21_random chr22_random chrX_random

hg19:

chr6_ssto_hap7 chr6_mcf_hap5 chr6_cox_hap2 chr6_mann_hap4 chr6_apd_hap1 chr6_qbl_hap6 chr6_dbb_hap3 chr17_ctg5_hap1 chr4_ctg9_hap1 chr1_gl000192_random chrUn_gl000225 chr4_gl000194_random chr4_gl000193_random chr9_gl000200_random chrUn_gl000222 chrUn_gl000212 chr7_gl000195_random chrUn_gl000223 chrUn_gl000224 chrUn_gl000219 chr17_gl000205_random chrUn_gl000215 chrUn_gl000216 chrUn_gl000217 chr9_gl000199_random chrUn_gl000211 chrUn_gl000213 chrUn_gl000220 chrUn_gl000218 chr19_gl000209_random chrUn_gl000221 chrUn_gl000214 chrUn_gl000228 chrUn_gl000227 chr1_gl000191_random chr19_gl000208_random chr9_gl000198_random chr17_gl000204_random chrUn_gl000233 chrUn_gl000237 chrUn_gl000230 chrUn_gl000242 chrUn_gl000243 chrUn_gl000241 chrUn_gl000236 chrUn_gl000240 chr17_gl000206_random chrUn_gl000232 chrUn_gl000234 chr11_gl000202_random chrUn_gl000238 chrUn_gl000244 chrUn_gl000248 chr8_gl000196_random chrUn_gl000249 chrUn_gl000246 chr17_gl000203_random chr8_gl000197_random chrUn_gl000245 chrUn_gl000247 chr9_gl000201_random chrUn_gl000235 chrUn_gl000239 chr21_gl000210_random chrUn_gl000231 chrUn_gl000229 chrUn_gl000226 chr18_gl000207_random
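For anyone who wants to reproduce lists like the two above, here is a minimal sketch that pulls the sequence names out of a FASTA file and keeps the ones that are not default chromosomes. The helper names (`fasta_names`, `extra_names`) and the in-memory demo data are made up for illustration; a real run would open the downloaded genome FASTA instead.

```python
import io

# Assumption: the "default" sequences are chr1-22, chrX, chrY and chrM,
# as stated in the question.
PRIMARY = {f"chr{n}" for n in range(1, 23)} | {"chrX", "chrY", "chrM"}


def fasta_names(handle):
    """Yield the sequence name (first word after '>') of every FASTA record."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def extra_names(handle):
    """Return the names that are not in PRIMARY, in file order."""
    return [name for name in fasta_names(handle) if name not in PRIMARY]


# Tiny in-memory FASTA standing in for a real genome file (hypothetical data).
demo = io.StringIO(">chr1\nACGT\n>chr6_cox_hap2\nACGT\n>chrUn_gl000225\nACGT\n")
print(extra_names(demo))  # ['chr6_cox_hap2', 'chrUn_gl000225']
```

To run it on a real file, pass an open file handle to `extra_names` in place of the `io.StringIO` object.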

10.5 years ago

From the UCSC FAQ on chrN_random tables: http://genome.ucsc.edu/FAQ/FAQdownloads#download10

Question:

"What are the chrN_random_[table] files in the human assembly? Why are they called random? Is there something biologically random about the sequence in these tables or are they just not placed within their given chromosomes?"

Response:

In the past, these tables contained data related to sequence that is known to be in a particular chromosome, but could not be reliably ordered within the current sequence.

Starting with the April 2003 human assembly, these tables also include data for sequence that is not in a finished state, but whose location in the chromosome is known, in addition to the unordered sequence. Because this sequence is not quite finished, it could not be included in the main "finished" ordered and oriented section of the chromosome.

Also, in a very few cases in the April 2003 assembly, the random files contain data related to sequence for alternative haplotypes. This is present primarily in chr6, where we have included two alternative versions of the MHC region in chr6_random. There are a few clones in other chromosomes that also correspond to a different haplotype. Because the primary reference sequence can only display a single haplotype, these alternatives were included in random files. In subsequent assemblies, these regions have been moved into separate files (e.g. chr6_hla_hap1).
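The categories in the FAQ answer are visible in the sequence names themselves. The following is a minimal sketch that sorts names into groups, assuming only the naming patterns seen in the hg18/hg19 lists in the question. The `classify` helper and the group labels are illustrative, not UCSC terminology, and other assemblies may use different patterns.

```python
import re
from collections import Counter

# Assumption: patterns are inferred from the hg18/hg19 names listed in the question.
PRIMARY = re.compile(r"^chr(?:[1-9]|1[0-9]|2[0-2]|X|Y|M)$")
# e.g. chr6_cox_hap2, chr17_ctg5_hap1: alternative haplotypes
HAPLOTYPE = re.compile(r"^chr[0-9XY]+_\w+_hap\d+$")
# e.g. chr1_random (hg18), chr1_gl000192_random (hg19): chromosome known, position not
RANDOM = re.compile(r"^chr[0-9XY]+_(?:\w+_)?random$")
# e.g. chrUn_gl000225: chromosome of origin unknown
UNPLACED = re.compile(r"^chrUn_\w+$")


def classify(name):
    """Return an illustrative group label for one sequence name."""
    if PRIMARY.match(name):
        return "primary"
    if HAPLOTYPE.match(name):
        return "alt_haplotype"
    if RANDOM.match(name):
        return "random"
    if UNPLACED.match(name):
        return "unplaced"
    return "other"


names = ["chr1", "chrM", "chr1_random", "chr6_cox_hap2",
         "chr1_gl000192_random", "chrUn_gl000225"]
print(Counter(classify(n) for n in names))
```

Counting the groups this way gives a quick overview of what a reference contains before deciding what to index.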


I would argue that one should include the *random chromosomes for alignment, as they will help to prevent misalignment owing to paralogy. This affects both exome capture and WGA.


Just to expand on what Aaron said (for others who stumble across this thread): if a read comes from one of these extra contigs but you don't include that contig in your alignment reference, the read may end up mis-mapped to the next-best match, which is often some similar sequence elsewhere in the genome. This is generally a bad thing.
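To make the mis-mapping concrete, here is a toy sketch. It is not a real aligner: it only tries every ungapped placement and picks the one with the fewest mismatches. The sequences and the contig name `chrUn_toy` are invented for the illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_hit(read, reference):
    """Return (name, offset, mismatches) of the best ungapped placement."""
    best = None
    for name, seq in reference.items():
        for i in range(len(seq) - len(read) + 1):
            d = hamming(read, seq[i:i + len(read)])
            if best is None or d < best[2]:
                best = (name, i, d)
    return best


# Hypothetical toy sequences: the read truly comes from chrUn_toy,
# and chr1 carries a similar copy that differs at one base.
full_reference = {
    "chr1": "TTTTACGTACCTACGTTTTT",
    "chrUn_toy": "GGGGACGTACGTACGTGGGG",
}
primary_only = {"chr1": full_reference["chr1"]}

read = "ACGTACGTACGT"

print(best_hit(read, full_reference))  # ('chrUn_toy', 4, 0)
print(best_hit(read, primary_only))    # ('chr1', 4, 1)
```

With the extra contig present, the read lands on its true origin with no mismatches. Without it, the read is still placed, but on chr1 with one mismatch, which downstream could look like a variant.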


No, because my exome capture wasn't designed for those random chromosomes. But maybe I should have considered including them; see: Is There Any Reference Exome?


Thanks Pierre. Did you include them when indexing the reference or aligning your exome reads?


Thanks Pierre, Aaron!


Chris: That's a neat summary!

10.5 years ago
Suganthi ▴ 50

In hg19, the chr6 sequences that are not labeled random (the chr6_*_hap* entries) are alternate haplotypes of the MHC region, and I believe a similar situation exists for chr17 (though I am not sure what the alternate loci are). Please take a look at:

http://vega.sanger.ac.uk/info/data/MHC_Homo_sapiens.html

http://genomeref.blogspot.com/

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/index.shtml

As to whether these regions should be included for alignment, perhaps yes, but it is bound to be complicated due to high similarity of regions.


Thanks a lot, Suganthi!
