What is in the UCSC/NCBI genomes please?
7.6 years ago
Aurelie MLB ▴ 360


I am looking at the human genomes available through Bioconductor packages: NCBI GRCh38 and UCSC.hg19. I do not understand all of the different sequence names I see. Could you help, please?

In UCSC.hg19, I have chromosomes 1 to 22 plus X and Y. But I also have chrM, chr1_gl000191_random, chr4_ctg9_hap1, chrUn_gl000212, and so on.

In the NCBI GRCh38, I can see sequences called MT, HSCHR1_CTG2_UNLOCALIZED, HSCHR3UN_CTG2, HSCHR2_RANDOM_CTG1....

What do the _random, _unlocalized, chrUn, and _ctg9_hap1 parts of these names mean?

Should all of these sequences be used when aligning NGS reads to the genome, for instance, or only a subset?

7.6 years ago

chrM == MT == mitochondrial DNA. You should probably use this.

The chrUn_* sequences are unplaced contigs. So they may belong in the genome, but we don't know where. I personally use these, but I know that's not universal.

chr*_unlocalized and chr??_*_random are contigs that are known to belong to a specific chromosome (I've never looked into how that was determined) but haven't yet been placed at a specific position on it. You'll want to use these.

The various *hap* chromosomes are alternate haplotypes. There aren't many good ways to deal with these during alignment yet, so a lot of people leave them out. An upcoming version of BWA is supposed to handle them properly, so if you plan to use BWA, keep an eye out for that update and then definitely use these.
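
To make the categories above concrete, here is a minimal Python sketch that sorts sequence names into those groups and picks a subset for alignment. The patterns are assumptions inferred only from the hg19-style example names in the question (plus NCBI's MT); they are not an official naming specification, and the other NCBI-style names are not covered, so they fall into "other". The function names `classify` and `select_for_alignment` are hypothetical helpers, not part of any package.

```python
import re

# Assumed patterns, inferred only from the example names in this thread
# (UCSC hg19 style, plus NCBI's "MT"); not an official naming specification.
PRIMARY = re.compile(r"^chr(\d+|X|Y)$")
ALT_HAP = re.compile(r"_hap\d+$")


def classify(name):
    """Return a rough category for a sequence name."""
    if name in ("chrM", "MT"):
        return "mitochondrial"
    if name.startswith("chrUn_"):
        return "unplaced"        # chromosome of origin unknown
    if name.endswith("_random"):
        return "unlocalized"     # chromosome known, position unknown
    if ALT_HAP.search(name):
        return "alt_haplotype"
    if PRIMARY.match(name):
        return "primary"
    return "other"               # e.g. NCBI-style names not handled here


def select_for_alignment(names, keep_unplaced=True, keep_alt=False):
    """Hypothetical helper: keep the categories suggested in the answer."""
    keep = {"primary", "mitochondrial", "unlocalized"}
    if keep_unplaced:
        keep.add("unplaced")
    if keep_alt:
        keep.add("alt_haplotype")
    return [n for n in names if classify(n) in keep]


names = ["chr1", "chrX", "chrM", "chr1_gl000191_random",
         "chr4_ctg9_hap1", "chrUn_gl000212"]
for n in names:
    print(n, classify(n))
print(select_for_alignment(names))
```

The defaults mirror the choices described above: unplaced contigs are kept and alternate haplotypes are dropped. Both are switchable, since neither choice is universal.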

