What is in the UCSC/NCBI genomes please?
7.6 years ago
Aurelie MLB ▴ 360


I am looking at the human genomes available through Bioconductor packages: NCBI GRCh38 and UCSC.hg19. I do not understand all the different sequence names I see. Could you help, please?

In UCSC.hg19, I have chromosomes 1 to 22 plus X and Y. But I also have chrM, chr1_gl000191_random, chr4_ctg9_hap1, chrUn_gl000212, and so on.

In the NCBI GRCh38, I can see sequences called MT, HSCHR1_CTG2_UNLOCALIZED, HSCHR3UN_CTG2, HSCHR2_RANDOM_CTG1....

What are those _random, _unlocalized, chrUn, and _ctg9_hap1 sequences, please?

Should all of those sequences be used when aligning NGS reads to the genome, for instance, or only a subset?

many thanks


7.6 years ago

chrM == MT == mitochondrial DNA. You should probably use this.

The chrUn_* sequences are unplaced contigs. So they may belong in the genome, but we don't know where. I personally use these, but I know that's not universal.

chr*_unlocalized and chr??_*_random are contigs that are known to belong to a specific chromosome (I've never looked into how that was determined) but haven't yet been integrated into it. You'll want to use these.

The various *hap* chromosomes are alternate haplotypes. There aren't a lot of good ways to deal with these during alignment yet, so a lot of people don't use them. An upcoming version of BWA is supposed to handle these well, so if you plan to use BWA, keep an eye out for that update and then definitely use them.
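If it helps, the categories above can be sketched as a small name classifier. This is only an illustration: the function names `classify_seqname` and `select_for_alignment` are made up here, and the patterns are inferred solely from the example names quoted in this thread, so other assemblies or naming schemes may need different rules.

```python
import re

# Primary assembled chromosomes: 1-22, X, Y, with or without a "chr" prefix.
_PRIMARY = re.compile(r"^(chr)?([1-9]|1[0-9]|2[0-2]|x|y)$")


def classify_seqname(name):
    """Guess the category of a sequence name (hypothetical helper).

    The patterns are assumptions based on the names quoted in this thread.
    """
    n = name.lower()
    if n in ("chrm", "mt"):
        return "mitochondrial"
    if _PRIMARY.match(n):
        return "primary"
    if re.search(r"_hap\d+$", n):
        # e.g. chr4_ctg9_hap1: alternate haplotype
        return "alt_haplotype"
    if n.startswith("chrun_"):
        # e.g. chrUn_gl000212: unplaced contig, chromosome unknown
        return "unplaced"
    if "_random" in n or "_unlocalized" in n:
        # e.g. chr1_gl000191_random: chromosome known, not yet integrated
        return "unlocalized"
    return "other"


def select_for_alignment(names, include_unplaced=True):
    """Keep primary, mitochondrial and unlocalized sequences.

    Unplaced contigs are optional; alternate haplotypes and anything
    unrecognised are dropped.
    """
    keep = {"primary", "mitochondrial", "unlocalized"}
    if include_unplaced:
        keep.add("unplaced")
    return [s for s in names if classify_seqname(s) in keep]
```

Calling `select_for_alignment` on the list of sequence names from your genome package would give the subset described above. Set `include_unplaced=False` if you prefer to leave out the chrUn_* contigs.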


Thanks a lot !!!

