Question: Why is the hg38 exome so much bigger than hg19?
4 weeks ago by
European Union
ej60 wrote:


I downloaded the NCBI RefSeq curated file from the Genes and Gene Predictions group of the UCSC Table Browser for hg38, because I want to use the exon coordinates as a target file for calling variants on exome sequencing data.

I noticed, however, that the exon coordinates cover approximately twice as much of the genome as the hg19 exon coordinates did (~80 million bp vs. ~40 million bp). Is it possible that the exome is really twice as large in hg38?

I do not want to call variants on all of these regions, since ~30% of these exonic regions are not covered at all in my WES data and another ~10% are covered at <10x. I would definitely like to exclude these regions from the target file, but I do not fully understand what these regions are or why they were included in the first place.

Any help would be greatly appreciated.

target refseq hg38 exome • 102 views

No, exons should not vary that much from freeze to freeze. More importantly, if this is really about exome coverage, use the BED file that came with your capture kit. If you are interested in coding variants, use CCDS.

written 4 weeks ago by Jeremy Leipzig
4 weeks ago by
United States
vkkodali2.0k wrote:

I am not entirely sure why you are seeing such a huge difference in exome sizes between hg38 and hg19. Could you please describe in a little more detail how you are computing these values?
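For reference, here is a minimal Python sketch of one way to compute the size of an exon set. It assumes the exons are available as BED-style (chrom, start, end) half-open intervals; the function names and the toy coordinates are made up for illustration. It contrasts a naive sum of exon lengths with the merged, non-redundant footprint. Whether overlapping exons from multiple transcripts explain the difference you see is only a guess on my part, but comparing the two numbers for both builds is a quick check.

```python
from collections import defaultdict


def naive_length(intervals):
    """Sum of interval lengths; overlapping bases are counted more than once."""
    return sum(end - start for _, start, end in intervals)


def merged_length(intervals):
    """Total bases covered by half-open (chrom, start, end) intervals,
    counting each base once even if several exons overlap it."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    total = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or directly adjacent: extend the current block.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


# Toy coordinates, not real annotation.
exons = [
    ("chr1", 100, 200),
    ("chr1", 150, 250),  # overlaps the previous exon
    ("chr1", 300, 400),
    ("chr2", 0, 50),
]
print(naive_length(exons), merged_length(exons))  # 350 300
```

If the naive and merged totals differ a lot for one build but not the other, the extra size is likely redundancy in the exon list rather than new exonic sequence.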

As far as RefSeq data are concerned, I strongly recommend downloading the relevant files from RefSeq rather than from UCSC. The data displayed in the UCSC browser are processed by the folks at UCSC and don't necessarily match the RefSeq data exactly.

hg19 or GRCh37

RefSeq no longer actively annotates hg19, though updates are released occasionally. For the latest annotation data, go to NCBI Assembly and search for GRCh37. On the results page, click the 'Download' button in the result card, then choose RefSeq as the source and GFF3 as the file format to download the latest version of the annotation (released in September 2019).

hg38 or GRCh38

To download the annotation for hg38 or GRCh38, go to NCBI Assembly and search for GRCh38. On the results page, click the first hit to open its assembly page, then use the blue 'Download Assemblies' button to download the RefSeq GFF3 file.

RefSeq annotation data are also provided in other file formats, such as GTF and FASTA, and can be downloaded by following the same steps.
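Once you have the GFF3 file, the exon coordinates can be pulled out with a few lines of Python. The sketch below is only an illustration: the function name `gff3_exons_to_bed` and the toy records are hypothetical, and it assumes a standard nine-column, tab-separated GFF3 in which exon rows have `exon` in the third column. GFF3 coordinates are 1-based and inclusive while BED is 0-based and half-open, so the start is shifted by one. Note that RefSeq GFF3 files use RefSeq accessions (for example NC_000001.11) as sequence names rather than UCSC-style names such as chr1, so the names may need to be mapped before the result is used alongside hg38 alignments.

```python
def gff3_exons_to_bed(lines):
    """Return sorted, de-duplicated (seqid, start0, end, strand) tuples
    for every 'exon' feature in an iterable of GFF3 lines."""
    bed = set()
    for line in lines:
        # Skip directives, comments, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        seqid, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        # GFF3 is 1-based inclusive; BED is 0-based half-open.
        bed.add((seqid, start - 1, end, strand))
    return sorted(bed)


# Toy records with made-up coordinates and IDs, not real annotation.
sample = [
    "##gff-version 3",
    "NC_000001.11\tBestRefSeq\tgene\t1000\t2000\t.\t+\t.\tID=gene-TOY1",
    "NC_000001.11\tBestRefSeq\texon\t1000\t1200\t.\t+\t.\tID=exon-1;Parent=rna-TOY1",
    "NC_000001.11\tBestRefSeq\texon\t1000\t1200\t.\t+\t.\tID=exon-2;Parent=rna-TOY2",
    "NC_000001.11\tBestRefSeq\texon\t1500\t2000\t.\t+\t.\tID=exon-3;Parent=rna-TOY1",
]
for seqid, start, end, strand in gff3_exons_to_bed(sample):
    print(seqid, start, end, strand, sep="\t")
```

To run it on a real file, pass an open file handle instead of `sample`. The same exon shared by several transcripts is written only once, but partially overlapping exons are kept as separate rows and would still need merging before their lengths are summed.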
