Why does UCSC gap definitions disagree with the Ns in reference genome?
1
1
Entering edit mode
8.4 years ago
Duck ▴ 40

For example, in region chr2:33,142,137-33,142,249 in hg19, you can only see Ns. However, this region is not defined as a gap based on "Gap" track.

Similar thing is seen in hg38, where there are 752 N gaps on chromosomes chr1~chr22, chrX,chrY, but only 603 gaps defined in UCSC gap file.

How to explain the difference? Thanks.

UCSC-genome-browser Gap • 2.3k views
ADD COMMENT
0
Entering edit mode
8.4 years ago

When you say that there are 752 N gaps in hg38, how to you measure them? What is your definition?

Let's play with it a bit:

> library(AnnotationHub)
> library(Homo.sapiens)
> library(BSgenome.Hsapiens.UCSC.hg19)

> ahub = AnnotationHub()
> AnnotationHub::query(ahub, c('hg19', 'gap'))
> human.gaps = ahub[['AH6444']]
> human.nongaps = BiocGenerics::setdiff(GRanges(seqinfo(human.gaps)), human.gaps)
> alphabetFrequency(Views(BSgenome.Hsapiens.UCSC.hg19, human.gaps)) %>% head
     A C G T M R W S Y K V H D B        N - + .
[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18000000 0 0 0
[2,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0  3000000 0 0 0
[3,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0   150000 0 0 0
[4,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0   100000 0 0 0
[5,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0    50000 0 0 0
[6,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0   150000 0 0 0

This shows that all the regions in the gaps track contains only N and no other nucleotides. So far so good, this is what we expected.

> summary(width(human.gaps))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      47    40440    50000   524800   100000 30000000

The distribution of gaps sizes shows that on average a gap is 524,800 bases long, with a median of 50,000.

What is the distribution of lengths in your 752-603=149 missing gaps?

The example on chr2 you posted may be simply too short. As you can see from the distribution, most gaps are a few orders of magnitude larger. For example there is only one gap <500bp, and it is on a hap chromosome:

> subset(human.gaps, width(human.gaps) < 500)
UCSC track 'gap'
UCSCData object with 1 range and 0 metadata columns:
           seqnames             ranges strand
              <Rle>          <IRanges>  <Rle>
  [1] chr6_mcf_hap5 [4149580, 4149626]      *
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
ADD COMMENT
0
Entering edit mode

Hi,

Thank you for the example. I would like to know, if I can't those gaps as a bed file. I have couple bed files and my aim is to find overlaps across all the samples. But during this process, my overlapping regions come over to NNNNNNNNNs when I check them on IGV. Therefore, I would like to mask them out prior to visualisation.

Thank you for your help, (sorry for asking with a comment)

Tunc.

ADD REPLY
0
Entering edit mode

So what is the meaning of these 149 short missing gaps? Do they represent areas where the sequencing got interrupted?

ADD REPLY

Login before adding your answer.

Traffic: 2339 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6