No Sequencing Data at Low Positions
4.7 years ago
ccnn ▴ 20

Opening a bam file for just chr22, I was surprised to see that there were no reads aligned until around position 16,050,000. In UCSC Genome Browser, looking at a window of chr22:16,049,420-16,050,420, I can see that there's "nothing," but then different tracks start. In chr1, I also think I remember seeing that alignments started only at 10,000.

Why is there nothing earlier in the chromosome? Do those positions not correspond to DNA? I've downloaded data on the length/end position of each chromosome; can I find a list of these "start" positions?

Are you looking at the right genome build in UCSC? Also, are you sure the data is aligned against a UCSC genome build (which has a chr prefix for chromosome names, as opposed to other builds, which may use only numbers)?

The beginning of a chromosome's reference sequence may contain only N's, since the ends of chromosomes are hard to sequence.

chr22 is only 51304566 bp long, so the first ~16M bases are almost a third of it.

Ah, it does indeed look like everything before that is "N" when I zoom in to "base" level in the browser.

So those N nucleotides mean "we know there are nucleotides there, we are just not sure what they are."

Got it. Thank you! So is there somewhere I can find out how many of the first bases of each chromosome are N?

Hello,

you could use your language of choice to find the first position in each reference sequence which is not an N.
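For instance, here is a minimal Python sketch of that idea. It assumes the reference is a plain-text (uncompressed) FASTA file and treats both "N" and "n" as unknown bases; the function name `leading_n_counts` and the tiny in-memory demo sequences are made up for illustration, so adapt them to your own reference.

```python
import io


def leading_n_counts(fasta_lines):
    """Return {sequence_name: number of leading N/n bases} for a FASTA stream."""
    counts = {}
    name = None
    done = False  # True once a non-N base has been seen in the current record
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New record: keep only the first word of the header as the name.
            header = line[1:].split()
            name = header[0] if header else ""
            counts[name] = 0
            done = False
        elif name is not None and not done:
            # Count the N's at the start of this line; stop at the first real base.
            stripped = line.lstrip("Nn")
            counts[name] += len(line) - len(stripped)
            if stripped:
                done = True
    return counts


# Toy example with made-up sequences (not real chromosome data).
demo = io.StringIO(">chrA\nNNNN\nNNAC\nGTNN\n>chrB\nACGT\n")
for seq_name, n_count in leading_n_counts(demo).items():
    # The first non-N base is at 1-based position n_count + 1.
    print(seq_name, n_count, n_count + 1)
```

To run it on a real reference, pass an open file handle instead of the demo string, e.g. `with open(path_to_your_fasta) as fh: leading_n_counts(fh)`. Because it reads line by line, it does not need to hold a whole chromosome in memory. Note that it only reports the leading run of N's; gaps in the middle or at the end of a chromosome are not counted.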

