sequencing coverage information: Trying to understand the data format
21 months ago
mito ▴ 10

Hi,

I want to determine how well different regions of the genome are sequenced, based on the gnomAD coverage data. I have downloaded the following genome-wide tabular data; the header and the first rows are shown here:

chrom  pos    mean        median  over_1      over_5      over_10      over_(and so on...)
1      12141  2.9005e-02  0       2.1939e-02  1.0547e-04  0.0000e+00
1      12142  2.9216e-02  0       2.1622e-02  1.0547e-04  0.0000e+00
1      12143  2.7951e-02  0       2.1200e-02  1.0547e-04  0.0000e+00
1      12144  2.9111e-02  0       2.1728e-02  1.0547e-04  0.0000e+00
1      12145  2.9216e-02  0       2.1833e-02  1.0547e-04  0.0000e+00
1      12146  2.6790e-02  0       2.0251e-02  1.0547e-04  0.0000e+00
1      12147  3.2802e-02  0       2.4048e-02  1.0547e-04  0.0000e+00
1      12148  3.3330e-02  0       2.4470e-02  1.0547e-04  0.0000e+00
1      12149  3.4279e-02  0       2.4786e-02  0.0000e+00  0.0000e+00

However, I have not been able to find out:

  1. the coverage numbers actually mean. I am aware of this question. There appear to be different possible meanings for coverage beyond just the number of reads overlapping a region. Is there any way to determine which of the many meanings applies here?

  2. what over_1, over_5, over_10 (and so on) mean. Do these refer to the mean or median over n neighboring positions? If so, is it n positions on each side, or a centered window of n positions in total?

Is this a standard TSV-based data format or is it gnomAD-specific?

coverage gnomad • 401 views
21 months ago
mito ▴ 10

I figured it out. I had not noticed the corresponding FAQ entry before. After reading it and thinking a bit, I came to the following conclusions:

  1. The data set combines many samples, each with its own coverage.
  2. The coverage of a sample at a position is the number of reads overlapping that position, so it is an integer.
  3. The mean and median columns summarize that per-sample coverage across all samples.
  4. The over_n columns give the fraction of samples with a coverage of at least n at that position. They are not windows over neighboring positions.
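
To make this interpretation concrete, here is a minimal sketch of how one row of the table could be derived from per-sample depths at a single position. The function name `summarize_position` and the eight depth values are made up for illustration; this is not the gnomAD pipeline, just the arithmetic implied by the four points above.

```python
from statistics import mean, median

def summarize_position(depths, thresholds=(1, 5, 10)):
    """Summarize per-sample read depths at one genomic position.

    depths: one integer read depth per sample (hypothetical input).
    Returns the mean, the median and, for each threshold n,
    the fraction of samples with depth >= n ("over_n").
    """
    n_samples = len(depths)
    summary = {"mean": mean(depths), "median": median(depths)}
    for n in thresholds:
        summary[f"over_{n}"] = sum(d >= n for d in depths) / n_samples
    return summary

# Made-up depths for 8 samples at one position: most samples have no reads.
depths = [0, 0, 0, 0, 0, 1, 2, 12]
print(summarize_position(depths))
```

With these toy values the median is 0 while the mean is above 0, which is the same pattern as in the rows from the question: a few samples with reads pull the mean up, but more than half of the samples have no coverage at all.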

It also seems that, at the positions shown, most samples have little or no coverage (over_1 is only about 0.02).
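
For the original goal of finding poorly sequenced regions, a rough sketch along these lines could flag positions where too few samples reach a given depth. It assumes a whitespace- or tab-separated file with a header row like the one in the question; the function name `low_coverage_positions`, the choice of the over_10 column and the 0.9 cutoff are arbitrary assumptions for illustration, not gnomAD recommendations.

```python
import io

# Two rows copied from the question, written as tab-separated text.
SAMPLE = """\
chrom\tpos\tmean\tmedian\tover_1\tover_5\tover_10
1\t12141\t2.9005e-02\t0\t2.1939e-02\t1.0547e-04\t0.0000e+00
1\t12149\t3.4279e-02\t0\t2.4786e-02\t0.0000e+00\t0.0000e+00
"""

def low_coverage_positions(handle, column="over_10", min_fraction=0.9):
    """Yield (chrom, pos, fraction) for positions below the cutoff.

    handle: open text file with a header line naming the columns.
    column: which over_n column to inspect (assumed to exist).
    min_fraction: hypothetical cutoff for "well covered".
    """
    header = handle.readline().split()
    idx = header.index(column)
    for line in handle:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        fraction = float(fields[idx])
        if fraction < min_fraction:
            yield fields[0], int(fields[1]), fraction

flagged = list(low_coverage_positions(io.StringIO(SAMPLE)))
print(flagged)
```

For a real download, the `io.StringIO(SAMPLE)` handle would be replaced by an opened file, for example via `gzip.open(path, "rt")` if the file is gzip-compressed.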
