Question: UK Biobank l2r data holds segment means but no segment locations?
18 months ago by
United States, Irvine, University of California - Irvine
tohc0 wrote:

I've been looking at UK Biobank data, and it seems to hold the segment mean l2r (log2 ratio) values for copy number variation but not the segment start and end positions. Each file covers one chromosome and contains all 500,000 participants. Does anyone know where we might find the actual chromosomal locations these values correspond to?

Added the tags you mentioned

Well, in addition to Ram's comments, which data are you looking at, exactly? You have provided no links. I can probably just contact the relevant person directly if you let me know where you obtained your data.

Hey Kevin,

I can't exactly give you a link to the data itself, since UK Biobank has a strict policy on how data is given out. However, this is the project website: UK Biobank. A lot of the documentation seems to be about raw sequencing reads, but from my understanding the inferred l2r CNV output files are derived from those raw files.

This is the link to the actual instructions for data download: Resource 664. The data we are using is the CNV log2r data; however, as you can read there, the downloaded files are per chromosome. The issue is that the files hold only the log2r values and give no indication of which portion of the chromosome they come from, which is not very helpful.

Hope that clarifies things.

Thanks in advance!

Edit: I should also mention that by "segment means" I mean the log2r values; I've been using the two terms interchangeably.

16 months ago by
San Francisco, CA
Eric T. wrote:

My understanding is that the UK Biobank l2r files are the copy ratio estimates at each probe in the SNP array -- they have not been segmented in the publicly available dataset, so there are no segment breakpoints.
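As a rough sketch of what this means in practice: if each row of an l2r file is one probe, the genomic position has to come from a separate marker annotation that lists the probes in the same order. The example below assumes a PLINK-style .bim marker table (chromosome, marker ID, genetic distance, base-pair position, two alleles) whose rows line up one-to-one with the l2r rows. That alignment, the helper name `attach_positions`, and the toy values are assumptions for illustration, not something confirmed in this thread.

```python
def attach_positions(bim_lines, l2r_lines):
    """Pair each per-probe l2r row with the position from a .bim-style table.

    Assumption (not confirmed here): both inputs list probes in the same order,
    one probe per row, with whitespace-separated fields.
    """
    bim_rows = [line.split() for line in bim_lines if line.strip()]
    l2r_rows = [line.split() for line in l2r_lines if line.strip()]
    if len(bim_rows) != len(l2r_rows):
        raise ValueError("marker table and l2r file have different row counts")

    records = []
    for bim, values in zip(bim_rows, l2r_rows):
        # .bim columns: chromosome, marker ID, genetic distance, bp position, alleles
        chrom, marker_id, _cm, pos = bim[:4]
        records.append({
            "chrom": chrom,
            "marker": marker_id,
            "pos": int(pos),
            "l2r": [float(v) for v in values],  # one value per participant
        })
    return records


# Toy data: three probes, two participants (all values are made up).
bim = [
    "1 rs0001 0 10500 A G",
    "1 rs0002 0 20750 C T",
    "1 rs0003 0 31200 G A",
]
l2r = [
    "0.02 -0.51",
    "-0.01 -0.48",
    "0.03 -0.55",
]
for rec in attach_positions(bim, l2r):
    print(rec["chrom"], rec["pos"], rec["l2r"])
```

The row-count check is there because a silent misalignment between the two files would assign every value to the wrong position.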

There are a couple of papers that survey CNVs; they used PennCNV on these input files to smooth the CNV signal and detect breakpoints. I'd retrieve the processed calls from those studies, rather than UKB; reprocessing would be incredibly expensive, and the original studies were done well.
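To make the difference between per-probe values and segment means concrete: once a caller has chosen breakpoints, a segment mean is simply the average l2r of the probes inside that segment. The sketch below is not PennCNV and does no breakpoint detection; the breakpoints are supplied by hand, and the function name `segment_means` and the toy values are hypothetical.

```python
from statistics import mean


def segment_means(positions, l2r_values, breakpoints):
    """Summarise per-probe l2r values into segments.

    breakpoints: sorted probe indices at which a new segment starts
                 (index 0 is implied and must not be listed).
    Returns a list of (start_pos, end_pos, n_probes, mean_l2r) tuples.
    """
    starts = [0] + list(breakpoints)
    ends = list(breakpoints) + [len(positions)]
    segments = []
    for s, e in zip(starts, ends):
        segments.append(
            (positions[s], positions[e - 1], e - s, mean(l2r_values[s:e]))
        )
    return segments


# Toy data for one participant: six probes, with a drop starting at the fourth.
positions = [10500, 20750, 31200, 45000, 52300, 60100]
l2r_values = [0.02, -0.01, 0.03, -0.52, -0.48, -0.50]

for seg in segment_means(positions, l2r_values, breakpoints=[3]):
    print(seg)
```

A file that has already been reduced to segment means would need the start and end columns shown here, which is why per-probe files carry none.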

I'm aware of efforts to call CNVs from the recently available whole-exome sequencing datasets as well. These aren't available for the full 500k cohort yet, but it's worth keeping an eye on these efforts, as the SNP arrays may not have used probes at some potentially important / likely CNV locations.

I see. That would make more sense, I suppose. Our literature search also found PennCNV used in a lot of papers. I'm assuming that consecutive probes in the array are "roughly" next to each other on the physical chromosome; however, that seems like a rather big assumption.

For my own conceptual understanding: are the l2r values in UK Biobank essentially estimated CNVs for the SNPs in the array?

Either way thank you for your explanation!
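On the conceptual question, a log2 ratio is a relative intensity at a probe rather than a copy number call. Under the common idealisation of a diploid reference and an undistorted signal, l2r = log2(CN / 2), so CN = 2 * 2^l2r. The sketch below applies only that idealised formula; real array values are noisy and can deviate from it, and the function name `l2r_to_copy_number` is hypothetical.

```python
def l2r_to_copy_number(l2r, reference_ploidy=2):
    """Idealised conversion from a log2 ratio to an absolute copy number.

    Assumes the reference has `reference_ploidy` copies and that the
    measured ratio is undistorted, which real array data rarely satisfies.
    """
    return reference_ploidy * 2 ** l2r


for value in (-1.0, 0.0, 0.585):
    print(value, round(l2r_to_copy_number(value), 2))
```

In this idealised picture a value near 0 corresponds to two copies, a value near -1 to a single-copy loss, and a value near 0.585 to a single-copy gain.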

