I've been looking at the UK Biobank data, and it seems to hold segment mean l2r (log base 2 ratio) values for copy number variation, but not the segment start and end positions. Each file covers one chromosome and contains all 500,000 participants. Does anyone know where to find the actual chromosomal locations these values correspond to?
My understanding is that the UK Biobank l2r files are the copy ratio estimates at each probe in the SNP array. They have not been segmented in the publicly available dataset, so there are no segment breakpoints.
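If the goal is just to put coordinates on the per-probe values, one option is to pair each l2r row with a probe position taken from the array's marker file. The sketch below is only an illustration under assumptions I have not verified against the UKB documentation: that each row of an l2r file is one probe, that rows are in the same order as a PLINK BIM-style marker file for that chromosome, and that values are whitespace-separated with one column per participant. The function names and the toy inputs are made up for the example.

```python
import io
import math


def read_probe_positions(bim_handle):
    """Parse BIM-style lines: chrom, probe id, cM position, bp position, allele 1, allele 2."""
    positions = []
    for line in bim_handle:
        fields = line.split()
        if fields:
            positions.append((fields[0], fields[1], int(fields[3])))
    return positions


def to_float(token):
    """Convert a token to float, mapping non-numeric placeholders to NaN."""
    try:
        return float(token)
    except ValueError:
        return math.nan


def annotate_l2r(l2r_handle, positions):
    """Attach a probe position to each row of per-participant log2 ratios.

    ASSUMPTION: row i of the l2r file is probe i of the marker file.
    """
    rows = [line.split() for line in l2r_handle if line.strip()]
    if len(rows) != len(positions):
        raise ValueError("l2r rows and marker rows differ in count")
    return [
        {"chrom": chrom, "probe": probe, "pos": bp, "l2r": [to_float(v) for v in row]}
        for (chrom, probe, bp), row in zip(positions, rows)
    ]


def l2r_to_copy_number(l2r, baseline=2.0):
    """Convert a log2 ratio to an estimated copy number for a diploid baseline."""
    return baseline * 2.0 ** l2r


# Toy inputs (made up): two probes, two participants.
bim = io.StringIO("1 rs1 0 1000 A G\n1 rs2 0 2500 C T\n")
l2r = io.StringIO("0.0 -1.0\nNA 0.5\n")
annotated = annotate_l2r(l2r, read_probe_positions(bim))
print(annotated[0]["pos"], annotated[0]["l2r"])
```

This only gives a position per probe, not segments. With 500,000 columns per row you would want to stream the real files rather than load them whole, which the sketch does not attempt.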
A couple of papers survey CNVs; they ran PennCNV on these input files to smooth the CNV signal and detect breakpoints. I'd retrieve the processed calls from those studies rather than from UKB: reprocessing would be incredibly expensive, and the original studies were done well.
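To make concrete what "detecting breakpoints" means for per-probe data, here is a deliberately naive sketch for one participant. It is not PennCNV's method (PennCNV fits a hidden Markov model and also uses B allele frequency); it just thresholds each probe's log2 ratio and merges consecutive probes in the same state. The function name and the cutoffs of -0.3 and 0.3 are arbitrary choices for illustration, not recommended values.

```python
def call_segments(positions, l2r_values, loss_cut=-0.3, gain_cut=0.3):
    """Merge consecutive probes with the same thresholded state into segments.

    positions: probe bp coordinates, sorted along one chromosome.
    l2r_values: log2 ratios for one participant, one per probe.
    """
    def state(value):
        if value <= loss_cut:
            return "loss"
        if value >= gain_cut:
            return "gain"
        return "neutral"

    segments = []
    for pos, value in zip(positions, l2r_values):
        current = state(value)
        if segments and segments[-1]["state"] == current:
            # Same state as the previous probe: extend the open segment.
            segments[-1]["end"] = pos
            segments[-1]["n_probes"] += 1
        else:
            # State changed: this probe starts a new segment (a breakpoint).
            segments.append({"state": current, "start": pos, "end": pos, "n_probes": 1})
    return segments


# Toy data (made up): five probes for one participant.
for seg in call_segments([100, 200, 300, 400, 500], [0.0, -0.5, -0.6, 0.05, 0.4]):
    print(seg)
```

Real array data is noisy enough that single-probe thresholding like this would produce many spurious segments, which is part of why the published calls are the better starting point.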
I'm aware of efforts to call CNVs from the recently available whole-exome sequencing datasets as well. These aren't available for the full 500k cohort yet, but they're worth keeping an eye on, as the SNP arrays may not have probes at some potentially important or likely CNV locations.