Question

What kind of Affymetrix SNP 6.0 datafile is this?

1

4.8 years ago

andorjkiss ▴ 40

Hi,

I'm new to microarray data formats and I'm wondering exactly what state this data is in. Is it normalised? Is it a text version of a CEL file? It's publicly available data from the NCI TARGET Osteosarcoma study, generated on the Affymetrix SNP 6.0 (human) array. TIA

Example here:

SampleName                Chromosome  Start    End       Num Probes  Segment Mean
TARGET-40-0A4HLD-01A-01D  chr1        61735    709822    36          1.4796
TARGET-40-0A4HLD-01A-01D  chr1        709822   4114370   1155        2.72479
TARGET-40-0A4HLD-01A-01D  chr1        4114370  4408030   229         3.78919
TARGET-40-0A4HLD-01A-01D  chr1        4408030  4417924   10          2.2115
TARGET-40-0A4HLD-01A-01D  chr1        4417924  6229890   1540        3.62128
TARGET-40-0A4HLD-01A-01D  chr1        6229890  12860827  3674        3.09346

SNP 6.0 Affymetrix • 2.0k views

ADD COMMENT • link 4.8 years ago by andorjkiss ▴ 40

0

So this is un-normalised data, correct? And if I wanted normalised data (say L3 RMA-normalised) I would have to obtain the CEL/CDF files? Am I correct about this for normalisation? TIA

ADD REPLY • link 4.8 years ago by andorjkiss ▴ 40

0

Given that it reports the 'segment mean' and the number of probes per segment, I'm assuming that the raw data has undergone circular binary segmentation (CBS), a popular algorithm for determining copy number; therefore, it would be normalised and effectively ready for use. Can you give an exact source (link / URL) from where you got it?
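To make the "ready for use" point concrete, here is a minimal sketch that reads the six rows from the question and labels each segment. It assumes the file is tab-separated and that Segment Mean is on a linear copy-number scale where 2 is diploid; the values suggest this, but nothing in the thread confirms it. The function names and the 1.5 / 2.5 cut-offs are hypothetical and chosen purely for illustration.

```python
import csv
import io

# The six rows from the question, re-typed as tab-separated text
# (assumption: the real file uses tabs as the delimiter).
SEGMENTS_TSV = """SampleName\tChromosome\tStart\tEnd\tNum Probes\tSegment Mean
TARGET-40-0A4HLD-01A-01D\tchr1\t61735\t709822\t36\t1.4796
TARGET-40-0A4HLD-01A-01D\tchr1\t709822\t4114370\t1155\t2.72479
TARGET-40-0A4HLD-01A-01D\tchr1\t4114370\t4408030\t229\t3.78919
TARGET-40-0A4HLD-01A-01D\tchr1\t4408030\t4417924\t10\t2.2115
TARGET-40-0A4HLD-01A-01D\tchr1\t4417924\t6229890\t1540\t3.62128
TARGET-40-0A4HLD-01A-01D\tchr1\t6229890\t12860827\t3674\t3.09346
"""


def read_segments(text):
    """Parse the segment table into a list of dicts with typed values."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {
            "sample": row["SampleName"],
            "chrom": row["Chromosome"],
            "start": int(row["Start"]),
            "end": int(row["End"]),
            "num_probes": int(row["Num Probes"]),
            "segment_mean": float(row["Segment Mean"]),
        }
        for row in reader
    ]


def call_state(segment_mean, loss_below=1.5, gain_above=2.5):
    """Label a segment; the thresholds are hypothetical, not from TARGET."""
    if segment_mean < loss_below:
        return "loss"
    if segment_mean > gain_above:
        return "gain"
    return "neutral"


segments = read_segments(SEGMENTS_TSV)
for seg in segments:
    print(seg["chrom"], seg["start"], seg["end"],
          seg["num_probes"], call_state(seg["segment_mean"]))
```

If the values turn out to be log2 ratios instead, the cut-offs would need to be centred on 0 rather than 2.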

ADD REPLY • link 4.8 years ago by Kevin Blighe 87k

0

Sure, it's from the Osteosarcoma public data set at TARGET: https://ocg.cancer.gov/programs/target/data-matrix

ADD REPLY • link 4.8 years ago by andorjkiss ▴ 40

0

Thanks for the link. I can see ( here ) that they processed the copy number data using Partek Genomics Suite, but they do not provide much extra information. It is safe to assume that it is normalised, though. Basically, if you plot that data in a karyoplot, you'll see gains and losses across the genome. Another thing you could do with it is overlap the segments with known genes.
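Overlapping segments with genes can be sketched in a few lines. The segment rows below are the ones from the question; the gene names and coordinates are hypothetical placeholders (real ones would come from a gene annotation for the matching genome build), and the helper names are made up for this example. Intervals are treated as half-open, which is also an assumption about the file.

```python
from collections import namedtuple

Gene = namedtuple("Gene", "name chrom start end")

# (chrom, start, end, segment_mean) rows copied from the question.
segments = [
    ("chr1", 61735, 709822, 1.4796),
    ("chr1", 709822, 4114370, 2.72479),
    ("chr1", 4114370, 4408030, 3.78919),
    ("chr1", 4408030, 4417924, 2.2115),
    ("chr1", 4417924, 6229890, 3.62128),
    ("chr1", 6229890, 12860827, 3.09346),
]

# HYPOTHETICAL genes: names and coordinates are invented for illustration.
genes = [
    Gene("GENE_A", "chr1", 500000, 800000),    # straddles a segment boundary
    Gene("GENE_B", "chr1", 4410000, 4415000),  # sits inside one segment
    Gene("GENE_C", "chr2", 1000, 2000),        # chromosome with no segments
]


def overlapping_segments(gene, segments):
    """Return (start, end, mean) for every segment that overlaps the gene."""
    return [
        (start, end, mean)
        for chrom, start, end, mean in segments
        if chrom == gene.chrom and start < gene.end and gene.start < end
    ]


def gene_level_value(gene, segments):
    """Overlap-length-weighted average of Segment Mean; None if no overlap."""
    total_bp = 0
    weighted_sum = 0.0
    for start, end, mean in overlapping_segments(gene, segments):
        overlap_bp = min(end, gene.end) - max(start, gene.start)
        total_bp += overlap_bp
        weighted_sum += overlap_bp * mean
    return weighted_sum / total_bp if total_bp else None


for gene in genes:
    print(gene.name, gene_level_value(gene, segments))
```

Weighting by overlap length is only one possible choice; taking the value of the segment that covers most of the gene, or the most extreme overlapping segment, are common alternatives.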

ADD REPLY • link 4.8 years ago by Kevin Blighe 87k

0

Okay, thanks - this is very helpful. What we're ultimately trying to do is to use a programme called "InFlo" that makes use of datasets (RNA-Seq; WGS; WXS; SNP; Methylation) to build network maps. So, their input example file (https://github.com/VaradanLab/InFlo) is apparently L3 RMA Normalised (http://felixfan.github.io/RMA-Normalization-Microarray/) and looks like this:

"Gene_Name" "TCGA-24-1545-01" "TCGA-13-2060-01" "TCGA-24-1550-01"
"A2BP1" 0.201615042245111 0.135470000274077 0.0185439399735216

So, my real question is, can I get to the RMA Normalised data from what NCI TARGET makes available, or do I need the CEL data and normalise it myself?

ADD REPLY • link 4.8 years ago by andorjkiss ▴ 40

0

So, you need expression data? Then you should be looking for the Affymetrix expression arrays, no? If the CEL files are there, then I would just process them myself via RMA, but not sure of your experience in that area?

ADD REPLY • link 4.8 years ago by Kevin Blighe 87k

0

Okay, I've found the CEL files, the CHP files and studied how to do this via Bioconductor in R. I think that the RMA has been done and might already be in the CHP file(s). However, in the event that I need to normalise myself, I would have to choose an appropriate database file for the normalization. The documentation states that the library files needed are "GenomeWideSNP_6"; which I found here (http://bioconductor.org/packages/3.9/data/annotation/). Would this be the correct library file(s)?

ADD REPLY • link 4.7 years ago by andorjkiss ▴ 40

0

Well, the CEL files are the raw data files - they are produced by the camera device that scans the fluorescent intensities on the chips.

The CHP files are, if my memory serves me correctly, the genotype files produced by the Affymetrix Genotyping Console, so RMA will not have been applied to them. They most likely contain just the genotype calls for the SNP probes, probably made by the Birdseed or Birdseed 2 algorithm.

You will probably have to start from the CEL file stage, and follow the guidance on the Aroma Affymetrix website for the SNP 6.0.

ADD REPLY • link 4.7 years ago by Kevin Blighe 87k

score 2 · Answer 1 · 2019-07-19

That is copy number data. The Affymetrix SNP 6.0, despite its name, has >900,000 probes designed for copy number detection, in addition to the probes that can genotype SNPs. My PhD thesis (c.2012) was a large study in breast cancer that utilised this array type.

I've pulled this paragraph straight from my thesis:

The sixth generation of Affymetrix SNP arrays is the Genome-Wide Human SNP Array 6.0 (Affymetrix SNP 6.0). It genotypes far fewer than the ~10 million known SNPs but is still capable of building a large picture of variation (LaFramboise, 2009). The array chip is capable of genotyping 906,600 SNPs and contains an additional 945,826 probes for detecting CN. Of the CN probes, 115,000 target previously known CNVs. Overall, markers are distributed evenly across the genome, with a median marker spacing of 1-5 kb (Affymetrix, 2008a).

One can also derive copy number from the genotyping probes by studying their fluorescent intensities. The 'Aroma' project has done a lot of work on copy number detection for most of the major Affymetrix chips ( see https://aroma-project.org/ ).

Kevin