Question

How to read count information from an old RGB-based Agilent DNA array?

9 months ago

K.patel5 ▴ 140

Dear Biostars,

I am trying to prepare some published data to test a CNV filtration method I am working on. I would really like to use the data from Conrad et al (2007), mostly because it is highly cited and easy to access: https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-142?query=E-MTAB-142.

Unfortunately, Agilent tech is a bit before my time and I am struggling to figure out how to read the count information. Here is a snippet of their data, which should load easily into R as a data frame.

x <- structure(list(FEATURES = c("DATA", "DATA", "DATA"), FeatureNum = 6:8, 
    Row = c(1L, 1L, 1L), Col = 6:8, SubTypeMask = c(0L, 0L, 0L
    ), ControlType = c(0L, 0L, 0L), ProbeName = c("A_18_P17027306", 
    "chr1_165793426_165793473", "A_18_P14570373"), SystematicName = c("chr9:137150180-137150224", 
    "chr1:165793427-165793473", "chr3:198891339-198891384"), 
    LogRatio = c(0.08975880656, 0.1139920636, 0.1038222868), 
    LogRatioError = c(0.0619727653, 0.0625525488, 0.06214855983
    ), PValueLogRatio = c(0.1475167061, 0.06840327396, 0.09481057346
    ), gProcessedSignal = c(2550.198, 479.9035, 4755.878), rProcessedSignal = c(3135.688, 
    623.9445, 6040.224), gProcessedSigError = c(255.087, 48.34247, 
    475.6231), rProcessedSigError = c(313.597, 62.53483, 604.0367
    ), gMedianSignal = c(807.5, 188, 1464.5), rMedianSignal = c(1826, 
    405.5, 3546), gBGMedianSignal = c(38, 38, 38), rBGMedianSignal = c(43, 
    44, 44), gBGPixSDev = c(7.409993, 7.485912, 7.367351), rBGPixSDev = c(9.274448, 
    9.2318, 9.213135), gIsSaturated = c(0L, 0L, 0L), rIsSaturated = c(0L, 
    0L, 0L), gIsFeatNonUnifOL = c(0L, 0L, 0L), rIsFeatNonUnifOL = c(0L, 
    0L, 0L), gIsBGNonUnifOL = c(0L, 0L, 0L), rIsBGNonUnifOL = c(0L, 
    0L, 0L), gIsFeatPopnOL = c(0L, 0L, 0L), rIsFeatPopnOL = c(0L, 
    0L, 0L), gIsBGPopnOL = c(0L, 0L, 0L), rIsBGPopnOL = c(0L, 
    0L, 0L), IsManualFlag = c(0L, 0L, 0L), gBGSubSignal = c(772.789, 
    146.17, 1455.8), rBGSubSignal = c(1831.5, 366.544, 3568.53
    ), gIsPosAndSignif = c(1L, 1L, 1L), rIsPosAndSignif = c(1L, 
    1L, 1L), gIsWellAboveBG = c(1L, 1L, 1L), rIsWellAboveBG = c(1L, 
    1L, 1L), SpotExtentX = c(49.8279, 47.5395, 47.8731), gBGMeanSignal = c(37.7605, 
    37.804, 37.9131), rBGMeanSignal = c(43.431, 44.9921, 44.1818
    )), row.names = 6:8, class = "data.frame")

I am hoping to wrangle this data into something like a standard .BED file format for CNVs with the following columns: Chromosome, Start, End, Type, Value.

The first three columns can be extracted from column 8 (SystematicName), but I am struggling to work out how to determine the Type (Deletion or Duplication) or the Value (0, 1, 2, 3, 4, >4), as you would expect from modern CNV callers for WES/WGS.
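
For reference, this is roughly how I am pulling the coordinates out of SystematicName. It is only a minimal sketch: `parse_systematic_name` is my own hypothetical helper, it assumes every data probe is written as `chr:start-end`, and it assumes those coordinates are 1-based, so I subtract 1 from the start to get a 0-based BED start (which appears to agree with the `chr1_165793426_165793473` probe name). Anything that does not match the pattern is returned as NA rather than guessed.

```r
# The three SystematicName values from the snippet above; in practice this
# would be x$SystematicName.
systematic_name <- c("chr9:137150180-137150224",
                     "chr1:165793427-165793473",
                     "chr3:198891339-198891384")

# Hypothetical helper: split "chr:start-end" strings into BED-style columns.
# Assumption: input coordinates are 1-based inclusive, so Start is shifted by -1.
parse_systematic_name <- function(sn) {
  pattern <- "^(chr[^:]+):([0-9]+)-([0-9]+)$"
  ok <- grepl(pattern, sn)

  # Start with all-NA columns so non-matching names stay NA.
  out <- data.frame(Chromosome = rep(NA_character_, length(sn)),
                    Start      = rep(NA_integer_, length(sn)),
                    End        = rep(NA_integer_, length(sn)),
                    stringsAsFactors = FALSE)

  out$Chromosome[ok] <- sub(pattern, "\\1", sn[ok])
  out$Start[ok]      <- as.integer(sub(pattern, "\\2", sn[ok])) - 1L
  out$End[ok]        <- as.integer(sub(pattern, "\\3", sn[ok]))
  out
}

bed <- parse_systematic_name(systematic_name)
print(bed)
```

That covers Chromosome, Start and End; the Type and Value columns are the part I cannot fill in.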

I assume the final few columns, e.g. gBGMeanSignal and rBGMeanSignal, might be valuable here as they seem to show normalised abundance values, but I am unsure whether to average them or add them together.

Any guidance would be most welcome. I also see there is a p-value column (PValueLogRatio); I assume it can be used to filter out low-confidence values?

Many thanks, Krutik

CNV Agilent DNA WGS