Question

Obtaining LRR and BAF values from affymetrix SNP array

0

Entering edit mode

10.1 years ago

yogacacarolom ▴ 70

I am trying to extract the LRR and BAF values from an affymetrix SNP chip without success using linux based tools. I tried to use a small subset in windows designed software called "Axiom™ CNV Summary Tools Software" and it works perfectly. The problem is that I have a huge dataset and would be impossible to run in windows machine powerful enough.

Let's expose my steps until this point. First, I obtained five tab delimited files which are require to linux and/or windows pipeline (1-3 obtained with APT affymetrix software).

1 - The Axiom calls.txt or genotype file:

calls <- 'probeset_id    sample_1    sample_2    sample_3
    AX-100010998    2    2    2
    AX-100010999    1    0    1
    AX-100011005    0    1    2
    AX-100011007    2    2    1
    AX-100011008    1    1    2
    AX-100011010    2    2    2
    AX-100011011    0    1    0
    AX-100011012    0    1    0
    AX-100011016    0    0    1
    AX-100011017    0    0    2'
calls <- read.table(text=calls, header=T)

2 - The confidences.txt file:

conf<- 'probeset_id    sample_1    sample_2    sample_3
AX-100010998    0.00001    0.0002    0.00006
AX-100010999    0.00001    0.00001    0.00001
AX-100011005    0.00007    0.00017    0.00052
AX-100011007    0.00001    0.00001    0.00001
AX-100011008    0.001    0.00152    0.00001
AX-100011010    0.00001    0.00001    0.00002
AX-100011011    0.00004    0.00307    0.00002
AX-100011012    0.00001    0.00001    0.00001
AX-100011016    0.00003    0.00001    0.00001
AX-100011017    0.00003    0.01938    0.00032'
conf <- read.table(text=conf, header=T)

3 - The summary.txt file:

summ <- 'probeset_id    sample_1    sample_2    sample_3
AX-100010998-A    740.33229    655.41465    811.98053
AX-100010998-B    1139.25679    1659.55079    917.7128
AX-100010999-A    1285.67306    1739.03296    1083.48455
AX-100010999-B    1403.51265    341.85893    1237.48577
AX-100011005-A    1650.03408    1274.57594    485.5324
AX-100011005-B    430.3122    2674.70182    4070.90727
AX-100011007-A    411.28952    449.76345    2060.7136
AX-100011007-B    4506.77692    4107.12982    2065.58516
AX-100011008-A    427.78263    439.63541    333.86312
AX-100011008-B    1033.41335    1075.31617    1623.69271
AX-100011010-A    390.12996    350.54456    356.63156
AX-100011010-B    1183.29912    1256.01391    1650.82396
AX-100011011-A    3593.93578    2902.34079    2776.2503
AX-100011011-B    867.33447    2252.54552    961.31596
AX-100011012-A    2250.44699    1192.46116    1927.70581
AX-100011012-B    740.31957    1721.70283    662.1414
AX-100011016-A    1287.9221    1367.95468    1037.98191
AX-100011016-B    554.8795    666.93132    1487.2143
AX-100011017-A    2002.40468    1787.42982    490.28802
AX-100011017-B    849.92775    1025.44417    1429.96567'
summ <- read.table(text=summ, header=T)

4 - The gender.txt:

gender <- 'cel_files    gender
sample_1    female
sample_2    female
sample_3    female'
gender <- read.table(text=gender, header=F)

And finally the map file map.db in windows (non readable) or map.txt in linux as follows:

map <- 'Name    Chr    Position
AX-100010998    Z    70667736
AX-100010999    4    36427048
AX-100011005    26    4016045
AX-100011007    6    25439800
AX-100011008    2    147800617
AX-100011010    1    98919397
AX-100011011    Z    66652642
AX-100011012    7    28180218
AX-100011016    1A    33254907
AX-100011017    5    1918020'
map <- read.table(text=map, header=T)

This is my result in windows based result for sample_1:

Name    Chr    Position    sample_1.GType    sample_1.Log R Ratio    sample_1.B Allele Freq
AX-100010998    Z    70667736    BB    Infinity    0.675637419295063
AX-100010999    4    36427048    AB    0.101639462657534    0.531373516807123
AX-100011005    26    4016045    AA    -0.111910305454305    0
AX-100011007    6    25439800    BB    0.148781943283483    1
AX-100011008    2    147800617    AB    -0.293273363654622    0.609503132331127
AX-100011010    1    98919397    BB    -0.283993308525307    0.960031843823016
AX-100011011    Z    66652642    AA    Infinity    0.00579049667757003
AX-100011012    7    28180218    AA    0.0245684274744242    0.032174599843476
AX-100011016    1A    33254907    AA    -0.265925457515035    0
AX-100011017    5    1918020    AA    -0.0091211520536838    0

The values from the windows based tool seems to be correct, but in linux output that's is not the case. I am following the steps decribed at penncnv software (http://penncnv.openbioinformatics.org/en/latest/user-guide/input/) and I log2 transformed my summary.txt and did the quantile normalization with limma package using normalizeBetweenArrays(x), finishing with the corrsummary.txt:

corrsum <- 'probeset_id  sample_1  sample_2  sample_3
    AX-100010998-A  9.804932  9.285738  9.530882
    AX-100010998-B 10.249239 10.528922  9.804932
    AX-100010999-A 10.528922 10.641862 10.134816
    AX-100010999-B 10.641862  8.472829 10.249239
    AX-100011005-A 10.804446 10.249239  8.816931
    AX-100011005-B  8.835381 11.186266 12.045852
    AX-100011007-A  8.542343  8.835381 11.039756
    AX-100011007-B 12.045852 12.045852 11.186266
    AX-100011008-A  8.816931  8.816931  8.472829
    AX-100011008-B 10.134816  9.910173 10.592867
    AX-100011010-A  8.472829  8.542343  8.542343
    AX-100011010-B 10.374032 10.134816 10.641862
    AX-100011011-A 11.593784 11.593784 11.593784
    AX-100011011-B 10.012055 11.039756  9.910173
    AX-100011012-A 11.186266 10.012055 10.804446
    AX-100011012-B  9.530882 10.592867  9.285738
    AX-100011016-A 10.592867 10.374032 10.012055
    AX-100011016-B  9.285738  9.530882 10.528922
    AX-100011017-A 11.039756 10.804446  8.835381
    AX-100011017-B  9.910173  9.804932 10.374032'
corrsum <- read.table(text=corrsum, header=T)

Thus I applied:

./generate_affy_geno_cluster.pl calls.txt confidences.txt corrsummary.txt --locfile map.txt --sexfile gender.txt --output gencluster

and

./normalize_affy_geno_cluster.pl --locfile map.txt gencluster calls.txt --output lrrbaf.txt

And my linux based result (lrrbaf.txt) which must contain LRR and BAF information looks like that:

output <- 'Name    Chr     Position        sample_1.LogRRatio    sample_1.BAlleleFreq  sample_2.LogRRatio    sample_2.BAlleleFreq  sample_3.LogRRatio    sample_3.BAlleleFreq
    AX-100010999    4       36427048        -1952.0739      2       -1953.0739      2       -1952.0739      2
    AX-100011005    26      4016045 -2245.1784      2       -2244.1784      2       -2243.1784      2
    AX-100011007    6       25439800        -4433.4661      2       -4433.4661      2       -4434.4661      2
    AX-100011008    2       147800617       -1493.2287      2       -1493.2287      2       -1492.2287      2
    AX-100011011    Z       66652642        -4088.2311      2       -4087.2311      2       -4088.2311      2
    AX-100011012    7       28180218        -2741.2623      2       -2740.2623      2       -2741.2623      2
    AX-100011016    1A      33254907        -2117.7005      2       -2117.7005      2       -2116.7005      2
    AX-100011017    5       1918020 -3067.4077      2       -3067.4077      2       -3065.4077      2'

output <- read.table(text=output, header=T)

As showed above the linux result is completely different from windows based results (and make much less sense) and additionally do not contain the GType column in the output. Sorry to compose such a long question, but my intention was to make it as reproducible as possible. I would be grateful for any light to solve this problem as well any important remarks about this kind of data that I maybe forgot.

array r • 2.2k views

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.1 years ago by yogacacarolom ▴ 70