I am trying to extract the LRR and BAF values from an affymetrix SNP chip without success using linux based tools. I tried to use a small subset in windows designed software called "Axiomâ„¢ CNV Summary Tools Software" and it works perfectly. The problem is that I have a huge dataset and would be impossible to run in windows machine powerful enough.
Let's expose my steps until this point. First, I obtained five tab delimited files which are require to linux and/or windows pipeline (1-3 obtained with APT affymetrix software).
1 - The Axiom calls.txt or genotype file:
calls <- 'probeset_id sample_1 sample_2 sample_3
AX-100010998 2 2 2
AX-100010999 1 0 1
AX-100011005 0 1 2
AX-100011007 2 2 1
AX-100011008 1 1 2
AX-100011010 2 2 2
AX-100011011 0 1 0
AX-100011012 0 1 0
AX-100011016 0 0 1
AX-100011017 0 0 2'
calls <- read.table(text=calls, header=T)
2 - The confidences.txt file:
conf<- 'probeset_id sample_1 sample_2 sample_3
AX-100010998 0.00001 0.0002 0.00006
AX-100010999 0.00001 0.00001 0.00001
AX-100011005 0.00007 0.00017 0.00052
AX-100011007 0.00001 0.00001 0.00001
AX-100011008 0.001 0.00152 0.00001
AX-100011010 0.00001 0.00001 0.00002
AX-100011011 0.00004 0.00307 0.00002
AX-100011012 0.00001 0.00001 0.00001
AX-100011016 0.00003 0.00001 0.00001
AX-100011017 0.00003 0.01938 0.00032'
conf <- read.table(text=conf, header=T)
3 - The summary.txt file:
summ <- 'probeset_id sample_1 sample_2 sample_3
AX-100010998-A 740.33229 655.41465 811.98053
AX-100010998-B 1139.25679 1659.55079 917.7128
AX-100010999-A 1285.67306 1739.03296 1083.48455
AX-100010999-B 1403.51265 341.85893 1237.48577
AX-100011005-A 1650.03408 1274.57594 485.5324
AX-100011005-B 430.3122 2674.70182 4070.90727
AX-100011007-A 411.28952 449.76345 2060.7136
AX-100011007-B 4506.77692 4107.12982 2065.58516
AX-100011008-A 427.78263 439.63541 333.86312
AX-100011008-B 1033.41335 1075.31617 1623.69271
AX-100011010-A 390.12996 350.54456 356.63156
AX-100011010-B 1183.29912 1256.01391 1650.82396
AX-100011011-A 3593.93578 2902.34079 2776.2503
AX-100011011-B 867.33447 2252.54552 961.31596
AX-100011012-A 2250.44699 1192.46116 1927.70581
AX-100011012-B 740.31957 1721.70283 662.1414
AX-100011016-A 1287.9221 1367.95468 1037.98191
AX-100011016-B 554.8795 666.93132 1487.2143
AX-100011017-A 2002.40468 1787.42982 490.28802
AX-100011017-B 849.92775 1025.44417 1429.96567'
summ <- read.table(text=summ, header=T)
4 - The gender.txt:
gender <- 'cel_files gender
sample_1 female
sample_2 female
sample_3 female'
gender <- read.table(text=gender, header=F)
And finally the map file map.db in windows (non readable) or map.txt in linux as follows:
map <- 'Name Chr Position
AX-100010998 Z 70667736
AX-100010999 4 36427048
AX-100011005 26 4016045
AX-100011007 6 25439800
AX-100011008 2 147800617
AX-100011010 1 98919397
AX-100011011 Z 66652642
AX-100011012 7 28180218
AX-100011016 1A 33254907
AX-100011017 5 1918020'
map <- read.table(text=map, header=T)
This is my result in windows based result for sample_1:
Name Chr Position sample_1.GType sample_1.Log R Ratio sample_1.B Allele Freq
AX-100010998 Z 70667736 BB Infinity 0.675637419295063
AX-100010999 4 36427048 AB 0.101639462657534 0.531373516807123
AX-100011005 26 4016045 AA -0.111910305454305 0
AX-100011007 6 25439800 BB 0.148781943283483 1
AX-100011008 2 147800617 AB -0.293273363654622 0.609503132331127
AX-100011010 1 98919397 BB -0.283993308525307 0.960031843823016
AX-100011011 Z 66652642 AA Infinity 0.00579049667757003
AX-100011012 7 28180218 AA 0.0245684274744242 0.032174599843476
AX-100011016 1A 33254907 AA -0.265925457515035 0
AX-100011017 5 1918020 AA -0.0091211520536838 0
The values from the windows based tool seems to be correct, but in linux output that's is not the case. I am following the steps decribed at penncnv software (http://penncnv.openbioinformatics.org/en/latest/user-guide/input/) and I log2 transformed my summary.txt and did the quantile normalization with limma package using normalizeBetweenArrays(x), finishing with the corrsummary.txt:
corrsum <- 'probeset_id sample_1 sample_2 sample_3
AX-100010998-A 9.804932 9.285738 9.530882
AX-100010998-B 10.249239 10.528922 9.804932
AX-100010999-A 10.528922 10.641862 10.134816
AX-100010999-B 10.641862 8.472829 10.249239
AX-100011005-A 10.804446 10.249239 8.816931
AX-100011005-B 8.835381 11.186266 12.045852
AX-100011007-A 8.542343 8.835381 11.039756
AX-100011007-B 12.045852 12.045852 11.186266
AX-100011008-A 8.816931 8.816931 8.472829
AX-100011008-B 10.134816 9.910173 10.592867
AX-100011010-A 8.472829 8.542343 8.542343
AX-100011010-B 10.374032 10.134816 10.641862
AX-100011011-A 11.593784 11.593784 11.593784
AX-100011011-B 10.012055 11.039756 9.910173
AX-100011012-A 11.186266 10.012055 10.804446
AX-100011012-B 9.530882 10.592867 9.285738
AX-100011016-A 10.592867 10.374032 10.012055
AX-100011016-B 9.285738 9.530882 10.528922
AX-100011017-A 11.039756 10.804446 8.835381
AX-100011017-B 9.910173 9.804932 10.374032'
corrsum <- read.table(text=corrsum, header=T)
Thus I applied:
./generate_affy_geno_cluster.pl calls.txt confidences.txt corrsummary.txt --locfile map.txt --sexfile gender.txt --output gencluster
and
./normalize_affy_geno_cluster.pl --locfile map.txt gencluster calls.txt --output lrrbaf.txt
And my linux based result (lrrbaf.txt) which must contain LRR and BAF information looks like that:
output <- 'Name Chr Position sample_1.LogRRatio sample_1.BAlleleFreq sample_2.LogRRatio sample_2.BAlleleFreq sample_3.LogRRatio sample_3.BAlleleFreq
AX-100010999 4 36427048 -1952.0739 2 -1953.0739 2 -1952.0739 2
AX-100011005 26 4016045 -2245.1784 2 -2244.1784 2 -2243.1784 2
AX-100011007 6 25439800 -4433.4661 2 -4433.4661 2 -4434.4661 2
AX-100011008 2 147800617 -1493.2287 2 -1493.2287 2 -1492.2287 2
AX-100011011 Z 66652642 -4088.2311 2 -4087.2311 2 -4088.2311 2
AX-100011012 7 28180218 -2741.2623 2 -2740.2623 2 -2741.2623 2
AX-100011016 1A 33254907 -2117.7005 2 -2117.7005 2 -2116.7005 2
AX-100011017 5 1918020 -3067.4077 2 -3067.4077 2 -3065.4077 2'
output <- read.table(text=output, header=T)
As showed above the linux result is completely different from windows based results (and make much less sense) and additionally do not contain the GType column in the output. Sorry to compose such a long question, but my intention was to make it as reproducible as possible. I would be grateful for any light to solve this problem as well any important remarks about this kind of data that I maybe forgot.