parsing Affymetrix file
5.4 years ago

I need some help understanding the structure of a variant call file containing Affymetrix data.

I have a RAW data file in the following format:

probeset_id     CEL_call_code   chromosome  position    rsid
AFFX-SP-000001  CC              10          121336954   rs10466213
AFFX-SP-000002  CG              12          23048418    rs10770943
AFFX-SP-000004  GG              17          56334747    rs11079221
AFFX-SP-000005  GG              11          85910686    rs12285109
AFFX-SP-000006  CG              15          60865412    rs12913890

The problem I have is that some SNPs appear more than once under the same rsID with different call codes. In some cases the rows share the same position and in others the positions differ, as in these two examples:

Same position with different CEL call code

probeset_id     CEL_call_code   chromosome  position    rsid
AX-96108113     AC              4           6301295     rs1801214
AX-96108115     TC              4           6301295     rs1801214

Different position with different CEL call code

probeset_id     CEL_call_code   chromosome  position    rsid
AX-123355923    CACA            7           117642463   rs121908784
AX-96064890     AA              7           117642464   rs121908784

How is this possible, and how do I know which is the CORRECT CEL call code for SNPs that appear multiple times in the same file?

Thank you, any suggestion would be very much appreciated.

SNP sequence Assembly • 2.1k views
5.4 years ago

Which array version is this? The probes with AFFX prefix are control probes. You do not really have to worry about them - they will be used during normalisation. See my answer, here (more relating to Affymetrix expression arrays): A: Control probe sets in Affymetrix ST microarrays

For the others, it makes sense that there are multiple probes targeting each position where there is a rs ID because, by their very nature, SNP genotypes vary among populations. So, the probes have to be designed to account for the different possible genotypes at each position. If the individual is heterozygous for the SNP, both probes should bind and return signal; if homozygous, only one should bind, while the other's signal remains in the background 'noise'.

I write more here: A: Genotyping, genotype calling or SNP calling?


How do I find out which array version it is?

What I need to do is write a script to extract the correct SNPs and their genotypes. Because the same SNP can have different call codes, I don't know how to choose which one is correct.

You mentioned normalization, but shouldn't this file already be normalized?


Well, from where did you obtain this data? The array version is likely stored in the CEL file header information, which may or may not be accessible. What is the ultimate aim of your work?


I need to transform the file from this format

probeset_id     CEL_call_code   chromosome  position    rsid
AFFX-SP-000001  CC              10          121336954   rs10466213
AFFX-SP-000002  CG              12          23048418    rs10770943
AFFX-SP-000004  GG              17          56334747    rs11079221
AFFX-SP-000005  GG              11          85910686    rs12285109
AFFX-SP-000006  CG              15          60865412    rs12913890

into something like

RSID           Chromosome       Position     Genotype
rs12913890     15               60865412     CG
rs12285109     11               85910686     GG

You can do that in Shell scripting using cut or awk:

cat test
probeset_id CEL_call_code   chromosome  position    rsid
AFFX-SP-000001  CC  10  121336954   rs10466213
AFFX-SP-000002  CG  12  23048418    rs10770943
AFFX-SP-000004  GG  17  56334747    rs11079221
AFFX-SP-000005  GG  11  85910686    rs12285109
AFFX-SP-000006  CG  15  60865412    rs12913890

awk '{print $5"\t"$3"\t"$4"\t"$2}' test 
rsid    chromosome  position    CEL_call_code
rs10466213  10  121336954   CC
rs10770943  12  23048418    CG
rs11079221  17  56334747    GG
rs12285109  11  85910686    GG
rs12913890  15  60865412    CG
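
If the AFFX control probes should be left out and the header renamed, the same awk command can be extended with a filter. This is only a minimal sketch: it assumes the file is whitespace-delimited as shown, that every control probe ID starts with AFFX, and that dropping those rows suits your aim. The file names test and reformatted.tsv and the sample rows are placeholders.

```shell
# Recreate a small tab-separated example file.
printf 'probeset_id\tCEL_call_code\tchromosome\tposition\trsid\n' > test
printf 'AFFX-SP-000001\tCC\t10\t121336954\trs10466213\n' >> test
printf 'AX-96108113\tAC\t4\t6301295\trs1801214\n' >> test
printf 'AX-107672496\tGG\t1\t52794278\trs11206019\n' >> test

# Print a new header, skip the original header (NR == 1) and any row whose
# probeset_id starts with AFFX, then emit rsid, chromosome, position and
# call code separated by tabs.
awk 'BEGIN { OFS = "\t"; print "RSID", "Chromosome", "Position", "Genotype" }
     NR > 1 && $1 !~ /^AFFX/ { print $5, $3, $4, $2 }' test > reformatted.tsv

cat reformatted.tsv
```

Remove the `$1 !~ /^AFFX/` condition if the AFFX rows should stay in the output.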

The problem I have is that some SNPs appear on different rows of the same file with different genotypes, so I don't know which one is the correct genotype.

probeset_id     CEL_call_code   chromosome  position    rsid
AX-96108113     AC              4           6301295     rs1801214
AX-96108115     TC              4           6301295     rs1801214

The above is a real example: rs1801214 has T/C on one line and A/C on the other. From what I know, one person cannot have different values for the same SNP, which means that one of these values, T/C or A/C, should be ignored or maybe merged!?!
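
To see how many rsIDs are affected, the rows can be grouped by rsid and only those with more than one distinct call code reported. This is a rough sketch that assumes the five-column layout shown above; the file names calls.txt and conflicts.txt are placeholders. It only lists the conflicts and does not decide which call is correct.

```shell
# Recreate a small tab-separated example file.
printf 'probeset_id\tCEL_call_code\tchromosome\tposition\trsid\n' > calls.txt
printf 'AX-96108113\tAC\t4\t6301295\trs1801214\n' >> calls.txt
printf 'AX-96108115\tTC\t4\t6301295\trs1801214\n' >> calls.txt
printf 'AX-107672496\tGG\t1\t52794278\trs11206019\n' >> calls.txt

# Collect the distinct call codes seen for each rsid, then report
# only the rsids for which more than one distinct code was seen.
awk 'NR > 1 {
         key = $5 SUBSEP $2
         if (!(key in seen)) {
             seen[key] = 1
             n[$5]++
             codes[$5] = (codes[$5] == "" ? $2 : codes[$5] "," $2)
         }
     }
     END {
         for (id in n) if (n[id] > 1) print id "\t" codes[id]
     }' calls.txt > conflicts.txt

cat conflicts.txt
```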


A person could indeed have both genotypes present if they inherited one from their mother and the other from their father. In this case, both of these probes would fluoresce and return signal above the background threshold.

I do not know what your ultimate aim is, so, cannot really comment further. Note that each SNP will have an associated allele frequency, indicating its frequency in a given population.


I don't understand how this is possible. In that case, why do all the genotypes contain two letters (in my example above, AA or TC) instead of one?


In your example, the sequences are AC and TC. Each of us carries 2 copies of each autosomal chromosome. Based on the fusion of gametes, one from the mother and one from the father, our DNA can differ at individual bases. For example, at rs1801214, you may inherit A from your mother and T from your father.

Without further information on what you are trying to do and from where you obtained your data, I cannot really help you any further.


Are you saying that when it says AC, it doesn't mean heterozygous A from one parent and C from another parent, but that the probe picked up AC at location 4:6301295-6301296?

I find that hard to believe, as most of the lines in the file I am looking at are two identical letters, which would mean that AA is an A/A homozygous call.

To me it just looks like the file format doesn't put a slash between the two alleles.

AX-107672483    GGAGGGAG    19  16406293    rs150023256
AX-107672496    GG  1   52794278    rs11206019
AX-107672683    CTGACTGA    17  59113900    rs147383186

so that would be GGAG/GGAG, G/G and CTGA/CTGA. You wouldn't need two lines with the same rsID to tell if it was homozygous or heterozygous at each location.
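
Under that reading, the slash can be put back by cutting each call code in half. This is a sketch of my interpretation only, not a documented rule of the file format: it assumes both alleles always have the same length, which would not hold for, say, a heterozygous insertion or deletion. The file names multi.txt and split.txt are placeholders.

```shell
# Recreate a few example rows (tab-separated, no header).
printf 'AX-107672483\tGGAGGGAG\t19\t16406293\trs150023256\n' > multi.txt
printf 'AX-107672496\tGG\t1\t52794278\trs11206019\n' >> multi.txt
printf 'AX-96108113\tAC\t4\t6301295\trs1801214\n' >> multi.txt

# Split each call code into two equal halves and join them with a slash.
# Codes of odd length cannot be halved, so they are passed through unchanged.
awk 'BEGIN { OFS = "\t" }
     {
         len = length($2)
         if (len % 2 == 0) {
             half = len / 2
             $2 = substr($2, 1, half) "/" substr($2, half + 1)
         }
         print
     }' multi.txt > split.txt

cat split.txt
```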
