I am new to bioinformatics, and I have some new SNP data from an Affymetrix Axiom array. I have exported the genotypes into a giant tab-delimited text file in which each row is a SNP, identified by its rsID, and each column is a sample.
Due to a quirk of the Axiom Human Origins array, there are ~4000 SNPs that were genotyped twice for each sample. For whatever reason, the Affymetrix Genotyping Console does not merge the genotypes for these probes, so each of these SNPs shows up twice in my data. Furthermore, the array designers fear these SNPs may actually be triallelic, which is one more reason I would rather not deal with them (ftp://ftp.cephb.fr/hgdp_supp10/8_12_2011_Technical_Array_Design_Document.pdf).
I have this big table of genotyping data. Can someone show me a template Python (or maybe Perl) script I can use to filter out the ~8000 lines that contain one of the offending rsIDs? I have a basic grasp of these languages, but I don't know how to do this on my own. Thanks!
Do you have access to a Linux-based OS?
I have a Bio-Linux (Ubuntu) VirtualBox that I run through Windows.
And can you show us what the very first lines look like?
Here are the first few lines. The later columns have been deleted to make it easier to read (there are 92 samples in the original file).
Probe Set ID Chromosome Chromosomal Position dbSNP RS ID Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10 Sample11
AFFX-KIT-000001 9 101258881 rs1000440 TC TC TC TT TC CC TC CC TC CC CC
AFFX-KIT-000002 4 164934874 rs10007601 AG AG AG AG AG AA AA AA AG AA AA
AFFX-KIT-000003 5 163542505 rs10056215 TT TT TG TT TT TT TT TG TT TT TT
AFFX-KIT-000004 5 2993645 rs10075407 GG AA AG AA AA AA AG AG AG GG AG
AFFX-KIT-000008 2 149188375 rs10196277 TG TT GG GG TG TG TT TT TT GG TG
AFFX-KIT-000009 12 51789617 rs1021996 CC CC TC TC CC CC CC CC CC TC CC
AFFX-KIT-000012 7 152738841 rs10266230 TT TC TT TC CC TC TC TT TC CC TT
AFFX-KIT-000014 13 27573612 rs10507375 GG GG TG GG GG GG GG GG GG GG GG
AFFX-KIT-000015 17 44878268 rs10514911 CC TC CC CC CC CC CC CC CC CC CC
AFFX-KIT-000016 5 168199301 rs10516050 CC TT TC CC CC CC CC CC CC CC TC
AFFX-KIT-000017 12 57489709 rs1059513 TC TC TT TT TC TT TT TT TT TT TT
AFFX-KIT-000018 10 129853731 rs10741141 AG GG AA AA AA AG GG GG AG AA AG
AFFX-KIT-000019 12 61604963 rs10784186 AC AA AC AA AA AA AA AC AC --- CC
AFFX-KIT-000021 4 28895078 rs10805281 GG AA GG AA AA AG AG AG AA AA AA
AFFX-KIT-000022 10 130069931 rs10829369 TC TC TC TC TT TC TC TT TT CC CC
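Given a table laid out like the sample above, one way to approach the filtering is the minimal Python sketch below. It is not a tested pipeline and rests on several assumptions: the file really is tab-delimited, the first line is the header, the rsID column is named "dbSNP RS ID" as shown, and the offending SNPs can be recognized as rsIDs that occur on more than one row. The function names `find_duplicated_rsids` and `filter_rows` and the `MISSING` placeholder set are made up for illustration. If you would rather take the list of SNPs from the array design document, you can pass your own set of rsIDs to `filter_rows` instead.

```python
import csv
from collections import Counter

RSID_COLUMN = "dbSNP RS ID"  # column header as it appears in the sample above
MISSING = {"", "---"}        # assumption: values that mean "no rsID assigned"


def find_duplicated_rsids(path, rsid_column=RSID_COLUMN):
    """Return the set of rsIDs that occur on more than one data row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        col = next(reader).index(rsid_column)  # locate the rsID column by name
        counts = Counter(row[col] for row in reader if len(row) > col)
    return {rsid for rsid, n in counts.items()
            if n > 1 and rsid not in MISSING}


def filter_rows(in_path, out_path, bad_rsids, rsid_column=RSID_COLUMN):
    """Copy in_path to out_path, skipping rows whose rsID is in bad_rsids.

    Returns the number of rows that were dropped.
    """
    dropped = 0
    with open(in_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        header = next(reader)
        col = header.index(rsid_column)
        writer.writerow(header)  # keep the header line unchanged
        for row in reader:
            if len(row) > col and row[col] in bad_rsids:
                dropped += 1     # offending SNP: do not write it out
                continue
            writer.writerow(row)
    return dropped
```

To use it, call `bad = find_duplicated_rsids("genotypes.txt")` and then `filter_rows("genotypes.txt", "genotypes.filtered.txt", bad)`, where both file names are placeholders for your own paths. Both copies of each duplicated SNP are removed, so the returned count should come out near 8000 if the duplicates are the only repeated rsIDs in the file. The input is read line by line and never modified, so you can compare the two files afterwards.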