Question

Environmental Genome Project Data - Incorporated Into Any Other Databases?

10.8 years ago

pilotlog ▴ 40

Hi everyone,

I recently came across the National Institute of Environmental Health Sciences (NIEHS) Environmental Genome Project (EGP) paper (2004 Livingston - Pattern of sequence variation across 213 environmental response genes... http://www.ncbi.nlm.nih.gov/pubmed/15364900). As I am selecting tag SNPs in a set of DNA repair genes, I was interested in checking whether they have more genotype data/SNPs that I could use in addition to HapMap.

The data seems to be located on this site: http://egp.gs.washington.edu/data_download.html. Since this is a fairly old paper, I was wondering whether the data has been incorporated into any other data repositories (I don't want to accidentally double up on data). The paper said that they uploaded their newly discovered SNPs and SNP frequency data to dbSNP, but I wasn't sure whether the genotype data itself would be there, or how to get it if it was. Does anyone know anything about this project that could help me? A lot of the links in the original paper are dead.

I am downloading their bulk file anyway to have a look, but I am on a deadline and thought I would see whether anyone here already knows something about this, to save some time. I don't even know whether it would be useful, i.e. whether it gives denser SNP data than the current HapMap for tag SNP selection/LD determination.

Thanks if anyone has any info!

score 0 · Answer 1 · 2013-07-15

I am not sure whether I should post this as a comment or an answer, but I have been looking through the EGP data and thought I would share what I found. Please feel free to add any comments or advice.

I downloaded the files from the website above. Within the main bulk data file there is a folder for each gene that they covered, so I picked one smallish gene to try out (XRCC2). Each folder contains lots of files, most of which I didn't look at, but the one that seems to contain the genotyping data per SNP per sample is called xrcc2.prettybase.txt. It has 4 columns: col 1 = EGP's internal ID number for each SNP, col 2 = EGP's internal ID number for each person's sample, col 3 and col 4 = letters (A/G/T/C/N/-) representing allele 1 and allele 2 respectively.

000077 P001 G G

000077 P002 G G

000077 P003 N N

[Sorry, I don't really know if that's how to post it properly.]
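
For anyone who would rather script this than use Excel, here is a minimal Python sketch of reading those four columns. It assumes the columns are separated by whitespace (I have not checked whether the real file uses tabs or spaces), and the names `Call` and `parse_prettybase` are made up for illustration.

```python
from collections import namedtuple

# One row of the prettybase-style file: SNP ID, sample ID, allele 1, allele 2.
Call = namedtuple("Call", ["snp_id", "sample_id", "allele1", "allele2"])

def parse_prettybase(lines):
    """Yield one Call per non-empty line (assumes whitespace-separated columns)."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        yield Call(*fields)

# The three example rows shown above.
example = """\
000077 P001 G G
000077 P002 G G
000077 P003 N N
"""
calls = list(parse_prettybase(example.splitlines()))
```

For the real file, `parse_prettybase(open("xrcc2.prettybase.txt"))` would be the equivalent call.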

Some samples/SNPs had insertions/deletions that were more than one letter long; deletions were shown as "-", and I guessed that N means missing. Both the SNP ID and the sample ID were in EGP's internal labelling system and there was no chromosome position data, so the file wasn't much help until I found two files that relate the EGP internal IDs to dbSNP rsIDs and Coriell sample IDs. These files were:

(1) sample id.txt = this one had two columns, EGP internal sample ID and corresponding Coriell sample ID beginning with NA.

EGP ID Coriell ID

P001 NA15029

P002 NA15036

There are 90 sample IDs, running from Coriell ID NA15029 to NA15596; some of them have an A or B on the end. I compared the IDs with the ones from my HapMap CEU genotype dump for XRCC2 and none of them overlapped. The Livingston 2004 paper said this about the study population: "The study population consisted of 90 DNA samples obtained from the PDR (Collins et al. 1998), a publicly available, anonymous collection of individuals that is representative of the ethnic diversity of the United States population (Coriell Institute; http://locus.umdnj.edu/nigms/products/pdr.html). The panel includes 24 European, 24 Asian, 24 African, 12 Hispanic, and six Native American samples." I haven't determined which IDs belong to which ethnic group, so for now I have just looked at everything together.

(2) rsEGP.txt.gz = this one contained the gene HUGO name, the gene symbol and the EGP internal SNP ID (separated by "-"), the chromosome number, hg17 location, hg18 location, and dbSNP b126 rs ID.

xrcc2 XRCC2-000077 chr7 152005602 rs3218368

xrcc2 XRCC2-000297 chr7 152005382 rs3218369

As I don't really know how to use Linux/Unix well, and I don't know any programming languages, I opened the files in Excel and used Excel formulas (mainly variants of VLOOKUP) to match the EGP internal IDs in the xrcc2.prettybase.txt file to the corresponding Coriell sample IDs and rsIDs.
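
The VLOOKUP step amounts to two dictionary lookups. The Python sketch below is only an illustration under assumptions: it treats both files as whitespace-separated, takes the second field of rsEGP.txt.gz as SYMBOL-ID and the last field as the rsID, and the function names are hypothetical.

```python
def load_sample_map(lines):
    """Map EGP sample ID -> Coriell ID; lines that are not 'ID NAxxxxx' are skipped."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2 and fields[1].startswith("NA"):
            mapping[fields[0]] = fields[1]
    return mapping

def load_rs_map(lines):
    """Map (gene symbol, EGP SNP ID) -> rsID.

    Assumes field 2 looks like 'XRCC2-000077' and the last field is the rsID.
    """
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or not fields[-1].startswith("rs"):
            continue  # header, blank line, or SNP without an rsID
        symbol, _, snp_id = fields[1].rpartition("-")
        mapping[(symbol.upper(), snp_id)] = fields[-1]
    return mapping

# Rows copied from the examples above.
sample_map = load_sample_map(["EGP ID Coriell ID", "P001 NA15029", "P002 NA15036"])
rs_map = load_rs_map([
    "xrcc2 XRCC2-000077 chr7 152005602 rs3218368",
    "xrcc2 XRCC2-000297 chr7 152005382 rs3218369",
])

# Equivalent of one VLOOKUP per column for the row '000077 P001 G G'.
rs_id = rs_map.get(("XRCC2", "000077"))
coriell_id = sample_map.get("P001")
```

Using `.get` returns `None` for an ID that has no match, which plays the role of the #N/A that VLOOKUP gives.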

I also copied the rsIDs from my HapMap CEU genotype dump for the XRCC2 region into Excel and checked whether the EGP data had any additional SNPs (i.e. whether it might be worth formatting the data to load into Haploview). The EGP data has 203 rsIDs for XRCC2 and HapMap has only 93 SNPs in the same region, with 68 SNPs shared between the two. So, 135 EGP SNPs are not in HapMap and 25 HapMap SNPs are not in EGP. As different individuals were typed in EGP and HapMap (no overlap in sample IDs), I thought it was worth looking into further.
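
The same comparison can be done with set operations. This is a minimal Python sketch; `compare_rsids` is a made-up helper, and the two short lists are toy inputs (the rsIDs appear in this post, but which list each one belongs to here is invented for the example).

```python
def compare_rsids(egp_ids, hapmap_ids):
    """Split two collections of rsIDs into shared, EGP-only and HapMap-only sets."""
    egp, hapmap = set(egp_ids), set(hapmap_ids)
    return {
        "shared": egp & hapmap,       # in both datasets
        "egp_only": egp - hapmap,     # extra SNPs EGP would add
        "hapmap_only": hapmap - egp,  # SNPs only HapMap has
    }

# Toy lists only; the real inputs would be the 203 EGP and 93 HapMap rsIDs.
toy_egp = ["rs3218368", "rs3218369"]
toy_hapmap = ["rs3218369", "rs6961422"]
result = compare_rsids(toy_egp, toy_hapmap)
counts = {name: len(ids) for name, ids in result.items()}
```

With the real counts the arithmetic is 203 - 68 = 135 EGP-only SNPs and 93 - 68 = 25 HapMap-only SNPs.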

Using Excel formulas to sift through and compare the alleles observed for each SNP, I found 2 triallelic SNPs, 10 indels of more than 1bp and 4 indels of 1bp in the data, and I filtered them all out. Using Excel formulas again, I put the two alleles together to make a genotype for each SNP and re-organised the dataset into a one-marker-per-line format, with rs#, alleles (e.g. A/G), chrom and pos in the first 4 columns, the 90 Coriell IDs as the next 90 column headers, and the genotypes for that marker under each Coriell ID. Basically I tried to make it look like the HapMap genotype dump file. I even added the same first few lines that start with #, i.e.:
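
The filtering and reshaping step could look roughly like this in Python. It is a sketch under assumptions, not what I actually ran: "N" is treated as missing, anything other than a single A/C/G/T is treated as an indel, and the helper names and toy genotypes are invented for illustration.

```python
MISSING = "N"          # assumed missing-data code
BASES = {"A", "C", "G", "T"}

def observed_alleles(allele_pairs):
    """Distinct alleles seen across all samples, ignoring the missing code."""
    return {a for pair in allele_pairs for a in pair if a != MISSING}

def is_simple_biallelic(allele_pairs):
    """True if at most two distinct alleles were seen and all are single bases.

    This drops triallelic SNPs and anything containing '-' or multi-letter alleles.
    """
    alleles = observed_alleles(allele_pairs)
    return len(alleles) <= 2 and alleles <= BASES

def marker_row(rs_id, calls_by_sample, sample_order):
    """Build [rs#, alleles, genotype per sample]; absent or missing calls become NN."""
    pairs = [calls_by_sample.get(s, (MISSING, MISSING)) for s in sample_order]
    alleles = "/".join(sorted(observed_alleles(pairs)))
    genotypes = ["NN" if MISSING in p else "".join(sorted(p)) for p in pairs]
    return [rs_id, alleles] + genotypes

# Toy calls for one marker; the genotypes here are made up.
toy_calls = {"NA15029": ("T", "C"), "NA15036": ("T", "T")}
row = marker_row("rs6961422", toy_calls, ["NA15029", "NA15036", "NA15024"])
```

A marker would only be passed to `marker_row` if `is_simple_biallelic` returns `True` for its allele pairs. The remaining HapMap columns (chrom, pos, strand and so on) would still need to be filled in between the alleles and the genotypes.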

#Wed Jul 10 11:46:24 2013: HapMap genotype data dump, SNPs genotyped in population CEU on chr7:151969353..152009352

#For details on file format, see http://www.hapmap.org/genotypes/

rs# alleles chrom pos strand assembly# center protLSID assayLSID panelLSID QCcode NA15024 NA15025 NA15028 NA15029

I put dummy values that resemble the HapMap data in all the columns I didn't have, like protLSID and QCcode, since I didn't know whether they would matter to Haploview or not. Here's my first line:

rs6961422 C/T chr7 151969561 + ncbi_b36 perlegen urn:lsid:perlegen.hapmap.org:Protocol:Genotyping_1.0.0:2 urn:lsid:perlegen.hapmap.org:Assay:25175.5570928:1 urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1 QC+ NN TT NN NN TT CT TT NN NN

I concatenated the columns with a space " " between them and saved the result as a txt file. It wouldn't load in Haploview though; I got a strange-looking error message saying "HapMap data format error: c h r 7 : 1 5 1 9 6 9 3 5 3 . . 1 5 2 0 0 9 3 5 2". I was thinking that maybe Excel has added something to the cells, like quotation marks or spaces? Or maybe I shouldn't have concatenated but saved as a space-delimited file instead?
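
One way to check what Excel actually wrote is to look at the raw bytes of the saved file. The sketch below is only a diagnostic idea in Python: the spaced-out characters in the error message might mean the file was saved in a two-byte encoding such as UTF-16, but that is a guess and nothing here confirms what Haploview was objecting to. The function name is hypothetical.

```python
import codecs

def describe_encoding_problems(raw):
    """List things in the raw bytes that a plain-ASCII parser might choke on."""
    problems = []
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        problems.append("UTF-16 byte-order mark")
    elif raw.startswith(codecs.BOM_UTF8):
        problems.append("UTF-8 byte-order mark")
    if b"\x00" in raw:
        problems.append("NUL bytes (typical of UTF-16 text)")
    if b'"' in raw:
        problems.append("double quotes")
    return problems

# End of the header line from the file, encoded two ways for comparison.
header = "SNPs genotyped in population CEU on chr7:151969353..152009352\n"
as_ascii = describe_encoding_problems(header.encode("ascii"))
as_utf16 = describe_encoding_problems(header.encode("utf-16"))
```

For the real file the input would be `open(path, "rb").read()`. If the list comes back non-empty, re-saving the file as plain ASCII text without quotes would be the first thing to try.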

If anyone is familiar with formatting files for Haploview, could you give me any advice on how to fix this so that Haploview can read it?

Do you think this will even work, and is it worth it? Any notes on mistakes I made, suggestions or improvements would also be appreciated.