Question

Importing SNP and phenotype data from dbGaP into R


7.5 years ago

rdlady ▴ 40

I am working on a project to find associations between SNPs and certain phenotypes using data sets from the dbGaP database. I found some interesting data sets, downloaded them from dbGaP, then decrypted and extracted them.

This resulted in some folders with idat format files, gtc.txt files, and some phenotype files in xml format.

I would like to use this data as input for analysis in R with packages like SNPassoc, snpMatrix, or GenABEL.

The problem is that these R packages seem to expect a tab-delimited plain-text table containing the sample ID, phenotype data, SNP genotypes, etc. This format is very different from the idat, gtc.txt, and xml formats that I found in the dbGaP data.

Is there an R package or any script/program that can take all the dbGaP data (idat, gtc.txt, and phenotype info in xml) and generate summary tables like the ones required by these R packages?

Here are some examples of the files found in the dbGaP data that I have extracted:

gtc.txt:

SNP Name  GC Score    Allele1 - Top  Allele2 - Top  Allele1 - AB  Allele2 - AB  X                     Y                    Raw X  Raw Y
200003    0.9226053   A              A              A             A             0.934740661177471     0.0394069163635861   7614   1009
200006    0.80280876  G              G              B             B             0.03840060068975691   1.5842950219375036   788    19290
200047    0.7352572   A              A              A             A             0.42971193949905434   0.03922872128858323  3636   953
200050    0.789192    G              G              B             B             0.020351741593668694  1.0929231320570174   545    9315
200052    0.9563731   T              T              B             B             0.01696443095800867   0.9911898858364148   945    12561
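For illustration, here is a minimal Python sketch of how per-sample gtc.txt files like the excerpt above could be collapsed into one row per sample. It assumes the file is tab-delimited with the column names on its first line and that each file holds a single sample; real files may carry extra header sections that would need skipping first. The function names and the `sample_1` ID are hypothetical.

```python
import csv
import io


def read_gtc_txt(handle):
    """Return {snp_name: genotype} for one sample, e.g. {'200003': 'AA'}.

    Assumes a tab-delimited file whose first line holds the column names
    shown in the excerpt ("SNP Name", "Allele1 - Top", "Allele2 - Top").
    """
    reader = csv.DictReader(handle, delimiter="\t")
    genotypes = {}
    for row in reader:
        genotypes[row["SNP Name"]] = row["Allele1 - Top"] + row["Allele2 - Top"]
    return genotypes


def genotype_table(samples):
    """Turn {sample_id: {snp: genotype}} into rows: a header, then one row per sample."""
    snps = sorted({snp for calls in samples.values() for snp in calls})
    rows = [["sample_id"] + snps]
    for sample_id in sorted(samples):
        calls = samples[sample_id]
        # SNPs missing for a sample are written as NA
        rows.append([sample_id] + [calls.get(snp, "NA") for snp in snps])
    return rows


# Two rows copied from the excerpt; "sample_1" is a made-up sample ID.
example = (
    "SNP Name\tGC Score\tAllele1 - Top\tAllele2 - Top\t"
    "Allele1 - AB\tAllele2 - AB\tX\tY\tRaw X\tRaw Y\n"
    "200003\t0.9226053\tA\tA\tA\tA\t0.934740661177471\t0.0394069163635861\t7614\t1009\n"
    "200006\t0.80280876\tG\tG\tB\tB\t0.03840060068975691\t1.5842950219375036\t788\t19290\n"
)
calls = read_gtc_txt(io.StringIO(example))
table = genotype_table({"sample_1": calls})
for row in table:
    print("\t".join(row))
```

Writing the rows out with tab separators gives a plain-text table that R can load with `read.table`.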

phenotype xml:

<?xml-stylesheet type="text/xsl" href="varreports_v3.xsl"?>
<data_table name="MEC_XXXXXX_Subject" dataset_id="XXXXXX" study_name="A Multiethnic GWAS of XXXXXX" study_id="phs000306.v4" participant_set="1" date_created="04/10/2014">
<variable id="XXXXXXX.v2.p1" var_name="SUBJID" calculated_type="string" reported_type="integer">
<description>XXXXX ID</description>
<total><subject_profile><sex><male>9454</male><female>13</female></sex></subject_profile><stats><stat n="9482" nulls="0"/></stats></total>
</variable>
<variable id="XXXXX.v2.p1.c1" var_name="SUBJID" calculated_type="string" reported_type="integer">
<description>XXXX ID</description>
<total><subject_profile><sex><male>2467</male></sex></subject_profile><stats>
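Judging from the excerpt, this XML appears to hold per-variable summaries (counts by sex, `n`, `nulls`) rather than per-subject values, which would explain why it cannot be matched to individual samples. A minimal Python sketch for listing the variables in such a file follows. The element and attribute names are taken from the excerpt; the IDs and description in the example document are made up, and `list_variables` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def list_variables(xml_text):
    """Return one dict per <variable> element with its id, name, description and n."""
    root = ET.fromstring(xml_text)
    variables = []
    for var in root.iter("variable"):
        stat = var.find("./total/stats/stat")
        n = stat.get("n") if stat is not None else None
        variables.append({
            "id": var.get("id"),
            "name": var.get("var_name"),
            "description": var.findtext("description", default=""),
            "n": int(n) if n is not None else None,
        })
    return variables


# Small well-formed document shaped like the excerpt; the id is a placeholder.
example = """<data_table name="EXAMPLE_Subject" study_id="phs000000.v1">
  <variable id="phv0.v1.p1" var_name="SUBJID" calculated_type="string" reported_type="integer">
    <description>Subject ID</description>
    <total>
      <subject_profile><sex><male>9454</male><female>13</female></sex></subject_profile>
      <stats><stat n="9482" nulls="0"/></stats>
    </total>
  </variable>
</data_table>"""

variables = list_variables(example)
for v in variables:
    print(v["name"], v["description"], v["n"], sep="\t")
```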

idat is a binary format and can't be read as plain text.

R genome SNP • 2.9k views


score 0 · Answer 1 · 2016-11-09


Does anyone know how to analyze dbGaP data in R?


score 0 · Answer 2 · 2016-11-09


I have decrypted the dbGaP files, but now the problem is that I can't map the phenotype files to the genotype files. I have a lot of information on SNPs, but I don't know whom it belongs to (cases or controls, male or female, age, etc.). Does anyone know how to map the genotypes to phenotypes in the dbGaP data sets?



Sorry, but I have a question.

Do I need a special account to access the dbGaP database?

Thank you so much!

7.5 years ago by 496527 ▴ 10


You should probably request an account to access all the content of the dbGaP database, because some datasets are not open to the public. In my case I had to request an account because I needed access to these closed datasets.

7.3 years ago by rdlady ▴ 40

score 0 · Answer 3 · 2017-01-11

I never managed to use those XML files as a source of phenotype data, but when I requested the download of my dbGaP dataset again, I found phenotype data files in a very simple tabular text format. For some reason, only the XML files were available when I made the first download request. With the tabular text files I was able to extract the phenotype data very easily, using R's GenABEL package.
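Once the phenotypes are in a tabular text file, mapping them to genotypes comes down to a join on the subject ID. The Python sketch below illustrates that idea under several assumptions: the phenotype file is tab-delimited, its ID column is called `SUBJID` (the variable name seen in the XML excerpt), and any lines starting with `#` are comments. If the genotype files are keyed by a sample ID that differs from `SUBJID`, a lookup between the two is needed; the `sample_to_subject` dict stands in for it here. All values and helper names are made up for illustration.

```python
import csv
import io


def read_phenotypes(handle, id_column="SUBJID"):
    """Read a tab-delimited phenotype table into {subject_id: row_dict}.

    Blank lines and lines starting with '#' are skipped (assumed to be comments).
    """
    lines = (line for line in handle if line.strip() and not line.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t")
    return {row[id_column]: row for row in reader}


def merge(phenotypes, genotypes, sample_to_subject):
    """Attach each sample's genotype calls to the matching phenotype row."""
    merged = []
    for sample_id, calls in sorted(genotypes.items()):
        subject_id = sample_to_subject.get(sample_id)
        pheno = phenotypes.get(subject_id)
        if pheno is None:
            continue  # no phenotype record for this sample
        merged.append({**pheno, **calls})
    return merged


# Made-up phenotype table, genotype calls and ID lookup.
phenotype_text = (
    "# illustrative data only\n"
    "SUBJID\tSEX\tAGE\tCASE\n"
    "1001\tM\t61\t1\n"
    "1002\tF\t58\t0\n"
)
genotypes = {
    "S1": {"200003": "AA", "200006": "GG"},
    "S2": {"200003": "AA", "200006": "AG"},
}
sample_to_subject = {"S1": "1001", "S2": "1002"}

phenotypes = read_phenotypes(io.StringIO(phenotype_text))
merged = merge(phenotypes, genotypes, sample_to_subject)
for row in merged:
    print("\t".join(row.values()))
```

Each merged row then has the shape described in the question: subject ID, phenotype columns, and one column per SNP.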