Question: Problems with MAP/PED files manipulation
Cindy Chan (United Kingdom) wrote, 5.0 years ago:

Hi all,

I'm very new to GWAS. I have been given a set of PLINK files covering the gene coordinates I'm interested in, but I need to clean up the data first. Some of the things I'm currently confused about:

  1. The files seemed to have originated from phased data. Each ID is repeated twice, e.g. GSM0001_A, GSM0001_B. Does that mean they are the same samples but different chromosome? I don't know what to do with it. Or is it better if I start with unphased data?
  2. The samples only have an IID (repeated twice in the PED files). There is no FID, maternal or paternal ID, or sex. I have also been given a separate Excel file that contains the sex of each sample. Can I include the sex in my PED files (refer to the question above for some info)?
  3. The MAP files don't come with SNP identifiers, only the location (e.g. 2:112774105). Is it possible for me to include rs# in the MAP file? I would like to check if the existing SNPs on the genes I'm interested in are also found in the dataset. What can I do if I couldn't include the rs# in MAP file?
  4. What other file formats are used in GWAS that would give me more control over what I want to analyse in future? I'm very confused right now and feel constrained in what I can do...
  5. I need to clean up my data before I could run any analysis, as there are duplicated samples, samples which are closely related and samples without geographical information. All I have with me, besides the PLINK files, is an Excel spreadsheet which contains the information for all the samples. What would you suggest I do?

Any form of advice will be greatly appreciated.

Thanks!!

ped file snp plink gwas map file • 4.8k views
Philipp Bayer (Australia/Perth/UWA) wrote, 5.0 years ago:

PLINK can do tons of things, so it's no wonder that you're feeling a bit lost. I remember being completely overwhelmed when I started out with it.

>The files seemed to have originated from phased data. Each ID is repeated twice, e.g. GSM0001_A, GSM0001_B. Does that mean they are the same samples but different chromosome? I don't know what to do with it. Or is it better if I start with unphased data?

Not sure if I understand correctly - PED files need a family ID and an individual ID, in absence of a family ID most people just repeat the individual ID twice. Maybe this is what you see?

>The samples only have an IID (repeated twice in the PED files). There is no FID, maternal or paternal ID, or sex. I have also been given a separate Excel file that contains the sex of each sample. Can I include the sex in my PED files (refer to the question above for some info)?

Yes, definitely! So it looks like this right now?

GSM0001_A GSM0001_B  A A  G G  A C ..... 

In that case, I'd add the sex to each line, along with 0s for the paternal and maternal IDs. In the line below, sex is 1 (male; 2 is female, everything else is unknown) and the single phenotype is -9 (missing). I always give the phenotypes in an additional file as a normal spreadsheet.

GSM0001_A GSM0001_B 0 0 1 -9 A A  G G  A C ..... 

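That way, you can correct for sex when you run a regression by using the --sex flag. Note that PLINK expects the phenotype file to be whitespace-delimited, with the family ID and individual ID as the first two columns, whatever the file extension. For example, to run a logistic regression, correcting for sex, on all of the phenotypes in a file called 'your_pheno_file.csv':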

plink --file your_files --pheno your_pheno_file.csv --sex --logistic --adjust --out your_results --all-pheno
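Filling in the sex column from the spreadsheet can be done with a small script. Here is a minimal Python sketch, assuming the PED file already has the six leading columns and that the Excel sheet has been turned into a lookup from individual ID to 'M' or 'F'. The function names and the `sex_by_id` dict are made up for illustration; adjust the codes to whatever your spreadsheet actually uses.

```python
def add_sex_to_ped_line(line, sex_by_id):
    """Return a PED line with the sex column (5th field) filled in.

    sex_by_id is a hypothetical dict mapping individual ID -> 'M' or 'F',
    e.g. built from the Excel sheet exported as CSV.
    """
    # PLINK sex codes: 1 = male, 2 = female, anything else = unknown
    codes = {"M": "1", "F": "2"}
    fields = line.split()
    iid = fields[1]  # column 2 is the individual ID
    # Individuals missing from the lookup are written as 0 (unknown)
    fields[4] = codes.get(sex_by_id.get(iid, ""), "0")
    return " ".join(fields)


def add_sex_to_ped(ped_in, ped_out, sex_by_id):
    """Rewrite a whole PED file, one line at a time."""
    with open(ped_in) as src, open(ped_out, "w") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                dst.write(add_sex_to_ped_line(line, sex_by_id) + "\n")
```

Write to a new file rather than over the original, so you can compare the two before running anything on the result.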

>The MAP files don't come with SNP identifiers, only the location (e.g. 2:112774105). Is it possible for me to include rs# in the MAP file? I would like to check if the existing SNPs on the genes I'm interested in are also found in the dataset. What can I do if I couldn't include the rs# in MAP file?

That happened to me just recently. I used KAVIAR to get the rs# for all my SNPs: just copy and paste the chromosome and position from Excel into here: http://db.systemsbiology.net/kaviar/cgi-pub/Kaviar.pl

Then you can use Excel or a small script to insert your rsids. Keep in mind that not all SNPs have rsids!
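The "small script" could look roughly like the Python sketch below. It assumes a standard four-column MAP file (chromosome, SNP identifier, genetic distance, base-pair position) and a lookup from "chr:pos" to rsID that you have built yourself from the KAVIAR output. The function name, the `rsid_by_pos` dict and the placeholder ID in the usage note are hypothetical.

```python
def add_rsid_to_map_line(line, rsid_by_pos):
    """Replace the SNP identifier (2nd field) of a MAP line with an rsID.

    rsid_by_pos is a hypothetical dict mapping 'chrom:position' -> rsID.
    SNPs without a known rsID keep their original identifier.
    """
    fields = line.split()
    chrom, bp = fields[0], fields[3]  # MAP columns: chr, SNP id, cM, bp
    key = f"{chrom}:{bp}"
    fields[1] = rsid_by_pos.get(key, fields[1])
    return "\t".join(fields)
```

For example, `add_rsid_to_map_line("2 2:112774105 0 112774105", {"2:112774105": "rs_placeholder"})` swaps the identifier for the placeholder, while a position missing from the lookup is left untouched.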

>I need to clean up my data before I could run any analysis, as there are duplicated samples, samples which are closely related and samples without geographical information. All I have with me, besides the PLINK files, is an Excel spreadsheet which contains the information for all the samples. What would you suggest I do?

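If you have many SNPs, clean them using minor allele frequency and HWE, at the very least. The --hwe filter takes a p-value threshold; 0.001 below is only an example, so pick one that suits your data.

plink --file your_files --maf 0.05 --hwe 0.001 --recode --out your_cleaned_files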

To also remove individuals with mostly missing data (again with an example --hwe threshold of 0.001):

plink --file your_files --maf 0.05 --hwe 0.001 --mind 0.8 --recode --out your_cleaned_files

This removes individuals with more than 80% missing genotypes.
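If you want to see what that per-individual missingness looks like before filtering, here is a rough Python sketch. It assumes space- or tab-delimited PED lines with six leading columns and 0 as the missing-allele code (PLINK's default). The function name is made up, and this only illustrates the idea; it is not how PLINK computes it internally.

```python
def missing_rate(ped_line):
    """Fraction of genotypes on a PED line that are missing (coded as 0)."""
    fields = ped_line.split()
    alleles = fields[6:]  # everything after the six leading columns
    # Alleles come in pairs: one pair per SNP
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    if not genotypes:
        return 0.0
    missing = sum(1 for a, b in genotypes if a == "0" or b == "0")
    return missing / len(genotypes)
```

An individual with two missing genotypes out of four would get 0.5 here, which is below a 0.8 cut-off and so would be kept.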

If you have population stratification, you can use Plink's own IBS clustering to correct for that:

plink --file your_cleaned_files --cluster --ppc 0.01

This will create 4 files with clusters. Check them manually to see whether they conform to what you expect. Then to use these clusters in a different GWAS:

plink --file your_cleaned_files --mh --within plink.cluster2

I'm currently unsure whether it was cluster2 or cluster3 - just run it and have a look at the log, it should say 'X individuals assigned to Y clusters', where X and Y make sense.

You can also use STRUCTURE or EIGENSTRAT to correct for population stratification. I personally prefer the latter because the pictures are prettier :) EIGENSTRAT also takes your ped files. You can feed its output (the principal components) into PLINK as covariates, have a look here: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#covar

You can also play with GAPIT or TASSEL, which run analyses similar to PLINK, but are a bit easier to use.

I might have typos in the above commands, I haven't tested them right now

written 5.0 years ago by Philipp Bayer

Dear Philipp,

Thanks for the reply...

The PED files looked like this:

GS00001-ASM_A GS00001-ASM_A 0 0 0 1 C C T T G G 

GS00001-ASM_B GS00001-ASM_B 0 0 0 1 C C C C A A 

That's why I am confused...

written 5.0 years ago by Cindy Chan

That looks good! Here's the manual for the PED format: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped

     Family ID (GS00001-ASM_A)
     Individual ID (GS00001-ASM_A) 
     Paternal ID (0)
     Maternal ID (0)
     Sex (1=male; 2=female; other=unknown) (0)
     Phenotype (1)
     followed by SNPs.

Like I wrote above, most people just use the same ID for family and individual, since you rarely get well-defined families. Paternal and maternal are set to missing, which is what most people do. Sex is set to missing too - since you have the gender in another table, you might want to fix that. The phenotype is set to 1, unaffected (2 is affected, -9 and 0 are missing). Like I wrote above, I usually set that phenotype to 0 and make my own additional table of phenotypes as described here: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#pheno

Looks good to me!
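If it helps to see the column layout spelled out, here is a small Python sketch that splits a PED line into the six leading fields plus the alleles. The `PedRecord` type and `parse_ped_line` function are just illustrative names, and it assumes a whitespace-delimited line with all six leading columns present.

```python
from typing import List, NamedTuple


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str        # 1 = male, 2 = female, other = unknown
    phenotype: str  # 1 = unaffected, 2 = affected, 0 or -9 = missing
    alleles: List[str]


def parse_ped_line(line):
    """Split one PED line into its six leading columns and the allele calls."""
    fields = line.split()
    return PedRecord(*fields[:6], alleles=fields[6:])
```

Running it on your first line, `parse_ped_line("GS00001-ASM_A GS00001-ASM_A 0 0 0 1 C C T T G G")`, gives a record with sex "0", phenotype "1" and six alleles (three SNPs).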

written 5.0 years ago by Philipp Bayer

Thanks! I've been reading the PLINK documentation. I just find it tough since there's no "search" function on the website.

Just wondering: I have phased data, so does that mean each of my samples is analysed twice? I'm trying to figure out how PLINK works...

thanks!

written 5.0 years ago by Cindy Chan

This older thread on biostars has several good explanations of phased vs unphased data, better than I could explain: What Are Phased And Unphased Genotypes?

written 5.0 years ago by Philipp Bayer