Problems with MAP/PED files manipulation
9.6 years ago
Cindy Chan ▴ 20

Hi all,

I'm very new to GWAS. I have been given a set of PLINK files covering the gene coordinates I'm interested in, but I need to do some data cleanup first. Some of the things I'm currently confused about:

  1. The files seemed to have originated from phased data. Each ID is repeated twice, e.g. GSM0001_A, GSM0001_B. Does that mean they are the same samples but different chromosome? I don't know what to do with it. Or is it better if I start with unphased data?
  2. The samples only have an IID (repeated twice in the PED files). There is no FID, maternal or paternal ID, or sex. I have also been given a separate Excel file that contains the sex of each sample. Can I include the sex in my PED files (see the question above for some background)?
  3. The MAP files don't come with SNP identifiers, only the location (e.g. 2:112774105). Is it possible for me to include rs# in the MAP file? I would like to check if the existing SNPs on the genes I'm interested in are also found in the dataset. What can I do if I couldn't include the rs# in MAP file?
  4. What other file formats are used in GWAS that would give me more control over what I analyse in future? I'm very confused right now and feel constrained in what I can do...
  5. I need to clean up my data before I can run any analysis, as there are duplicated samples, closely related samples, and samples without geographical information. All I have, besides the PLINK files, is an Excel spreadsheet that contains the information for all the samples. What would you suggest I do?

Any form of advice will be greatly appreciated.

Thanks!

GWAS SNP PED MAP PLINK • 7.1k views
9.6 years ago

PLINK can do tons of things, so it's no wonder you're feeling a bit lost. When I started out with it I was completely overwhelmed too.

The files seemed to have originated from phased data. Each ID is repeated twice, e.g. GSM0001_A, GSM0001_B. Does that mean they are the same samples but different chromosome? I don't know what to do with it. Or is it better if I start with unphased data?

Not sure I understand correctly. PED files need a family ID and an individual ID; in the absence of a family ID, most people just repeat the individual ID twice. Maybe this is what you're seeing?

The samples only have an IID (repeated twice in the PED files). There is no FID, maternal or paternal ID, or sex. I have also been given a separate Excel file that contains the sex of each sample. Can I include the sex in my PED files (see the question above for some background)?

Yes, definitely! So it looks like this right now?

GSM0001_A GSM0001_B  A A  G G  A C ..... 

In that case, I'd add sex to all of them, and add 0s for the paternal and maternal IDs. The line below uses 1 (male) for sex (2 is female, anything else is unknown) and -9 (missing) for the single phenotype column. I always supply the phenotypes in a separate file laid out like a normal spreadsheet.

GSM0001_A GSM0001_B 0 0 1 -9 A A  G G  A C ..... 

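That way, you can correct for sex when you run a regression by adding the --sex flag. For example, to run a logistic regression, correcting for sex, on all of the phenotypes in a file called 'your_pheno_file.csv' (despite the name, PLINK expects this file to be whitespace-delimited, with family ID and individual ID as the first two columns):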

plink --file your_files --pheno your_pheno_file.csv --sex --logistic --adjust --out your_results --all-pheno
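
Since the phenotypes start out in a spreadsheet, a small script can turn a CSV export into the whitespace-delimited layout PLINK reads with --pheno. The following is only a minimal sketch: the column name `IID`, the trait names, and the sample IDs are assumptions for illustration, and the individual ID is reused as the family ID because no family information is available.

```python
import csv
import io

def csv_to_plink_pheno(csv_text, id_column="IID", missing="-9"):
    """Convert CSV text (e.g. exported from Excel) into PLINK's
    whitespace-delimited phenotype layout: FID IID pheno1 pheno2 ..."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pheno_names = [name for name in reader.fieldnames if name != id_column]
    # A header row is allowed as long as it starts with FID and IID.
    lines = [" ".join(["FID", "IID"] + pheno_names)]
    for row in reader:
        iid = row[id_column].strip()
        # Empty cells become the missing-phenotype code (-9 by default).
        values = [(row[name] or "").strip() or missing for name in pheno_names]
        # No family information, so reuse the individual ID as the family ID.
        lines.append(" ".join([iid, iid] + values))
    return "\n".join(lines) + "\n"

# Hypothetical spreadsheet export with two traits and one empty cell.
example_csv = "IID,trait1,trait2\nGSM0001_A,2,1\nGSM0002_A,1,\n"
pheno_text = csv_to_plink_pheno(example_csv)
print(pheno_text, end="")
```

Writing `pheno_text` to a file and passing that file to --pheno should then work with the command above; column names containing spaces would need to be renamed first.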

The MAP files don't come with SNP identifiers, only the location (e.g. 2:112774105). Is it possible for me to include rs# in the MAP file? I would like to check if the existing SNPs on the genes I'm interested in are also found in the dataset. What can I do if I couldn't include the rs# in MAP file?

That happened to me just recently. I used KAVIAR to get the rs# for all my SNPs; just copy and paste the chromosome and position from Excel into here: http://db.systemsbiology.net/kaviar/cgi-pub/Kaviar.pl

Then you can use Excel or a small script to insert your rsids. Keep in mind that not all SNPs have rsids!
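
The "small script" could look something like the sketch below. It assumes the standard four-column MAP layout (chromosome, variant ID, genetic distance, base-pair position) and a lookup table keyed by (chromosome, position) that you have built yourself from the KAVIAR results. The function name and the rsID shown are made-up placeholders, not real annotations.

```python
def add_rsids_to_map(map_lines, rsid_lookup):
    """Replace the variant ID (column 2) of each MAP line with an rsID when
    the lookup has one for that (chromosome, position); otherwise keep the
    existing ID, since not every SNP has an rsID."""
    annotated = []
    for line in map_lines:
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 MAP columns, got: {line!r}")
        chrom, variant_id, cm, bp = fields
        new_id = rsid_lookup.get((chrom, int(bp)), variant_id)
        annotated.append("\t".join([chrom, new_id, cm, bp]))
    return annotated

# Two example MAP lines; only the first position has an entry in the lookup.
map_lines = [
    "2 2:112774105 0 112774105",
    "2 2:112774200 0 112774200",
]
rsid_lookup = {("2", 112774105): "rsPLACEHOLDER"}  # placeholder, not a real rsID

annotated = add_rsids_to_map(map_lines, rsid_lookup)
for line in annotated:
    print(line)
```

Because the order and number of lines stay the same, the PED file does not need to change.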

I need to clean up my data before I can run any analysis, as there are duplicated samples, closely related samples, and samples without geographical information. All I have, besides the PLINK files, is an Excel spreadsheet that contains the information for all the samples. What would you suggest I do?

If you have many SNPs, clean them using minor allele frequency and HWE, at the very least. Note that --hwe needs a p-value threshold; the 0.001 below is just a commonly used starting value, so adjust it for your data.

plink --file your_files --maf 0.05 --hwe 0.001 --recode --out your_cleaned_files

To also remove mostly empty individuals:

plink --file your_files --maf 0.05 --hwe 0.001 --mind 0.8 --recode --out your_cleaned_files

This will remove individuals with more than 80% missing genotypes.
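
To make the --maf and --mind filters more concrete, here is a rough sketch of the two quantities they threshold on, computed directly from PED lines. It is only an illustration with made-up samples: PLINK's own calculation differs in details (for example, it computes allele frequencies on founders only), and the helper names are mine.

```python
from collections import Counter

MISSING = "0"  # PLINK's default missing-allele code in PED files

def parse_ped_genotypes(ped_line):
    """Return the genotypes of one PED line as a list of (allele1, allele2)."""
    alleles = ped_line.split()[6:]  # skip FID, IID, PID, MID, sex, phenotype
    return list(zip(alleles[0::2], alleles[1::2]))

def individual_missing_rate(genotypes):
    """Fraction of one individual's genotypes that are missing (cf. --mind)."""
    n_missing = sum(1 for a, b in genotypes if MISSING in (a, b))
    return n_missing / len(genotypes)

def minor_allele_frequency(snp_genotypes):
    """Frequency of the rarer allele at one SNP among non-missing genotypes
    (cf. --maf). Monomorphic or fully missing SNPs give 0.0."""
    counts = Counter(
        allele
        for a, b in snp_genotypes
        if MISSING not in (a, b)
        for allele in (a, b)
    )
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / sum(counts.values())

# Three made-up individuals genotyped at three SNPs.
ped_lines = [
    "S1 S1 0 0 1 -9 A A G G 0 0",
    "S2 S2 0 0 2 -9 A C G G A C",
    "S3 S3 0 0 1 -9 A A 0 0 0 0",
]
per_individual = [parse_ped_genotypes(line) for line in ped_lines]
missing_rates = [individual_missing_rate(g) for g in per_individual]
per_snp = list(zip(*per_individual))  # transpose: one tuple of genotypes per SNP
mafs = [minor_allele_frequency(s) for s in per_snp]
print(missing_rates, mafs)
```

In this toy data the highest missing rate is two out of three genotypes, so a threshold of --mind 0.8 would not remove anyone, while the second SNP is monomorphic and would be dropped by --maf 0.05.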

If you have population stratification, you can use PLINK's own IBS clustering to correct for that:

plink --file your_cleaned_files --cluster --ppc 0.01

This will create 4 files with clusters (plink.cluster0 to plink.cluster3). Check them manually to see whether they conform to what you expect. Then, to use these clusters in a subsequent association run, here a Cochran-Mantel-Haenszel test:

plink --file your_cleaned_files --mh --within plink.cluster2

If I remember the PLINK documentation correctly, plink.cluster2 is the file formatted for --within (one individual per line). Either way, just run it and have a look at the log; it should say 'X individuals assigned to Y clusters', where X and Y make sense.

You can also use STRUCTURE or EIGENSTRAT to correct for population stratification. I personally prefer the latter because the pictures are prettier :) EIGENSTRAT also takes your PED files. You can feed the resulting principal components into PLINK as covariates; have a look here: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#covar

You can also play with GAPIT or TASSEL, which run analyses similar to PLINK, but are a bit easier to use.

There may be typos in the above commands; I haven't tested them just now.


Dear Philipp,

Thanks for the reply...

The PED files looked like this:

GS00001-ASM_A GS00001-ASM_A 0 0 0 1 C C T T G G
GS00001-ASM_B GS00001-ASM_B 0 0 0 1 C C C C A A

That's why I am confused...


That looks good! Here's the manual for the PED format: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped

     Family ID (GS00001-ASM_A)
     Individual ID (GS00001-ASM_A) 
     Paternal ID (0)
     Maternal ID (0)
     Sex (1=male; 2=female; other=unknown) (0)
     Phenotype (1)
     followed by SNPs.

As I wrote above, most people just use the same ID for family and individual, since you rarely get well-defined families. The paternal and maternal IDs are set to missing, which is what most people do. Sex is set to missing too; since you have the sex in another table, you might want to fix that. The phenotype is set to 1, unaffected (2 is affected, -9 and 0 are missing). I usually set that phenotype to missing (0 or -9) and make my own additional table of phenotypes as described here: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#pheno

Looks good to me!
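
Filling in the sex column from the separate table could be scripted roughly as follows. This is a sketch under assumptions: the PED lines have the six leading columns listed above, the table has been loaded into a dictionary keyed by individual ID, and the labels are spelled like "Male"/"Female". The sex assigned in the example is invented for illustration.

```python
SEX_CODES = {"male": "1", "m": "1", "female": "2", "f": "2"}

def set_ped_sex(ped_lines, sex_by_iid):
    """Return PED lines with column 5 (sex) filled in from a lookup keyed by
    individual ID. Lines without a usable entry keep their current value."""
    updated = []
    for line in ped_lines:
        fields = line.split()
        if len(fields) < 6:
            raise ValueError(f"expected at least 6 PED columns, got: {line!r}")
        label = sex_by_iid.get(fields[1], "")  # fields[1] is the individual ID
        fields[4] = SEX_CODES.get(label.strip().lower(), fields[4])
        updated.append(" ".join(fields))
    return updated

# The two PED lines from this thread; the sex label is a hypothetical value.
ped_lines = [
    "GS00001-ASM_A GS00001-ASM_A 0 0 0 1 C C T T G G",
    "GS00001-ASM_B GS00001-ASM_B 0 0 0 1 C C C C A A",
]
sex_by_iid = {"GS00001-ASM_A": "Male"}

updated = set_ped_sex(ped_lines, sex_by_iid)
for line in updated:
    print(line)
```

If the IDs in the spreadsheet lack the _A/_B suffix, the lookup key would need to be adjusted accordingly before matching.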


Thanks! I've been reading the PLINK documentation. I just find it tough since there's no "search" function on the website.

Just wondering: I have phased data, so does that mean each of my samples is analysed twice? I'm trying to figure out how PLINK works...

thanks!


This older thread on biostars has several good explanations of phased vs unphased data, better than I could explain: What Are Phased And Unphased Genotypes?
