PLINK can do a lot, so it's no wonder you're feeling a bit lost. I remember being completely overwhelmed when I started out with it.
>The files seemed to have originated from phased data. Each ID is repeated twice, e.g. GSM0001_A, GSM0001_B. Does that mean they are the same samples but different chromosome? I don't know what to do with it. Or is it better if I start with unphased data?
I'm not sure I understand correctly. PED files need a family ID and an individual ID; in the absence of a family ID, most people just repeat the individual ID. Maybe this is what you're seeing?
>The samples only have IID (repeated twice in the PED files) There are no FID, maternal or paternal ID, sex). I am also given a separate Excel file that contains the sex of each sample. Can I include the sex into my PED files (refer to above question for some info)?
Yes, definitely! So it looks like this right now?
GSM0001_A GSM0001_B A A G G A C .....
In that case, I'd add the missing columns to every line: 0 for the paternal and maternal IDs, then the sex code (1 is male, 2 is female, everything else is unknown), and -9 (missing) for the single phenotype column. I always supply the actual phenotypes in a separate file, laid out like a normal spreadsheet with one row per sample. For a male sample, the line becomes:
GSM0001_A GSM0001_B 0 0 1 -9 A A G G A C .....
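If you have many samples, a small script saves you from editing every line by hand. Here is a minimal sketch, assuming your PED lines currently hold only the two ID columns followed by the genotypes, and that you have already exported the sex column from Excel into a lookup keyed by the second ID column. The function name and the `sex_by_iid` dictionary are made up for illustration.

```python
def add_ped_columns(line, sex_by_iid):
    """Insert paternal ID, maternal ID, sex and phenotype after the two ID columns."""
    fields = line.split()
    fid, iid, genotypes = fields[0], fields[1], fields[2:]
    # 1 = male, 2 = female; fall back to 0 (unknown) if the sample is not in the lookup
    sex = sex_by_iid.get(iid, 0)
    # 0 0 = no parents recorded, -9 = missing phenotype
    return " ".join([fid, iid, "0", "0", str(sex), "-9"] + genotypes)


# Hypothetical lookup built from the Excel sheet
sex_by_iid = {"GSM0001_B": 1}
print(add_ped_columns("GSM0001_A GSM0001_B A A G G A C", sex_by_iid))
# prints: GSM0001_A GSM0001_B 0 0 1 -9 A A G G A C
```

Run it over every line of the PED file and write the results to a new file, so the original stays untouched.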
That way, you can adjust for sex in a regression using the --sex flag. For example, to run a logistic regression, adjusted for sex, on every phenotype in a file called 'your_pheno_file.txt' (plain text, whitespace-delimited, with the family ID and individual ID as the first two columns):
plink --file your_files --pheno your_pheno_file.txt --sex --logistic --adjust --out your_results --all-pheno
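One thing to watch: PLINK reads the phenotype file as whitespace-delimited text with the family ID and individual ID in the first two columns, so a comma-separated export from Excel needs converting first. A minimal sketch, assuming the CSV has a header row starting with FID and IID, that no cell contains spaces, and that empty cells should become -9 (the function name is made up):

```python
import csv
import io


def csv_to_plink_pheno(csv_text, missing="-9"):
    """Turn comma-separated text into tab-separated text, filling empty cells."""
    out_lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Empty cells are replaced with the missing-phenotype code
        cells = [cell.strip() if cell.strip() else missing for cell in row]
        out_lines.append("\t".join(cells))
    return "\n".join(out_lines) + "\n"


example = "FID,IID,trait1\nGSM0001_A,GSM0001_B,\n"
print(csv_to_plink_pheno(example), end="")
```

Saving the sheet from Excel as tab-delimited text gets you most of the way there too; the script just also takes care of the empty cells.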
>The MAP files don't come with SNP identifiers, only the location (e.g. 2:112774105). Is it possible for me to include rs# in the MAP file? I would like to check if the existing SNPs on the genes I'm interested in are also found in the dataset. What can I do if I couldn't include the rs# in MAP file?
That happened to me just recently. I used KAVIAR to get the rs# for all my SNPs: just copy and paste the chromosome and position columns from Excel into here: http://db.systemsbiology.net/kaviar/cgi-pub/Kaviar.pl
Then you can use Excel or a small script to insert your rsids. Keep in mind that not all SNPs have rsids!
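The "small script" could look something like this. It is only a sketch, assuming a standard four-column MAP file (chromosome, SNP identifier, genetic distance, base-pair position) and a lookup from "chromosome:position" to rsID that you built from the KAVIAR output. The function name is made up, and the rsID in the example is a placeholder, not a real identifier.

```python
def annotate_map_line(line, rsid_by_location):
    """Replace the SNP identifier with an rsID when one is known for chr:pos."""
    chrom, snp_id, genetic_distance, position = line.split()
    key = f"{chrom}:{position}"
    # SNPs without an rsID keep their original identifier
    new_id = rsid_by_location.get(key, snp_id)
    return "\t".join([chrom, new_id, genetic_distance, position])


# Hypothetical lookup; "rsPLACEHOLDER" stands in for whatever KAVIAR returns
rsid_by_location = {"2:112774105": "rsPLACEHOLDER"}
print(annotate_map_line("2 2:112774105 0 112774105", rsid_by_location))
print(annotate_map_line("2 2:112774200 0 112774200", rsid_by_location))
```

Keeping the old identifier as a fallback means every SNP still has a unique name, which matters because the MAP and PED files have to stay in the same order with the same number of SNPs.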
>I need to clean up my data before I could run any analysis, as there are duplicated samples, samples which are closely related and samples without geographical information. All I have with me, besides the PLINK files, is an Excel spreadsheet which contains the information for all the samples. What would you suggest I do?
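If you have many SNPs, filter them on minor allele frequency and Hardy-Weinberg equilibrium (HWE), at the very least. Note that --hwe needs a p-value threshold; the 0.001 here is just an example, so pick one that suits your data:
plink --file your_files --maf 0.05 --hwe 0.001 --recode --out your_cleaned_files
To also remove individuals with mostly missing data:
plink --file your_files --maf 0.05 --hwe 0.001 --mind 0.8 --recode --out your_cleaned_files
This removes individuals with more than 80% missing genotypes.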
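If you want to see which samples are close to that cutoff before filtering, you can compute the missing rate yourself. This is a rough sketch of the idea, not PLINK's own code, assuming a full six-column PED header (family ID, individual ID, parents, sex, phenotype) and missing alleles coded as 0; the function name is made up.

```python
def missing_rate(ped_line):
    """Fraction of genotypes on one PED line with a missing (0) allele."""
    alleles = ped_line.split()[6:]  # skip the six leading ID/sex/phenotype columns
    genotypes = list(zip(alleles[0::2], alleles[1::2]))  # pair up the two alleles per SNP
    if not genotypes:
        return 0.0
    missing = sum(1 for a, b in genotypes if a == "0" or b == "0")
    return missing / len(genotypes)


line = "GSM0001_A GSM0001_B 0 0 1 -9 A A 0 0 0 0 0 0 0 0"
print(missing_rate(line))  # 4 of 5 genotypes missing
```

A sample like this one sits exactly at 0.8, so with --mind 0.8 it would be kept, since only samples above the threshold are dropped.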
If you have population stratification, you can use PLINK's own IBS clustering to correct for that:
plink --file your_cleaned_files --cluster --ppc 0.01
This will create four files with clusters. Check them manually to see whether they match what you expect. Then, to use these clusters in a stratified (Cochran-Mantel-Haenszel) association test:
plink --file your_cleaned_files --mh --within plink.cluster2
I can't remember offhand whether it was cluster2 or cluster3. Just run it and have a look at the log: it should say 'X individuals assigned to Y clusters', where X and Y make sense.
You can also use STRUCTURE or EIGENSTRAT to correct for population stratification. I personally prefer the latter because the pictures are prettier :) EIGENSTRAT also takes your PED files. You can then feed the resulting principal components into PLINK as covariates; have a look here: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#covar
You can also play with GAPIT or TASSEL, which run analyses similar to PLINK, but are a bit easier to use.
I might have typos in the above commands; I haven't tested them just now.