GWAS analysis using PLINK
7.5 years ago
ngsgene

I have a GenomeStudio genotype file with missing genotypes denoted by -

From this file I generated the map, fam and lgen files for each chromosome, then converted them to ped format with PLINK's --recode option. To get past the PLINK error "Locus has >2 alleles", I used the --missing-genotype option with -.

After the ped files for each chromosome were successfully generated, I am facing a couple of issues:

My lgen file corresponds to the map file, but after recode the ped file has way more columns than expected. I expect the number of columns to be twice the number of rows in the map file (two alleles per SNP).

When I try to merge all the chromosomes to evaluate summary statistics, the - in the data doesn't seem to be excluded and PLINK continues to give errors.

Would converting all the - to 0 be the solution here? I am trying to understand how to exclude such data, and what the best practices are.

Thanks for any suggestions/feedback.

plink gwas merge missing-genotype
7.5 years ago
1. You probably want to use both --missing-genotype - and --output-missing-genotype 0 during your conversion; this tells PLINK that the input fileset uses -, but you want the output fileset to use 0 so you don't have more headaches down the line.
2. Can you explain what you mean by the "ped file has way more columns than [you expected]"? How many columns does it have? How many rows does the map file have?
3. Is there any particular reason you are converting to .ped/.map instead of PLINK's preferred .bed/.bim/.fam format?
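
As a rough sketch of point 1, the conversion command could be assembled as below. The prefixes `chr1` and `chr1_recoded` are placeholders rather than names from this thread, and the helper function is purely illustrative. `--lfile` is assumed as the input flag because the source files are .lgen/.map/.fam. The script only prints the command; it does not run PLINK.

```python
import shlex


def build_recode_command(prefix, out_prefix):
    """Return the PLINK argument list for an .lgen -> .ped/.map conversion."""
    return [
        "plink",
        "--lfile", prefix,                  # reads prefix.lgen, prefix.map, prefix.fam
        "--missing-genotype", "-",          # input marks missing calls with '-'
        "--output-missing-genotype", "0",   # output marks missing calls with '0'
        "--recode",                         # write text .ped/.map
        "--out", out_prefix,
    ]


if __name__ == "__main__":
    # 'chr1' and 'chr1_recoded' are hypothetical file prefixes.
    print(shlex.join(build_recode_command("chr1", "chr1_recoded")))
```

Swapping `--recode` for `--make-bed` in the same list would write .bed/.bim/.fam directly, which is the format point 3 refers to.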
Thanks for your response chrchang523, will give --output-missing-genotype 0 a try to get the format working.

The map files have varying numbers of rows, one per SNP on each chromosome. For example, I have ~180000 for chr1, so I expect the ped file to have 180000 * 2 columns.

The only reason for .ped is to be able to see what data I am generating; the aim is to work with the .bed/.bim format once the file formatting is taken care of.

How many columns does the .ped actually have?

You might want to try converting to .tped/.tfam (--recode --transpose) instead, that text format might be easier to read (and it's definitely more convenient for PLINK to work with).

The --output-missing-genotype 0 option has helped replace all - with 0. But in either case the --merge option (which I am using to merge data from all chromosomes) still reports ERROR: Problem with MAP file line. There doesn't seem to be a way for me to track down which SNP in particular is causing the issue, as it is reporting the first 6 columns for the sample identifier, followed by genotype info from the lgen file.

The .ped file now has ~180000 * 2 + 6 columns, so that seems to have been correctly generated. Thanks for the tip on transpose. Are there other advantages to transposing the data, or is this a preferred file format? I plan to impute this using 1000 Genomes; none of the info on Shapeit/Impute2 has suggested a .tped file yet, but please let me know if you have experience with that.

ERROR: Problem with MAP file line:
0 ###-# 0 0 1 -9 G G A A A A C C C C A G A A C C C C G G A G C T T C A A C C G G A A T T A A C T C C A G G G C C C T T C T T T T T T A A C T C C C C G G G G G A T C C C C T A G G G C C A G G G A A A A G G A A T T T T T T G G A A C C C C C C G G G G A A C

The "problematic MAP file line" is a properly formatted .ped file line. Try swapping the order of the arguments you're passing to --merge.

.tped files have fewer columns than .ped files, so I find them easier to work with in a text editor. If you're using --merge, though, .ped/.map lets you avoid an extra conversion step.
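
To make the column arithmetic in this exchange concrete, here is a small sketch that computes the expected row width of each text format and guesses from the field count whether a line came from a .map or a .ped file. The function names are made up for illustration. The heuristic assumes whitespace-delimited files with the standard 6 leading .ped columns and a 3- or 4-column .map.

```python
def expected_ped_columns(n_snps):
    """A .ped row has 6 leading columns plus two alleles per SNP."""
    return 6 + 2 * n_snps


def expected_tped_columns(n_samples):
    """A .tped row has 4 leading columns plus two alleles per sample."""
    return 4 + 2 * n_samples


def guess_line_kind(line):
    """Guess 'map', 'ped' or 'unknown' from the number of fields."""
    n_fields = len(line.split())
    if n_fields in (3, 4):
        return "map"
    if n_fields >= 8 and (n_fields - 6) % 2 == 0:
        return "ped"
    return "unknown"


if __name__ == "__main__":
    print(expected_ped_columns(180000))  # 360006
    # Toy lines, not taken from the data in this thread.
    print(guess_line_kind("1 rs123 0 1000"))
    print(guess_line_kind("0 S1 0 0 1 -9 G G A A"))
```

A line that opens with six ID/phenotype fields followed by a long run of alleles, like the one in the error message, falls in the "ped" branch, which is consistent with a .ped file having been passed where a .map was expected.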

Thanks chrchang523! I am able to merge the files successfully; it seems the order of .map .ped in the file list was causing the issue. Take-home message: the files in the merge list should be ordered .ped .map or .bed .bim .fam.
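
The ordering rule above can be sketched as a small helper that writes a merge list with each fileset's files in the order PLINK expects. The chromosome prefixes and the `merge_list.txt` file name are hypothetical, and the helper itself is only an illustration.

```python
def merge_list_lines(prefixes, binary=False):
    """Return one merge-list line per fileset, files in PLINK's expected order."""
    if binary:
        extensions = (".bed", ".bim", ".fam")
    else:
        extensions = (".ped", ".map")
    return [" ".join(prefix + ext for ext in extensions) for prefix in prefixes]


if __name__ == "__main__":
    # Hypothetical per-chromosome prefixes. chr1 is assumed to be the base
    # fileset given on the command line, so only the others are listed.
    lines = merge_list_lines(["chr2", "chr3"])
    with open("merge_list.txt", "w") as handle:
        handle.write("\n".join(lines) + "\n")
    print("\n".join(lines))
```

Assuming chr1 is the base fileset, the resulting file could then be used along the lines of `plink --file chr1 --merge-list merge_list.txt --make-bed --out merged`.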

