8.6 years ago
romsen ▴ 60

Hello again

I have to convert Illumina HumanHap chip data into PLINK format (a PED file). I proceeded as described here, but my generated PED file shows only 0 for each genotype. PLINK prints these warnings during the process:

[...] 50 males, 50 females, and 0 of unspecified sex

Before frequency and genotyping pruning, there are 1000000 SNPs

100 founders and 0 non-founders found

1000000 SNPs with no founder genotypes observed

Warning, MAF set to 0 for these SNPs (see --nonfounders)

Writing list of these SNPs to [ plink.nof ]

Total genotyping rate in remaining individuals is 0 [...]

fam-file:

1    192    0    0    1    0
2    193    0    0    2    0
3    213    0    0    1    0
4    214    0    0    1    0


map-file:

1    rs3934834    0    995669
1    rs3737728    0    1011278
1    rs6687776    0    1020428
1    rs9651273    0    1021403


lgen-file:

[Header]
BSGT Version    3.0.27
Processing Date
Content
Num SNPs    1000000
Total SNPs    1000000
Num Samples    100
Total Samples    100
[Data]
Sample Index    Sample Name    SNP Name    Allele1     Allele2
1    192    rs10000010    A    G
2    193    rs10000010    A    G
3    213    rs10000010    A    G


My lgen file has a 10-row header, followed by the data rows. The genotypes are given as forward alleles exported via BeadStudio (exporting Top alleles gives the same sobering result).

After running PLINK to reconstruct the PED file, I get this PED file with missing genotypes:

1 192 0 0 1 -9 0 0 0 0 0 0 0 0 [...]
2 193 0 0 2 -9 0 0 0 0 0 0 0 0 [...]
3 213 0 0 1 -9 0 0 0 0 0 0 0 0 [...]


Perhaps one of you can spot the mistake or has an idea how to solve the problem. Do I need a reference file, or is the header in the lgen file the problem? Thank you very much.

illumina plink • 6.8k views
Perfect, thank you. PLINK works now, but there is a new error: my file contains too many alleles.

ERROR: Locus rs10000023 has >2 alleles:
individual 12 070 has genotype [ - - ]
but we've already seen [ T ] and [ G ]

I've edited my answer to address the issue.

Nice, I had the same idea. I use Windows, though. Do you have a script for PLINK or Perl?

If you can open the file in a text editor and find and replace all "-" with "0", I think that should work. Otherwise, if you are going to do much bioinformatics work on Windows, I would suggest installing and becoming familiar with Cygwin.

Unfortunately, it's too big. I can't open it in Notepad.

As a slightly less intimidating alternative to installing Cygwin for sed functionality, you could probably follow this blog post about PowerShell.

Hehe, thanks, I'll check this. I've now got it working with Perl:

perl -p -i.bak -e "~s|-|0|" file.lgen

Hi. I'm facing a similar issue to the above. I made a .ped file from a BeadStudio report, but my missing values are specified as "-" rather than 0. The file is too big to find and replace using nano, and the above Perl command replaces only the first occurrence on each line (in this case changing the phenotype specification "-9" to "09"). I'm not familiar with Perl or command-line operations and wondered if anyone could help?

I've managed to get around the phenotype issue by using:

perl -p -i.bak -e "~s|- |0 |" file.lgen


But this still only deals with the first occurrence on each line of the ped file.

perl -p -i.bak -e "~s|- |0 |g" file.lgen


Adding the g flag makes the substitution apply to every occurrence on a line, which fixes this for anyone encountering a similar problem.
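For anyone without Perl or sed (e.g. on Windows), here is a rough Python sketch of the same idea. Instead of a regular expression it splits each line into whitespace-separated fields and replaces only fields that are exactly "-", so the "-9" phenotype is left alone and a "-" at the very end of a line is caught too. The function name and file names are made up for illustration, and it assumes no field contains spaces. Note that it rewrites the delimiters as single spaces. It reads one line at a time, so the file size should not matter.

```python
def replace_missing_alleles(src_path, dst_path, missing="-", replacement="0"):
    """Copy a whitespace-delimited genotype file, replacing every field
    that is exactly `missing` with `replacement`.

    Hypothetical helper, not part of PLINK or BeadStudio.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:  # stream line by line; the file is never loaded whole
            fields = line.split()
            # Only whole fields are compared, so "-9" stays untouched.
            fixed = [replacement if f == missing else f for f in fields]
            dst.write(" ".join(fixed) + "\n")


if __name__ == "__main__":
    # File names are placeholders; adjust to your own data.
    replace_missing_alleles("file.ped", "file.fixed.ped")
```

Writing to a new file rather than editing in place plays the same role as the .bak backup in the Perl command.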

8.6 years ago

I would try removing the header from your lgen file. The PLINK documentation gives an example without the header. If you are using Linux, try:

egrep '^[0-9]+' lgen-file > lgen-file.noheader


This will remove the header by keeping only the lines that start with a digit. All occurrences of "- -", which seems to be Illumina's notation for missing alleles, should then be replaced with "0 0", which seems to be PLINK's notation for missing alleles.
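If egrep is not available (e.g. on Windows), both steps could be sketched in Python roughly as follows. This is only an illustration under a few assumptions: data rows start with a digit (the same test as the egrep pattern above), each row has the five columns shown in the question, no field contains spaces, and missing alleles are written as "-". The function name and file names are hypothetical.

```python
import re

# Same test as the egrep pattern: a data row starts with one or more digits.
DATA_LINE = re.compile(r"^[0-9]+")


def clean_lgen(src_path, dst_path):
    """Drop header lines and recode missing alleles ("-") as "0".

    Returns the number of data rows written. Hypothetical helper.
    """
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:  # stream line by line, so large files are fine
            if not DATA_LINE.match(line):
                continue  # header line such as "[Header]" or "Num SNPs ..."
            fields = line.split()
            # Assumed layout: index, sample, SNP, allele1, allele2.
            # Only the allele columns are recoded.
            fields[3:] = ["0" if a == "-" else a for a in fields[3:]]
            dst.write(" ".join(fields) + "\n")
            kept += 1
    return kept


if __name__ == "__main__":
    # File names are placeholders; adjust to your own data.
    n = clean_lgen("lgen-file", "lgen-file.noheader")
    print(n, "data rows written")
```

The output is space-delimited with one genotype per row, which is the layout of the example in the PLINK documentation.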