Question

Different sample order in .ped and phenotype data files

0


5.7 years ago

Denis ▴ 310

I'm preparing input files for a GWAS with Plink. I've converted my VCF file to .ped and .map files in Plink, and I have a separate file with multiple phenotypes. The sample order in the resulting .ped file differs from the order in the phenotype file. Does the sample order need to be exactly the same in both files? If so, is there an easy way to achieve that? Would it be OK to sort my .ped file by its first column with the Unix sort command while leaving my original .map file unchanged?

Plink GWAS R • 3.1k views

ADD COMMENT • link 5.7 years ago by Denis ▴ 310

score 3 · Accepted Answer · 2018-08-01

3


5.7 years ago

Kevin Blighe 87k

Yes, this caught me off guard long ago, too, and I went crazy with it trying to figure it out.

You can use an individual sort file to control the ordering of your samples as they are converted from VCF / BCF to plink format, by using --indiv-sort file:

plink --noweb --bcf Caucasian.bcf --keep-allele-order --indiv-sort file CaucasianIDSort.list --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out Caucasian ;

The file, CaucasianIDSort.list, contains 2 columns, FID and IID, with one sample per line.

This ordering should obviously match whatever FAM file you are using, too. I always use a custom FAM and specify it in all analyses with

--fam CaucasianCustom.fam
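One way to confirm that a custom FAM follows the same order as the sort list is to compare the first two columns (FID and IID) of both files row by row. The following is only a rough sketch: the helper names and the toy IDs are made up for illustration, and with real data you would pass open file handles (for example open("CaucasianIDSort.list") and open("CaucasianCustom.fam")) instead of the lists.

```python
def read_ids(lines):
    """Return (FID, IID) pairs from whitespace-delimited lines, skipping blank lines."""
    return [tuple(line.split()[:2]) for line in lines if line.strip()]


def first_mismatch(sort_lines, fam_lines):
    """Return the 0-based row index where the two ID lists first differ, or None."""
    sort_ids = read_ids(sort_lines)
    fam_ids = read_ids(fam_lines)
    for i, (a, b) in enumerate(zip(sort_ids, fam_ids)):
        if a != b:
            return i
    if len(sort_ids) != len(fam_ids):
        # One file is a prefix of the other: report the first unmatched row.
        return min(len(sort_ids), len(fam_ids))
    return None


# Toy data, invented for this example (FID is 0 as with --const-fid).
sort_list = ["0 S1", "0 S2", "0 S3"]
fam_ok = ["0 S1 0 0 1 2", "0 S2 0 0 2 1", "0 S3 0 0 1 1"]
fam_bad = ["0 S1 0 0 1 2", "0 S3 0 0 1 1", "0 S2 0 0 2 1"]

print(first_mismatch(sort_list, fam_ok))   # None: same order
print(first_mismatch(sort_list, fam_bad))  # 1: rows differ from the second row on
```

A result of None means the two files list the same samples in the same order; any integer points at the first row to inspect.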

--------------------------------

If you need to worry about family structure, then the family IDs could be encoded in the sample IDs of your VCF / BCF as FID_IID (you can modify VCF headers with bcftools reheader). You then read these into PLINK, maintaining family information, with:

plink --noweb --bcf Families.bcf --keep-allele-order --indiv-sort file FamiliesIDsort.list --vcf-idspace-to _  --id-delim _ --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out Families ;

FamiliesIDsort.list looks like:

Fam1    16367
Fam1    16407
Fam2    16402
Fam2    16382
Fam3    16392
Fam3    16362
Fam4    16372
Fam4    16377
Fam5    16397
Fam5    16387
Fam6    16396
Fam6    16366
Fam7    16405
Fam7    16400
Fam8    16410
Fam8    16376

The sample IDs in the input VCF / BCF look like:

Fam1_16367
Fam1_16407
Fam2_16402
Fam2_16382
Fam3_16392
et cetera
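The two-column sort list can be derived from the FID_IID sample IDs by splitting on the delimiter. The sketch below assumes every ID contains exactly one underscore, which is my understanding of what --id-delim expects; the function name is hypothetical and the IDs are the ones listed above.

```python
def split_sample_id(sample_id, delim="_"):
    """Split 'FID_IID' into (FID, IID); reject IDs without exactly one delimiter."""
    parts = sample_id.split(delim)
    if len(parts) != 2:
        raise ValueError(f"expected exactly one {delim!r} in {sample_id!r}")
    return parts[0], parts[1]


# Sample IDs as they appear in the VCF / BCF header.
vcf_ids = ["Fam1_16367", "Fam1_16407", "Fam2_16402", "Fam2_16382"]

# One tab-separated 'FID IID' row per sample, in the desired order.
sort_rows = ["\t".join(split_sample_id(s)) for s in vcf_ids]
print("\n".join(sort_rows))
```

Writing these rows to a file gives a list in the same layout as FamiliesIDsort.list.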

Kevin

ADD COMMENT • link 5.4 years ago by Kevin Blighe 87k

0


Hi Kevin! Many thanks for your detailed response! It's very useful. I'm wondering whether the row order in the .ped file is somehow linked to the row order in my .map file. Probably not, but I'm not 100% sure. Am I right? Also, how can I convert the binary files to plain-text ones? I'm just learning Plink and would prefer to work with text files. Thank you again! Denis

ADD REPLY • link 5.7 years ago by Denis ▴ 310

0


Hey dude. The map file just contains information on the variants in the PED file. You could re-order the samples (rows) in the PED file but it would be wrong to re-order the columns (variants), because those are inextricably linked to the MAP file.
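To make the PED / MAP relationship concrete: in the standard text format each PED row has 6 leading columns (FID, IID, father, mother, sex, phenotype) followed by 2 allele columns per variant, in the order the variants are listed in the MAP file. A minimal sketch of a consistency check, with invented toy data and a hypothetical function name, might look like this:

```python
def ped_matches_map(ped_lines, map_lines):
    """True if every PED row has 6 + 2 * (number of MAP variants) fields."""
    n_variants = sum(1 for line in map_lines if line.strip())
    expected = 6 + 2 * n_variants
    return all(len(line.split()) == expected
               for line in ped_lines if line.strip())


# Toy data, made up for this example: 2 variants, 2 samples.
map_lines = ["1 rs1 0 1000", "1 rs2 0 2000"]
ped_lines = [
    "Fam1 16367 0 0 1 2 A G C C",
    "Fam2 16402 0 0 2 1 A A C T",
]

print(ped_matches_map(ped_lines, map_lines))                 # rows as given
print(ped_matches_map(sorted(ped_lines, reverse=True), map_lines))  # rows reordered
```

Reordering whole rows leaves each sample's genotype columns intact, which is why sorting the PED by sample does not require touching the MAP file.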

ADD REPLY • link 5.7 years ago by Kevin Blighe 87k

0


Hi Kevin! Thank you so much!

ADD REPLY • link 5.7 years ago by Denis ▴ 310

0


Hi Kevin! I'd like to clarify one more thing related to my post. I supply the phenotype data for all samples as a separate file. I followed your advice and specified --indiv-sort on the command line. However, a number of individuals were filtered out because of a high missing genotype rate. As a result, the .ped file contains a different number of IIDs, in a different order, compared to my phenotype file. Will Plink still be able to merge the genotype and phenotype data correctly for the association test in that case?

ADD REPLY • link 5.7 years ago by Denis ▴ 310

1


If you filter out variants, then it is no problem. If you filter out samples... then I would recommend updating your phenotype information, i.e., in your FAM file, and ensuring that it matches the same order of FID and IID as your --indiv-sort file.

It may involve going back and forward a bit...

Generally, always ensure that your phenotype files and sample sort files have the same order and content (for FID and IID). Plink does not check that the ordering in the PED is the same as per your custom files.
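After samples have been dropped, the phenotype table can be rebuilt in the order of the surviving FAM rows by looking each sample up by its (FID, IID) pair. This is a hedged sketch rather than a PLINK feature: the function name and the phenotype values are invented, and -9 is used as the missing-phenotype code.

```python
def align_phenotypes(fam_lines, pheno_lines, missing="-9"):
    """Return 'FID IID pheno...' rows in FAM order; samples absent from FAM are dropped."""
    pheno = {}
    for line in pheno_lines:
        fields = line.split()
        if len(fields) >= 3:
            pheno[(fields[0], fields[1])] = fields[2:]
    # Pad samples without phenotype data to the same number of columns.
    width = max((len(v) for v in pheno.values()), default=1)

    aligned = []
    for line in fam_lines:
        fields = line.split()
        if not fields:
            continue
        key = (fields[0], fields[1])
        values = pheno.get(key, [missing] * width)
        aligned.append(" ".join([*key, *values]))
    return aligned


# Toy data: the FAM lost sample Fam2/16402 to filtering; phenotype values are made up.
fam = ["Fam1 16407 0 0 1 -9", "Fam1 16367 0 0 2 -9", "Fam2 16382 0 0 1 -9"]
pheno = ["Fam2 16402 1", "Fam1 16367 2", "Fam2 16382 1", "Fam1 16407 1"]

for row in align_phenotypes(fam, pheno):
    print(row)
```

The output has exactly one row per FAM sample, in FAM order, so it can be checked line by line against the sort file.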

ADD REPLY • link 5.7 years ago by Kevin Blighe 87k

0


Thank you so much for your help, Kevin. I finally managed to run the test with plink assoc, this time on 16 samples: 8 controls and 8 cases (just for testing). This is what I got:

CHR  SNP          BP        A1  C_A  C_U  A2  CHISQ  P  OR  SE  L95      U95
1    .            27022976  A   1    1    G   0      1  1   2   0.01984  50.4
1    .            27022990  A   1    NA   G   0      1  NA  NA  NA       NA
1    .            27022992  G   1    1    A   0      1  1   2   0.01984  50.4
1    .            27023030  A   NA   1    G   0      1  NA  NA  NA       NA
1    .            27023035  C   NA   1    CG  0      1  NA  NA  NA       NA
1    .            27023037  A   NA   1    G   0      1  NA  NA  NA       NA
1    .            27023048  A   1    NA   G   0      1  NA  NA  NA       NA
1    .            27023059  A   NA   1    G   0      1  NA  NA  NA       NA
1    .            27023063  A   NA   1    G   0      1  NA  NA  NA       NA
1    .            27023066  G   NA   1    A   0      1  NA  NA  NA       NA
1    rs560502657  27023071  A   NA   1    G   0      1  NA  NA  NA       NA
1    .            27023072  A   NA   1    G   0      1  NA  NA  NA       NA
1    .            27023094  G   NA   1    A   0      1  NA  NA  NA       NA

Does this mean there is no real association between disease and mutations when the P values are 1, and that there is an association when the P values are 0? And why is CHISQ always 0? Maybe I need to tune something else? Thanks a lot!

ADD REPLY • link 5.3 years ago by Pin.Bioinf ▴ 340

0


Yes, each of these p-values appears to be 1. Also, with 8 versus 8, you will find it difficult to get many statistically significant findings from this. In fact, from what I can see, all of the variants listed above are only found in 1 or 2 individuals (look at the C_A and C_U columns). An NA will be returned for the p-value if the variant / SNP is missing in a sample (these would have been encoded as ./. in the VCF, whereas homozygous ref would be encoded 0/0).
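As a worked illustration of why CHISQ is 0 and P is 1, the first row of the output can be reproduced with a plain 2x2 allelic chi-square test and a Woolf-type confidence interval for the odds ratio. This is a sketch under assumptions: the A2 counts are not shown in the output, and the all-ones table used here is only inferred from SE = 2; the function name is hypothetical and the code requires every cell to be non-zero.

```python
import math


def allelic_test(a, b, c, d, z=1.959964):
    """2x2 allele counts: a = A1 in cases, b = A2 in cases,
    c = A1 in controls, d = A2 in controls. All cells must be non-zero."""
    n = a + b + c + d
    chisq = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chisq / 2))       # upper tail of chi-square, 1 df
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(odds_ratio) - z * se)
    hi = math.exp(math.log(odds_ratio) + z * se)
    return chisq, p, odds_ratio, se, lo, hi


# Assumed table for the first row: one copy of each allele in each group.
chisq, p, odds_ratio, se, lo, hi = allelic_test(1, 1, 1, 1)
print(chisq, p, odds_ratio, se)
print(round(lo, 5), round(hi, 1))
```

When the allele is equally frequent in cases and controls, the observed and expected counts coincide, so the chi-square statistic is 0 and the p-value is 1, i.e. no evidence of association. The very wide interval around OR = 1 reflects how little information counts this small carry.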

ADD REPLY • link 5.3 years ago by Kevin Blighe 87k