Question: Different sample order in .ped and phenotype data files
Denis wrote:

I'm preparing input files to run a GWAS with Plink. I've converted my VCF file to .ped and .map files in Plink, and I have a separate file with multiple phenotypes. The sample order in the resulting .ped file differs from the order in the phenotype file. Is it important to make the sample order exactly the same in both files? If so, is there an easy way to do it? Would it be OK to sort my .ped file by the first column with the Unix sort command while leaving my original .map file unchanged?

modified 12 months ago • written 12 months ago by Denis
Kevin Blighe wrote:

Yes, this caught me off guard long ago, too, and I went crazy with it trying to figure it out.

You can use an individual sort file in order to control the ordering of your samples as it's converted from VCF / BCF to plink format, by using --indiv-sort file:

plink --noweb --bcf Caucasian.bcf --keep-allele-order --indiv-sort file CaucasianIDSort.list --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out Caucasian ;

The file, CaucasianIDSort.list, contains 2 columns with FID and IID:

0   17442
0   16427
0   17423
0   17443
0   16451
0   17427
0   17447
0   16456
0   17428
0   17448
0   17412

This ordering should obviously match whatever FAM file you are using, too. I always use a custom FAM and specify it in all analyses with:

--fam CaucasianCustom.fam
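
A quick way to confirm that a sort list and a custom FAM really are in the same order is to compare their first two columns row by row. The following is a minimal Python sketch, not part of PLINK; the helper names `read_ids` and `first_mismatch` and the FAM rows are made up for illustration.

```python
def read_ids(lines):
    """Return (FID, IID) pairs from whitespace-delimited lines, in file order."""
    ids = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            ids.append((fields[0], fields[1]))
    return ids


def first_mismatch(sort_ids, fam_ids):
    """Return the 0-based row of the first difference, or None if identical."""
    for i, (a, b) in enumerate(zip(sort_ids, fam_ids)):
        if a != b:
            return i
    if len(sort_ids) != len(fam_ids):
        # One file is a strict prefix of the other.
        return min(len(sort_ids), len(fam_ids))
    return None


# Toy data: the first three entries of the sort list above, and a
# hypothetical FAM in which the last two samples are swapped.
sort_list = ["0   17442", "0   16427", "0   17423"]
fam_lines = ["0 17442 0 0 0 -9", "0 17423 0 0 0 -9", "0 16427 0 0 0 -9"]

mismatch = first_mismatch(read_ids(sort_list), read_ids(fam_lines))
print("first mismatch at row:", mismatch)
```

In practice the two lists would be read from the real files, for example with `open("CaucasianIDSort.list")` and `open("CaucasianCustom.fam")`.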

--------------------------------

If you need to worry about family structure, then the family IDs can be encoded in the sample IDs of your VCF / BCF as FID_IID (you can modify the sample names in VCF headers with bcftools reheader). You then read these into PLINK, maintaining family information, with:

plink --noweb --bcf Families.bcf --keep-allele-order --indiv-sort file FamiliesIDsort.list --vcf-idspace-to _  --id-delim _ --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out Families ;

FamiliesIDsort.list looks like:

Fam1    16367
Fam1    16407
Fam2    16402
Fam2    16382
Fam3    16392
Fam3    16362
Fam4    16372
Fam4    16377
Fam5    16397
Fam5    16387
Fam6    16396
Fam6    16366
Fam7    16405
Fam7    16400
Fam8    16410
Fam8    16376

The sample IDs in the input VCF / BCF look like:

Fam1_16367
Fam1_16407
Fam2_16402
Fam2_16382
Fam3_16392
et cetera
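
The sort list can be derived from the FID_IID sample names rather than typed by hand. This is a hedged Python sketch: `split_sample_id` is a hypothetical helper, and the rule that each ID must contain exactly one delimiter reflects my reading of how --id-delim behaves, so treat it as an assumption.

```python
def split_sample_id(sample_id, delim="_"):
    """Split 'FID_IID' into (FID, IID), requiring exactly one delimiter."""
    parts = sample_id.split(delim)
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"cannot split {sample_id!r} into FID and IID")
    return parts[0], parts[1]


# Sample names as they appear in the VCF / BCF header above.
vcf_samples = ["Fam1_16367", "Fam1_16407", "Fam2_16402", "Fam2_16382", "Fam3_16392"]

# One tab-separated 'FID IID' line per sample, in the desired order.
sort_lines = ["\t".join(split_sample_id(s)) for s in vcf_samples]
print("\n".join(sort_lines))
```

Writing `sort_lines` to a file would give something usable as FamiliesIDsort.list; reorder `vcf_samples` first if the header order is not the order you want.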

Kevin

modified 9 months ago • written 12 months ago by Kevin Blighe

Hi Kevin! Many thanks for your detailed response, it's very useful. I'm wondering whether the row order in the .ped file is somehow linked to the row order of my .map file. Probably not, but I'm not 100% sure. Am I right? Also, how can I convert binary files to plain-text ones? I'm just learning Plink and would prefer to work with text files. Thank you again! Denis

written 12 months ago by Denis

Hey dude. The map file just contains information on the variants in the PED file. You could re-order the samples (rows) in the PED file but it would be wrong to re-order the columns (variants), because those are inextricably linked to the MAP file.
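
To illustrate the row-versus-column point, here is a small Python sketch that reorders whole PED rows by (FID, IID) and never touches the genotype columns, so the MAP file stays valid. The function name `reorder_ped_rows` and the toy genotypes are invented for this example.

```python
def reorder_ped_rows(ped_lines, wanted_ids):
    """Return PED lines in the order given by wanted_ids, a list of (FID, IID)."""
    by_id = {}
    for line in ped_lines:
        fields = line.split()
        by_id[(fields[0], fields[1])] = line  # keep the whole row intact
    missing = [key for key in wanted_ids if key not in by_id]
    if missing:
        raise KeyError(f"samples not found in PED: {missing}")
    return [by_id[key] for key in wanted_ids]


# Toy PED with two variants (allele columns 7-8 and 9-10); values are made up.
ped = [
    "0 17442 0 0 0 -9 A G C C",
    "0 16427 0 0 0 -9 G G C T",
    "0 17423 0 0 0 -9 A A T T",
]
wanted = [("0", "16427"), ("0", "17423"), ("0", "17442")]

reordered = reorder_ped_rows(ped, wanted)
print("\n".join(reordered))
```

Each output row is an unmodified input row, so variant 1 is still in columns 7-8 and variant 2 in columns 9-10, matching rows 1 and 2 of the MAP.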

written 12 months ago by Kevin Blighe

Hi Kevin! Thank you so much!

written 12 months ago by Denis

Hi Kevin! I'd like to clarify one more thing related to my post. I supply phenotype data for all samples as a separate file. I followed your advice and specified --indiv-sort on my command line. But a number of individuals were filtered out because of a high missing genotype rate. As a result, the .ped file has a different number of IIDs, in a different order, than my phenotype file. Will Plink still merge the genotype and phenotype data correctly for the association test in that case?

modified 12 months ago • written 12 months ago by Denis

If you filter out variants, then it is no problem. If you filter out samples, then I would recommend updating your phenotype information, i.e., in your FAM file, and ensuring that it matches the order of FID and IID in your --indiv-sort file.

It may involve going back and forward a bit...

Generally, always ensure that your phenotype files and sample sort files have the same order and content (for FID and IID). Plink does not check that the ordering in the PED is the same as per your custom files.
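
One way to cut down on the back and forth after samples have been filtered out is to rebuild the phenotype table from the FAM that actually came out of PLINK. The sketch below is plain Python, not a PLINK feature: `align_phenotypes` is a hypothetical helper, the sample rows are made up, and it assumes a whitespace-delimited phenotype file with FID and IID in the first two columns and no header line.

```python
MISSING = "-9"  # PLINK's default missing-phenotype code


def align_phenotypes(fam_lines, pheno_lines, missing=MISSING):
    """Return phenotype rows restricted to, and ordered like, the FAM samples."""
    pheno = {}
    n_traits = 0
    for line in pheno_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete rows
        pheno[(fields[0], fields[1])] = fields[2:]
        n_traits = max(n_traits, len(fields) - 2)

    aligned = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        fid, iid = fields[0], fields[1]
        # Samples without a phenotype row get the missing code for every trait.
        values = pheno.get((fid, iid), [missing] * n_traits)
        aligned.append("\t".join([fid, iid] + values))
    return aligned


# Toy data: sample 16427 was filtered out of the genotypes, and the made-up
# sample 17999 has no phenotype record.
fam = ["0 17442 0 0 0 -9", "0 17423 0 0 0 -9", "0 17999 0 0 0 -9"]
pheno = ["0 16427 2 31.5", "0 17423 1 28.0", "0 17442 2 25.1"]

aligned = align_phenotypes(fam, pheno)
print("\n".join(aligned))
```

The result has one row per sample remaining in the FAM, in FAM order, which is the invariant recommended above.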

written 12 months ago by Kevin Blighe

Thank you so much for your help, Kevin. I finally managed to use plink assoc to do the test, and this time I did it for 16 samples, 8 controls and 8 cases (just for testing). This is what I got:

CHR  SNP          BP        A1  C_A  C_U  A2  CHISQ  P  OR  SE  L95      U95
1    .            27022976  A   1    1    G   0      1  1   2   0.01984  50.4
1    .            27022990  A   1    NA   G   0      1  NA  NA  NA       NA
1    .            27022992  G   1    1    A   0      1  1   2   0.01984  50.4
1    .            27023030  A   NA   1    G   0      1  NA  NA  NA       NA
1    .            27023035  C   NA   1    CG  0      1  NA  NA  NA       NA
1    .            27023037  A   NA   1    G   0      1  NA  NA  NA       NA
1    .            27023048  A   1    NA   G   0      1  NA  NA  NA       NA
1    .            27023059  A   NA   1    G   0      1  NA  NA  NA       NA
1    .            27023063  A   NA   1    G   0      1  NA  NA  NA       NA
1    .            27023066  G   NA   1    A   0      1  NA  NA  NA       NA
1    rs560502657  27023071  A   NA   1    G   0      1  NA  NA  NA       NA
1    .            27023072  A   NA   1    G   0      1  NA  NA  NA       NA
1    .            27023094  G   NA   1    A   0      1  NA  NA  NA       NA

Does this mean there is no real association between disease and mutations when the P values are 1, and that there is an association when the P values are 0? And why is CHISQ always 0? Maybe I need to tune something else? Thanks a lot!

modified 7 months ago • written 7 months ago by Pin.Bioinf

Yes, each of these p-values appears to be 1. Also, with 8 versus 8, you will find it difficult to get many statistically significant findings from this. In fact, from what I can see, all of the variants listed above are only found in 1 or 2 individuals (look at the C_A and C_U columns). An NA will be returned for the p-value if the variant / SNP is missing in a sample (these would have been encoded as ./. in the VCF, whereas homozygous ref would be encoded 0/0).
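
To see why CHISQ is 0 and P is 1 in the first row of the table, it helps to work the test by hand. The Python sketch below applies a standard 1-df chi-square test to a 2x2 allele-count table, which I assume is what the basic association test reports. The A2 counts do not appear in the output; the value of 1 in each group is an inference from the reported SE of 2, so treat it as an assumption.

```python
import math


def allelic_test(a1_cases, a2_cases, a1_controls, a2_controls):
    """1-df chi-square test and odds ratio for a 2x2 allele-count table.

    All four counts must be non-zero for the odds ratio and its SE.
    """
    n = a1_cases + a2_cases + a1_controls + a2_controls
    cases = a1_cases + a2_cases
    controls = a1_controls + a2_controls
    a1_total = a1_cases + a1_controls
    a2_total = a2_cases + a2_controls

    # Shortcut formula for the Pearson chi-square statistic of a 2x2 table.
    diff = a1_cases * a2_controls - a2_cases * a1_controls
    chisq = n * diff ** 2 / (cases * controls * a1_total * a2_total)

    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chisq / 2))

    odds_ratio = (a1_cases * a2_controls) / (a2_cases * a1_controls)
    se = math.sqrt(1 / a1_cases + 1 / a2_cases + 1 / a1_controls + 1 / a2_controls)
    l95 = math.exp(math.log(odds_ratio) - 1.96 * se)
    u95 = math.exp(math.log(odds_ratio) + 1.96 * se)
    return chisq, p_value, odds_ratio, se, l95, u95


# C_A = 1 and C_U = 1 come from the table; the A2 counts of 1 are assumed.
chisq, p_value, odds_ratio, se, l95, u95 = allelic_test(1, 1, 1, 1)
print(chisq, p_value, odds_ratio, se, round(l95, 5), round(u95, 1))
```

With identical counts in cases and controls, the observed table equals the expected one, so the statistic is 0 and the p-value is 1. A p-value near 0, not 1, would be the evidence for association, and the very wide confidence interval shows how little information so few observed alleles carry.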

modified 7 months ago • written 7 months ago by Kevin Blighe