Question

Using Plink, get allele frequencies for a subset of individuals

0

Entering edit mode

6.6 years ago

Mr Locuace ▴ 160

Using Plink, I would like to calculate allele frequencies for a subset of individuals (cases) from a total cohort of 188 individuals (92 cases + 96 controls). I have proper .ped and .map files.

I have tried multiple options but I am not able to subset data. These are two typical ways I have tried:

1)

./plink --file /path/in_data --chr 1-22 --allow-extra-chr --filter  /path/cases.raw 1 --freq --make-bed --out out_data_cases

The .raw file has this format:

CAV-001 CAV-001 1
CAV-002 CAV-002 1
CAV-003 CAV-003 0
CAV-004 CAV-004 1

where the first column is Family ID, second column Individual ID, and in third column 1 are cases and 0 controls. I want to subset cases. All Family ID = Individual ID

2)

./plink --file /path/in_data --chr 1-22 --allow-extra-chr --keep  /path/cases.txt --freq --make-bed --out out_data_cases

The cases.txt file includes columns 1 and 2 from the .raw file.

This is what I get in the .log file (some paths are not shown):

16384 MB RAM detected; reserving 8192 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (868263 variants, 188 people).
....
868263 variants loaded from .bim file.
188 people (0 males, 0 females, 188 ambiguous) loaded from .fam.
Ambiguous sex IDs written to xxxx.nosex
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 188 founders and 0 nonfounders present.
Calculating allele frequencies... done.
--freq: Allele frequencies (founders only) written to xxxx.frq
868263 variants and 188 people pass filters and QC.
Note: No phenotypes present.
--make-bed to ....

My question is similar as Cannot remove subjects from Plink files but I have tried what they suggest there, without positive outcome. Please help !

Plink Subset individuals • 5.4k views

ADD COMMENT • link updated 6.6 years ago by pfs ▴ 280 • written 6.6 years ago by Mr Locuace ▴ 160

score 1 · Answer 1 · 2017-09-25

1

Entering edit mode

6.6 years ago

Kevin Blighe 87k

My guess is an encoding issue with your hyphens. PLINK may have converted these to something else when you read your data into PLINK initially - have you checked the sample names as they appear in PLINK?

Also check the formatting of the hyphen in your /path/cases.txt file. There are different encodings of hyphens in ASCII.

ADD COMMENT • link 6.6 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you @Kevin Blighe. Plink should recognize the hyphens form the individuals names of the .ped file, as I am able to get the allele frequencies for the whole cohort but not for the cases subset. I did a trial just doing copy & paste from a few individuals from the .fam file (and also from the .ped file) and I get the same output.

ADD REPLY • link 6.6 years ago by Mr Locuace ▴ 160

0

Entering edit mode

Yes, you're correct that it should accept hyphens.

I realise that I have never actually used --keep in PLINK. For sample filtering, I have always conducted it outside of PLINK when I have my data in BCF or VCF format, and I filter with a command like this:

bcftools view -Ob --samples-file KeepCases.list MyData.bcf > MyDataCases.bcf

bcftools index MyDataCases.bcf

I then convert these BCFs to PLINK format (> PLINK 1.9 required)

plink --noweb --bcf MyDataCases.bcf --keep-allele-order --indiv-sort file KeepCases.IDSort.list --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out MyDataCases

KeepCases.list is jst a single column file of sample IDs KeepCases.IDSort.list is the standard 2-column format, as per your own file. It instructs PLINK to read in the data and maintain the sample ordering as per KeepCases.IDSort.list

I don't know if this is an option for you or not!

ADD REPLY • link 6.6 years ago by Kevin Blighe 87k

1

Entering edit mode

Thank you @Kevin Blighe, I will try that. I have Plink 1.9

ADD REPLY • link 6.6 years ago by Mr Locuace ▴ 160

score 1 · Answer 2 · 2017-09-25

1

Entering edit mode

6.6 years ago

Maxime Lamontagne ★ 2.3k

Create a file with only the FID and IID of the subjects you want to keep (NewFile.txt). Then try this :

./plink --file /path/in_data --chr 1-22 --allow-extra-chr --keep NewFile.txt --pheno /path/cases.txt --freq --make-bed --out out_data_case

ADD COMMENT • link 6.6 years ago by Maxime Lamontagne ★ 2.3k

0

Entering edit mode

Thank you @Maxime Lamontagne. I got the same output, just a .log file and a .nosex file with the 188 individuals. In addition, I got an error of a supposedly Duplicate ID (I check the files and I don't have duplicates):

Error: Duplicate ID 'CAV-131 CAV-131'.

ADD REPLY • link 6.6 years ago by Mr Locuace ▴ 160

0

Entering edit mode

Try this :

./plink --file /path/in_data --chr 1-22 --allow-extra-chr --keep NewFile.txt --make-bed --out Temp1 --allow-no-sex

./plink --bfile Temp1 --pheno /path/cases.txt --freq --out out_data_case --allow-no-sex

ADD REPLY • link 6.6 years ago by Maxime Lamontagne ★ 2.3k

0

Entering edit mode

Thank you very much @Maxime Lamontagne. Now it partially works. Somehow I get the allele frequencies for 5 individuals, but not for the 92 cases and my NewFile.txt has 92 individuals.

ADD REPLY • link 6.6 years ago by Mr Locuace ▴ 160

0

Entering edit mode

I gues one of your file is incorrectly formatted. If you want, I could take a quick look at them.

ADD REPLY • link 6.6 years ago by Maxime Lamontagne ★ 2.3k

score 0 · Answer 3 · 2017-09-25

0

Entering edit mode

6.6 years ago

pfs ▴ 280

Your -keep file format is wrong. Create one -keep file that only includes the people with a '1' identifier. Then run these two commands.

./plink --file /path/in_data --keep new_file.txt --freq --out control
/plink --file /path/in_data --remove new_file.txt --freq --out case

ADD COMMENT • link 6.6 years ago by pfs ▴ 280

0

Entering edit mode

Thank you @pfs, but I already mentioned that I did what you suggest in 2). Either you use --keep without identifier (.txt file with 2 columns) or you use --filter with identifier (.raw file with 3 columns, the third one has the identifier). Even if I use the --filter with only cases (1s), I get the same output.

ADD REPLY • link 6.6 years ago by Mr Locuace ▴ 160

0

Entering edit mode

Can you double check the path to the --keep file and make sure this file has the expected number of lines (wc -l keep_file). What version of Plink are you running?

ADD REPLY • link 6.6 years ago by pfs ▴ 280