Question

Subsetting individuals with Plink. Error: Line 1 of --keep file has fewer tokens than expected


8.2 years ago

msimmer92 ▴ 310

I have files with the 2504 individuals of the 1000 genomes project, and I want to filter by population. I did the following for the first population (ACB):

plink --file all1000gen --keep indACB.txt --make-bed --out all1000genACB

but it gives back the following error:

Error: Line 1 of --keep file has fewer tokens than expected.

My indACB.txt file looks like this:

head indACB.txt 
HG01879
HG01880
HG01882
HG01883
HG01885
HG01886
HG01889
HG01890
HG01894
HG01896

which I made (for each population, using grep) from the population information file available on the 1000 Genomes page. That file lists the individual ID twice (first two columns), followed by a column with the population name, as shown:

head indpop2.txt
HG00096 HG00096 GBR
HG00097 HG00097 GBR
HG00099 HG00099 GBR
HG00100 HG00100 GBR
HG00101 HG00101 GBR
HG00102 HG00102 GBR
HG00103 HG00103 GBR
HG00105 HG00105 GBR
HG00106 HG00106 GBR
HG00107 HG00107 GBR

I think there's a problem with my --keep file, but I'm not sure what structure the text file is expected to have.

I also tried grepping the ACB individuals from indpop2.txt, so the new file (indACB2.txt) looks like this:

head indACB2.txt 
HG01879 HG01879 ACB
HG01880 HG01880 ACB
HG01882 HG01882 ACB
HG01883 HG01883 ACB
HG01885 HG01885 ACB
HG01886 HG01886 ACB
HG01889 HG01889 ACB
HG01890 HG01890 ACB
HG01894 HG01894 ACB
HG01896 HG01896 ACB

But it yields the following error:

plink --file allconcat39 --keep indACB2.txt --make-bed --out allconcat43ACB

Error: No people remaining after --keep.



score 6 · Accepted Answer · 2017-05-05


8.2 years ago

willgilks ▴ 360

From https://www.cog-genomics.org/plink/1.9/filter:

--keep accepts a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column, and removes all unlisted samples from the current analysis.

So the file should look like this:

HG00096 HG00096
HG00097 HG00097
HG00099 HG00099
HG00100 HG00100
HG00101 HG00101
HG00102 HG00102
HG00103 HG00103
HG00105 HG00105
HG00106 HG00106
HG00107 HG00107
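
A two-column file like the one above can be built directly from the three-column population file shown in the question. The sketch below is only an illustration: the function names make_keep_file and count_matching_ids, and the output name keepACB.txt, are hypothetical and not part of plink. It assumes the input has the family ID, within-family ID and population code in columns 1 to 3, as in indpop2.txt. The second helper counts how many FID/IID pairs in the keep file also appear in the first two columns of the .ped (or .fam) file; a count of 0 would be consistent with the "No people remaining after --keep" error, although other causes are possible.

```shell
# Hypothetical helper: write a two-column (FID IID) keep file for one population.
# Assumes columns 1-3 of the input are FID, IID and population code.
make_keep_file() {
    pop="$1"; infile="$2"; outfile="$3"
    awk -v pop="$pop" '$3 == pop { print $1, $2 }' "$infile" > "$outfile"
}

# Hypothetical helper: count the FID/IID pairs of a keep file that also occur
# in the first two columns of a sample file (.ped or .fam).
count_matching_ids() {
    keepfile="$1"; samplefile="$2"
    awk 'NR == FNR { seen[$1 FS $2] = 1; next }
         ($1 FS $2) in seen { n++ }
         END { print n + 0 }' "$keepfile" "$samplefile"
}

# Example usage with the file names from the question:
#   make_keep_file ACB indpop2.txt keepACB.txt
#   count_matching_ids keepACB.txt all1000gen.ped
#   plink --file all1000gen --keep keepACB.txt --make-bed --out all1000genACB
```

If the count is 0, it may be worth looking at the first two columns of the .ped file itself, since the family IDs stored there might not be the same as the individual IDs.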