PRS in UK Biobank - no covariate file and no phenotype file
12 days ago
agamemnon • 0

Hi there, I am trying to compute a polygenic risk score (PRS) from UK Biobank PLINK data using PRSice-2. The problem is that I have neither a covariate file nor a phenotype file. How do I generate them?

Thanks

UK Biobank PRS • 489 views
0
12 days ago
Sam ★ 3.8k

You should at least have a phenotype of interest to work on. If not, you need to define more clearly what you are trying to do so that we can help.

Depending on your phenotype, you will usually include the PCs, genotyping batch, assessment centre, and maybe sex and age. All of that information should come with your UK Biobank application. I am not sure how your UK Biobank data are organized, so it is rather difficult for me to give direct advice. A more general guide can be found here: https://choishingwan.gitlab.io/ukb-administration/

0
12 days ago
agamemnon • 0

Hi Sam,

I have access to two different target datasets. The first has .bed, .bim and .fam files. The second has .bed, .bim, .bgen and .bgen.bgi files. The latter has no .fam file, so I can't run the QC on the second dataset. There is no covariate file. The phenotype I am trying to look at is Parkinson's disease.

0

If not already included in your UKB application, you can update your data basket to include, for example, the ICD10 codes (Data-Field 41202) and retrieve the subjects who have Parkinson's disease. Then make your phenotype file yourself.
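As a rough illustration of "make your phenotype file yourself", here is a minimal Python sketch. It assumes the field was exported to a CSV with an `eid` column and one column per array entry named `41202-0.0`, `41202-0.1`, and so on (the naming may differ depending on how your data were converted). G20 is the ICD-10 code for Parkinson's disease. The function name, the output column name `PD`, and the 1/0 case/control coding are illustrative choices, and everyone without a G20 entry in this field is simply treated as a control.

```python
import csv

PD_ICD10_PREFIX = "G20"  # ICD-10 code for Parkinson's disease


def write_pd_phenotype(ukb_csv, out_path, field="41202"):
    """Write a tab-delimited phenotype file with columns FID, IID, PD.

    Assumes `ukb_csv` has an `eid` column plus columns such as
    41202-0.0, 41202-0.1, ... (assumed export naming).
    """
    with open(ukb_csv, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        # Collect every array entry of the requested field.
        icd_cols = [c for c in reader.fieldnames if c.startswith(field + "-")]
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerow(["FID", "IID", "PD"])
        for row in reader:
            # Case if any recorded diagnosis starts with G20, otherwise control.
            is_case = any((row[c] or "").startswith(PD_ICD10_PREFIX) for c in icd_cols)
            writer.writerow([row["eid"], row["eid"], int(is_case)])


# Example call (hypothetical file names):
# write_pd_phenotype("ukb_41202.csv", "pd.pheno")
```

The eid is written as both FID and IID, which assumes your .fam or .sample file uses the same ID in both columns.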

0

There should be a .sample file for your bgen files, which acts as the .fam file for your bgen data.

As for covariates, they should always come with your application if you have access to the genotype data. You just need to extract them from the phenotype file. For example, the PCs have a field ID of 22009 (40 arrays, one for each PC), genotype batch is field 22000, sex is 31, age is 21003 and assessment centre is 54. There are multiple ICD fields, and you might have to search for them yourself (too lazy to type them all out).
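A minimal Python sketch of that extraction, assuming the phenotype data were exported to one wide CSV with an `eid` column and columns named `<field>-0.0` for the single-valued fields and `22009-0.1` to `22009-0.40` for the PCs. That column naming, the function name, the output column names (`Sex`, `Age`, `Batch`, `Centre`, `PC1`...), and the choice of 10 PCs are all assumptions for illustration.

```python
import csv

# Field IDs from the reply above, mapped to hypothetical output column names.
SINGLE_FIELDS = {"31": "Sex", "21003": "Age", "22000": "Batch", "54": "Centre"}


def write_covariates(ukb_csv, out_path, n_pcs=10):
    """Write a tab-delimited covariate file: FID, IID, Sex, Age, Batch, Centre, PC1..PCn."""
    with open(ukb_csv, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        pc_names = [f"PC{i}" for i in range(1, n_pcs + 1)]
        writer.writerow(["FID", "IID"] + list(SINGLE_FIELDS.values()) + pc_names)
        for row in reader:
            # Instance 0, array 0 of each single-valued field (assumed naming).
            singles = [row[f"{fid}-0.0"] or "NA" for fid in SINGLE_FIELDS]
            # Field 22009 is assumed to be exported as 22009-0.1 ... 22009-0.40.
            pcs = [row[f"22009-0.{i}"] or "NA" for i in range(1, n_pcs + 1)]
            writer.writerow([row["eid"], row["eid"]] + singles + pcs)


# Example call (hypothetical file names):
# write_covariates("ukb_covariates.csv", "ukb.cov", n_pcs=10)
```

Renaming the raw `31-0.0`-style headers to readable names is not required, but it makes the covariate columns easier to refer to later. Missing values are written as NA.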

0

To clarify, would I stratify the cohort according to age, gender, ethnicity, genotype batch, etc.?

To reduce confounding, how would you use the data from the multiple field IDs in a PRSice pipeline?

0

You would include that information as covariates. For PRSice, that is the --cov parameter. Anything coded as a factor (e.g. batch and centre) should be provided through --cov-factor.
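To make the flags concrete, here is a small Python sketch that only assembles and prints a PRSice-2 command line; it does not run anything. The binary name `PRSice_linux`, all file names, and the column names (`PD`, `Sex`, `Age`, `Batch`, `Centre`, `PC1`...) are hypothetical placeholders. The `@PC[1-10]` shorthand for `--cov-col` is taken from my reading of the PRSice-2 documentation, so please check it against your version.

```python
import shlex


def prsice_command(base, target, pheno, cov, out, n_pcs=10):
    """Assemble a PRSice-2 command as an argument list (nothing is executed)."""
    return [
        "PRSice_linux",                 # assumed name of the PRSice-2 binary
        "--base", base,                 # GWAS summary statistics
        "--target", target,             # target genotype prefix
        "--binary-target", "T",         # case/control phenotype
        "--pheno", pheno,
        "--pheno-col", "PD",            # hypothetical phenotype column name
        "--cov", cov,
        "--cov-col", f"Sex,Age,Batch,Centre,@PC[1-{n_pcs}]",
        "--cov-factor", "Batch,Centre", # categorical covariates, as advised above
        "--out", out,
    ]


cmd = prsice_command("pd_gwas.txt", "ukb_imp", "pd.pheno", "ukb.cov", "pd_prs")
print(shlex.join(cmd))  # copy-pasteable, shell-quoted command line
```

Building the command as a list keeps the quoting of `@PC[1-10]` safe when it is pasted into a shell.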

0

Thank you, I have the right target data and will also extract the phenotype and covariate data.

While running the first script to QC the target data, I got the following "error":

7402791 variants loaded from .bim file.
487409 people (0 males, 0 females, 487409 ambiguous) loaded from .fam.
Ambiguous sex IDs written to /path/to/file


Am I missing something, or why is the gender ambiguous for the UK Biobank dataset? Does it have something to do with the fact that each chromosome has its own files?

The code I am running is as follows:

plink \
--bfile ~/path/to/file/ukb_imp_chr1 \
--maf 0.01 \
--hwe 1e-6 \
--geno 0.01 \
--mind 0.01 \
--write-snplist \
--make-just-fam \
--out ~/path/to/file/ukb_imp_chr1.QC

0

UK Biobank did not store the sex information in the .fam file. You will need to extract it from the phenotype database.
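If you do want the sex codes in the PLINK files, one option is PLINK's --update-sex, which reads a text file of FID, IID and sex (1 = male, 2 = female, 0 = unknown). UK Biobank field 31 is coded 0 = female, 1 = male, so the values need recoding. A minimal Python sketch, assuming a CSV with `eid` and `31-0.0` columns and that FID equals IID in your .fam file (the function and file names are illustrative):

```python
import csv

# UKB field 31: 0 = female, 1 = male. PLINK: 1 = male, 2 = female, 0 = unknown.
UKB_TO_PLINK_SEX = {"0": "2", "1": "1"}


def write_update_sex(sex_csv, out_path, column="31-0.0"):
    """Write a space-delimited FID IID SEX file suitable for `plink --update-sex`."""
    with open(sex_csv, newline="") as src, open(out_path, "w") as dst:
        for row in csv.DictReader(src):
            # Anything missing or unexpected becomes 0 (unknown).
            code = UKB_TO_PLINK_SEX.get((row[column] or "").strip(), "0")
            dst.write(f"{row['eid']} {row['eid']} {code}\n")


# Example call (hypothetical file names):
# write_update_sex("31.csv", "ukb_sex_update.txt")
```

The resulting file could then be passed to PLINK with `--update-sex ukb_sex_update.txt` alongside the existing `--bfile` option.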

0
3 days ago
agamemnon • 0

I have access to the phenotype dataset and have extracted fields 31 (gender) and 21003 (age). I want to clarify the headers for the files: my 31.csv has the header columns eid and 31-0.0, and the age file has a similar header. Do I have to rename the headers, or can I just pass the files through as --cov 31.csv 21003.csv?

For example:

plink \
--bfile ~/path/to/file/ukb_imp_chr1 \
--maf 0.01 \
--hwe 1e-6 \
--geno 0.01 \
--mind 0.01 \
--write-snplist \
--make-just-fam \
--cov ~/31.csv ~/21003.csv \
--out ~/path/to/file/ukb_imp_chr1.QC