Question

Many SNPs getting excluded during PRSice 2 analysis

2.4 years ago

m.shoaib • 0

Hello, I am running PRSice-2, but many variants are not being included in the analysis.

My code:

Rscript PRSice.R \
    --prsice PRSice_linux \
    --dir /home/projects/base-files/T2test/PRSice \
    --base T2baseNoBMI.uniq.txt \
    --target contn.analysis.PRS# \
    --pheno T2.train.control60.txt \
    --cov covariates.cont60.txt \
    --binary-target T \
    --snp rsID \
    --chr CHR \
    --bp BP \
    --A1 A1 \
    --A2 A2 \
    --stat BETA \
    --pvalue Pvalue \
    --extract PRSice.valid \
    --print-snp \
    --score sum \
    --out PRSice-res

My log file suggests that 1476197 SNPs are not being included:

5965221 variant(s) observed in base file, with:
1310476 variant(s) excluded based on user input
4654745 total variant(s) included from base file

Loading Genotype info from target
==================================================

187786 people (86770 male(s), 101016 female(s)) observed
187786 founder(s) included

1476197 variant(s) not found in previous data
226 variant(s) with mismatch information
4654745 variant(s) included

In my base file, the first allele column is the effect allele (A1) and the second is the reference allele (A2). In the target data, which is in bed/bim/fam format, the first allele column is the reference allele and the second is the effect allele. Is this causing the exclusion of so many SNPs? How can I fix this?

Or is there something else I am missing?

Thanks a lot

PRSice2 UKBB • 1.6k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 2.4 years ago by m.shoaib • 0

score 2 · Accepted Answer · 2021-11-26

2.4 years ago

Sam ★ 4.7k

Simply put, the rsID or SNP ID in your GWAS was not found in your target. There's still a very healthy amount of overlap, so I wouldn't be too concerned.
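To see how many base IDs are actually present in the target, one could count them directly from the shell. This is only a minimal sketch and not part of PRSice: `id_overlap` is a hypothetical helper name, and it assumes the base file is whitespace-delimited with a header row that contains the ID column, and that the target `.bim` files follow the standard PLINK layout with the variant ID in column 2.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count base-file IDs that are present / absent in PLINK .bim files.
# Usage: id_overlap BASE_FILE ID_COLUMN_NAME BIM_FILE [BIM_FILE...]
id_overlap() {
    local base="$1" idcol="$2"
    shift 2
    awk -v base="$base" -v idcol="$idcol" '
        # .bim files are read first; column 2 holds the variant ID.
        FILENAME != base { seen[$2] = 1; next }
        # Base header line: locate the ID column by name.
        FNR == 1 {
            for (i = 1; i <= NF; i++) if ($i == idcol) c = i
            if (!c) { err = 1; exit 1 }
            next
        }
        # Base data lines: is this ID present in any .bim file?
        ($c in seen) { found++; next }
        { missing++ }
        END {
            if (err) { print "ID column not found: " idcol > "/dev/stderr"; exit 1 }
            printf "found=%d missing=%d\n", found, missing
        }
    ' "$@" "$base"
}
```

For the files in this thread the call might look like `id_overlap T2baseNoBMI.uniq.txt rsID contn.analysis.PRS{1..22}.bim` (bash brace expansion, assuming one `.bim` file per autosome as implied by the `#` placeholder in `--target`). Note that this counts all base variants, before any `--extract` filtering.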


OK, so you think excluding 1.5 million variants won't make much difference, since almost 4.5 million variants are already considered by PRSice? Actually, I used an external GWAS that has only chromosome and base-position columns. To get rsIDs, I matched its base positions against my target data (UKBB) and took the rsIDs from there. So the base and target should have the same set of matching rsIDs, but I am still confused about why PRSice is unable to process millions of SNPs in my analysis.

ADD REPLY • link 2.4 years ago by m.shoaib • 0

So, judging from the number of SNPs, your base data is in chr:bp format and your target has rsIDs. You can check the overlap between the two using R:

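library(data.table)

# Base (GWAS summary statistics) file
base <- fread("T2baseNoBMI.uniq.txt")

# Read and stack the per-chromosome .bim files
# (no header, so columns are V1..V6: chr, ID, cM, bp, allele 1, allele 2)
bim_files <- paste0("contn.analysis.PRS", 1:22, ".bim")
target <- rbindlist(lapply(bim_files, fread))

# Build a chr:bp identifier for each target variant
target[, chrID := paste(V1, V4, sep = ":")]

# Count how many target IDs appear in the base file, and vice versa
table(target[, chrID] %in% base[, rsID])
table(base[, rsID] %in% target[, chrID])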

and see what numbers you are getting. I am assuming your rsID column in your base is the original ID.
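Since the original question also asked about allele order, a complementary check is to tally, for variants shared by ID, whether the base alleles agree with the target `.bim` alleles in the same order, in swapped order, or not at all. The following is a rough sketch under stated assumptions, not a description of what PRSice does internally: `allele_tally` is a hypothetical helper name, the base file is assumed to be whitespace-delimited with a header row, and the `.bim` files are assumed to have the ID in column 2 and the two alleles in columns 5 and 6.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally allele agreement for variants shared (by ID)
# between a base file and one or more PLINK .bim files.
# Usage: allele_tally BASE_FILE ID_COL A1_COL A2_COL BIM_FILE [BIM_FILE...]
allele_tally() {
    local base="$1" idcol="$2" a1col="$3" a2col="$4"
    shift 4
    awk -v base="$base" -v idcol="$idcol" -v a1col="$a1col" -v a2col="$a2col" '
        # .bim files are read first: store the allele pair for each variant ID.
        FILENAME != base { bim[$2] = toupper($5) " " toupper($6); next }
        # Base header line: find the ID and allele columns by name.
        FNR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == idcol) id = i
                if ($i == a1col) a1 = i
                if ($i == a2col) a2 = i
            }
            if (!id || !a1 || !a2) { err = 1; exit 1 }
            next
        }
        # Skip base variants whose ID is not in the target at all.
        !($id in bim) { next }
        {
            fwd = toupper($a1) " " toupper($a2)
            rev = toupper($a2) " " toupper($a1)
            if (bim[$id] == fwd) same++
            else if (bim[$id] == rev) swapped++
            else other++   # includes strand complements and true mismatches
        }
        END {
            if (err) { print "column not found in base header" > "/dev/stderr"; exit 1 }
            printf "same=%d swapped=%d other=%d\n", same, swapped, other
        }
    ' "$@" "$base"
}
```

With the column names used in this thread, a call might look like `allele_tally T2baseNoBMI.uniq.txt rsID A1 A2 contn.analysis.PRS{1..22}.bim`. A large `swapped` count alongside a small `other` count would suggest that allele order differs between the files but the variants themselves still correspond.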

ADD REPLY • link 2.4 years ago by Sam ★ 4.7k