Question

Identify overlapping SNPs

7.1 years ago

genogeno • 0

Hi everyone,

I have HapMap data plus another data set (9 populations in total), and I will apply PCA to the combined data. I merged the data sets using PLINK with --merge-list. Now I have the files mergeddata.bim, mergeddata.bed, and mergeddata.fam.

How can I list the SNPs that overlap across the nine files in R?

And what should I do after I identify the overlapping SNPs?

Note: I am really new to this area and I am using Linux.

Thank you

R SNP • 3.9k views

score 0 · Answer 1 · 2017-03-13

7.1 years ago

GabrielMontenegro ▴ 670

Since you have already merged them, mergeddata.bim contains the overlapping set of SNPs for your complete data set. This file has the following columns:

1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
2. Variant identifier
3. Position in morgans or centimorgans (safe to use dummy value of '0')
4. Base-pair coordinate (normally 1-based, but 0 ok; limited to 2^31-2)
5. Allele 1 (corresponding to clear bits in .bed; usually minor)
6. Allele 2 (corresponding to set bits in .bed; usually major)

If you read the file in R, you can just extract the second column and that would be the list of SNPs.

If you want to do it in Linux, you could run, for instance: cut -f2 mergeddata.bim > snps.txt, which extracts the second column of that file.
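As a minimal sketch of the extraction step, the snippet below builds a toy three-variant .bim file (the file name toy.bim, the variant IDs, and the temporary directory are made up for illustration) and pulls out column 2. The awk variant is an alternative that splits on any whitespace, which may help if a .bim file turns out to be space-delimited rather than tab-delimited.

```shell
# Toy three-variant .bim file (tab-delimited, no header line); illustration only.
workdir=$(mktemp -d)
printf '1\trs1\t0\t1000\tA\tG\n1\trs2\t0\t2000\tC\tT\n2\trs3\t0\t1500\tG\tA\n' > "$workdir/toy.bim"

# Column 2 holds the variant identifiers; cut splits on tabs by default.
cut -f2 "$workdir/toy.bim" > "$workdir/snps.txt"

# awk splits on any run of whitespace, so it also handles space-delimited files.
awk '{print $2}' "$workdir/toy.bim" > "$workdir/snps_awk.txt"

# Count the extracted identifiers (one per line).
wc -l < "$workdir/snps.txt"
```

With the real data, the same cut command would be run on mergeddata.bim instead of the toy file.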

NOTE: Before merging files, it is always important to apply QC filters to each data set separately. There are many tutorials on this that you can probably find on this forum :)

7.1 years ago by GabrielMontenegro ▴ 670

Thank you for your answer.

I used read.delim to read my nine .bim files, so I now have nine data frames.

Then I took the intersection of their second columns:

common.snps=Reduce(intersect,list(df9[,2],df8[,2], df7[,2],df6[,2],df5[,2],df4[,2],df3[,2],df2[,2],df1[,2]))

Then I used the following command.

write.table(common.snps, file="list.snps", sep="\t", col.names=F, row.names=F, quote=F )
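The same intersection can be sketched on the command line, assuming one .bim file per population. The pop1.bim to pop3.bim files below are toy stand-ins for the nine real files, and all file names are hypothetical. The idea is to count in how many files each ID occurs and keep the IDs whose count equals the number of files. Since .bim files have no header line, every row is treated as data.

```shell
# Three toy per-population .bim files (tab-delimited, no header); illustration only.
workdir=$(mktemp -d)
printf '1\trs1\t0\t1000\tA\tG\n1\trs2\t0\t2000\tC\tT\n1\trs3\t0\t3000\tG\tA\n' > "$workdir/pop1.bim"
printf '1\trs2\t0\t2000\tC\tT\n1\trs3\t0\t3000\tG\tA\n1\trs4\t0\t4000\tT\tC\n' > "$workdir/pop2.bim"
printf '1\trs3\t0\t3000\tG\tA\n1\trs2\t0\t2000\tC\tT\n1\trs5\t0\t5000\tA\tC\n' > "$workdir/pop3.bim"

n_files=0
: > "$workdir/all_ids.txt"
for bim in "$workdir"/pop*.bim; do
    # sort -u so an ID repeated inside one file is counted only once for that file.
    cut -f2 "$bim" | sort -u >> "$workdir/all_ids.txt"
    n_files=$((n_files + 1))
done

# An ID present in every file appears exactly n_files times in the pooled list.
sort "$workdir/all_ids.txt" | uniq -c |
    awk -v n="$n_files" '$1 == n {print $2}' > "$workdir/common_snps.txt"

cat "$workdir/common_snps.txt"
```

Like the R call, this matches on the variant ID only, so it does not check whether positions or alleles agree between files.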

I found the right number of SNPs, but the format of the file is ASCII text. Now I should check for duplicates, but I don't know how.

After I find the duplicated SNPs and remove them, I will do LD pruning. How can I prepare the set of overlapping SNPs for that?
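One way to check for duplicates is sketched below on toy data (the contents of list.snps and toy.bim here are invented for illustration). Note that R's intersect() already returns unique values, so a list produced by Reduce(intersect, ...) should not contain repeated IDs; what can still occur is two different IDs at the same chromosome and position, which the second check looks for. A plain-text file with one ID per line is also the format PLINK's --extract option reads, so ASCII text is not a problem in itself.

```shell
workdir=$(mktemp -d)
# Toy ID list with one repeated ID; illustration only.
printf 'rs1\nrs2\nrs2\nrs3\n' > "$workdir/list.snps"
# Toy .bim in which rs2 and rs9 share chromosome 1, position 2000.
printf '1\trs1\t0\t1000\tA\tG\n1\trs2\t0\t2000\tC\tT\n1\trs9\t0\t2000\tC\tT\n' > "$workdir/toy.bim"

# IDs that occur more than once in the list (uniq -d prints one copy of each).
sort "$workdir/list.snps" | uniq -d > "$workdir/dup_ids.txt"

# chromosome:position keys that occur more than once in the .bim file.
awk '{print $1 ":" $4}' "$workdir/toy.bim" | sort | uniq -d > "$workdir/dup_pos.txt"

# De-duplicated ID list, one ID per line.
sort -u "$workdir/list.snps" > "$workdir/list_unique.snps"
```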

7.1 years ago by genogeno • 0

You should add this as a comment and not as a separate answer to your question. As I said in my answer, your merged file mergeddata.bim contains the intersection of all the SNPs. There shouldn't be any duplicated SNPs in that merged data set. Maybe check the PLINK website and read up on the merge command you used to see what it does.

7.1 years ago by GabrielMontenegro ▴ 670

Thank you very much. I did it as you said, and I now have a snps.txt file. Next, I should do LD pruning. I will try to prune SNPs based on their r2. I guess I need some parameters. How can I define them?

7.1 years ago by genogeno • 0

I'm glad it worked. How to prune SNPs depends on what you want to do with the pruned data. Is it for a PCA, for instance? In that case, this setting is common: --indep-pairwise 50 10 0.2. However, this question is not related to your original post, so you should either post a new one or try to look for the answer by googling it :)
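For reference, the three numbers given to --indep-pairwise are the window size (in variants), the step by which the window is shifted, and the pairwise r2 threshold above which one variant of a pair is pruned. A rough sketch of how this could be run with PLINK 1.9 follows; it assumes the fileset prefix mergeddata from the question, and the output names pruned and mergeddata_pruned are hypothetical. The guard only runs PLINK if it is installed and the .bed file is present.

```shell
# Assumed input prefix (mergeddata.bed/.bim/.fam); output names are hypothetical.
prefix=mergeddata
window=50   # window size, in number of variants
step=10     # number of variants to shift the window at each step
r2=0.2      # pairwise r^2 threshold

if command -v plink >/dev/null 2>&1 && [ -f "$prefix.bed" ]; then
    # Writes pruned.prune.in (variants to keep) and pruned.prune.out (variants removed).
    plink --bfile "$prefix" --indep-pairwise "$window" "$step" "$r2" --out pruned
    # Builds a new binary fileset restricted to the retained variants.
    plink --bfile "$prefix" --extract pruned.prune.in --make-bed --out "${prefix}_pruned"
else
    echo "plink or $prefix.bed not found; nothing was run" >&2
fi
```

The pruned fileset would then be the input for the PCA step.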

7.1 years ago by GabrielMontenegro ▴ 670

Actually, the LD pruning is complete. I used --indep-pairwise 50 5 0.2. Thank you very much for your suggestions.

I asked another question about the continuation of this topic in another post. Maybe you can help out there :)