I tried the following command to extract a bfile using a given snp list and a given subject ID list:
plink2 \
--bgen xxx.bgen \
--sample xxx.sample \
--extract input.snp_list \
--make-bed \
--out output
The input.snp_list is the second column of another file, input.bim. I want the alleles in output.bim to be identical to those in input.bim, but this is currently not the case, because output.bim contains duplicate SNPs (see below),
19 rs75617501 0 44544721 T C
19 rs75617501 0 44544721 T G
and
19 rs573790568 0 44678292 TTG T
19 rs573790568 0 44678292 T TTGTG
19 rs573790568 0 44678292 T TTGTGTG
whereas rs75617501 and rs573790568 have no duplicates in input.bim, where their alleles are
19 rs75617501 0 44544721 C T
and
19 rs573790568 0 44678292 TTG T
So I wonder if there is a way to remove the duplicated snps when extracting bfile so that only snps with alleles matching input.bim are kept. For example, after removing the duplicate snps I would only have the following in my output.bim:
19 rs75617501 0 44544721 T C
19 rs573790568 0 44678292 TTG T
Thank you!
Since you are using plink, you can replace each SNP ID (rsID) with POS:A1:A2 in the plink files and then extract only the SNPs with the alleles you want. I hope this makes sense.
Thank you! Could you give an example of the 'update' command? I think it should be one of the commands listed at https://www.cog-genomics.org/plink/1.9/data but I'm unsure which one to use.
You should use the following command in this page.
Thank you! I still have a question: rsID.lst seems to be a two-column file consisting of rsID and POS, so I'm not sure how the duplicated SNPs will be removed according to the alleles in input.bim. I want to generate a bfile in which, among all duplicated SNPs, only the SNP whose alleles exactly match input.bim is kept (for example, input.bim contains '19 rs573790568 0 44678292 TTG T', so in the final bfile only '19 rs573790568 0 44678292 TTG T' is kept, and '19 rs573790568 0 44678292 T TTGTG' and '19 rs573790568 0 44678292 T TTGTGTG' are discarded).
This is usually done with plink 2.0's --set-all-var-ids flag (https://www.cog-genomics.org/plink/2.0/data#set_all_var_ids).
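To make the --set-all-var-ids suggestion concrete, here is a rough sketch rather than a tested recipe. It assumes the chromosome codes in xxx.bgen match those in input.bim (e.g. "19", not "chr19"), and it lists each input.bim variant in both allele orders, because A1/A2 order in a .bim need not match REF/ALT order in the .bgen. The helper names (bim_to_allele_ids, extract_matching) and the intermediate names keep_ids.txt and renamed are made up for illustration. The rename and the extract are done in two separate runs so that nothing depends on plink2's internal order of operations.

```shell
#!/bin/sh
# Print CHR:POS:A1:A2 for every variant in a .bim file, in both allele orders.
# (.bim columns: 1=chromosome, 2=variant ID, 3=cM, 4=bp position, 5=A1, 6=A2)
bim_to_allele_ids() {
  awk '{ print $1 ":" $4 ":" $5 ":" $6; print $1 ":" $4 ":" $6 ":" $5 }' "$1"
}

# Hypothetical wrapper; file names follow the question, "renamed" is an
# intermediate prefix chosen only for this sketch.
extract_matching() {
  bim_to_allele_ids input.bim > keep_ids.txt

  # Step 1: give every variant an allele-aware ID (CHR:POS:REF:ALT).
  plink2 \
    --bgen xxx.bgen \
    --sample xxx.sample \
    --set-all-var-ids '@:#:$r:$a' \
    --make-pgen \
    --out renamed

  # Step 2: keep only the variants whose position and alleles are in input.bim.
  plink2 \
    --pfile renamed \
    --extract keep_ids.txt \
    --make-bed \
    --out output
}
```

Calling extract_matching in the directory holding the files should leave a single record per position in output.bim, but with IDs such as 19:44678292:TTG:T instead of rsIDs. If the rsIDs are needed again, a two-column mapping file built from input.bim together with --update-name is one possible way to restore them. Depending on the plink2 build, --bgen may also need a REF/ALT mode such as ref-first; the command above simply mirrors the one in the question.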