Dear all, good afternoon. I have example SNP genotyping data like this:
LOCUS POS REF ALLELE 2000 3000
MC10 713 T C NA NA
MC10 760 T C NA NA
Now I want to replace MC10 with SNP1, SNP2 and so on down the file, and insert a CHR column with a dummy chromosome number, say 1. I would also like to replace A with A/A, T with T/T, G with G/G and C with C/C in both the REF and ALLELE columns, and to replace NA with the value from the REF column (e.g. NA in the 2000 column becomes T/T). Finally, I want to concatenate the LOCUS, CHR and POS columns with _ into a new column, like SNP1_1_713. I would like to end up with data like this:
LOCUS CHR POS MAR REF ALLELE 2000 3000
SNP1 1 713 SNP1_1_713 T/T C/C T/T T/T
SNP2 1 760 SNP2_1_760 T/T C/C T/T T/T
I tried gsub, mutate etc. from the tidyverse and dplyr packages, and also tried the within function, but was unsuccessful. Please find my example data here
Can anyone help me get my expected results with R? Any help in this regard will be highly appreciated. Thanks in advance.
Try this code (assuming that there is no special formatting of sheet 1 values)
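A minimal sketch of one way to do the requested transformation in base R. It is built from the two example rows in the question rather than from test1.xlsx, and the names `df`, `out`, `sample_cols` and `homozygous` are assumptions made for illustration:

```r
# Example rows copied from the question. colClasses = "character" stops R
# from reading the allele "T" as the logical TRUE; the text NA still
# becomes a real missing value.
df <- read.table(text = "
LOCUS POS REF ALLELE 2000 3000
MC10 713 T C NA NA
MC10 760 T C NA NA
", header = TRUE, check.names = FALSE, colClasses = "character")

# Everything that is not an annotation column is treated as a sample column.
sample_cols <- setdiff(names(df), c("LOCUS", "POS", "REF", "ALLELE"))

out <- df
out$LOCUS <- paste0("SNP", seq_len(nrow(out)))            # SNP1, SNP2, ...
out$CHR   <- "1"                                          # dummy chromosome
out$MAR   <- paste(out$LOCUS, out$CHR, out$POS, sep = "_")

# Turn a single allele such as "T" into "T/T", keeping missing values missing.
homozygous <- function(x) ifelse(is.na(x), NA_character_, paste0(x, "/", x))
out$REF    <- homozygous(out$REF)
out$ALLELE <- homozygous(out$ALLELE)

# Fill missing genotypes in each sample column with the (doubled) REF value.
for (col in sample_cols) {
  missing <- is.na(out[[col]])
  out[[col]][missing] <- out$REF[missing]
}

# Put the columns in the order shown in the question.
out <- out[, c("LOCUS", "CHR", "POS", "MAR", "REF", "ALLELE", sample_cols)]
print(out)
```

For the two example rows this should reproduce the table requested in the question. With a real Excel file the data frame would come from a reader function instead of `read.table(text = ...)`, and the list of annotation columns would need to match that file.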
output from test1.xlsx:
Dear CPAD, good morning.
I have some more columns after PDPE_021 that are not recognised in R and do not appear in the final output, i.e. test1. I get this error: Error in is_string(y) : object 'POLYMORPHISM' not found. POLYMORPHISM is the last column and I need it in the output for further processing. Sometimes I get no error, but NA is not replaced with the REF column alleles. If I run your code on the example data above, without the extra columns, I get no error. Can you please let me know where it went wrong? Thanks in advance.
Can you update the Dropbox Excel sheet with the exact columns, column names and dummy data? @blacktomato27
Dear cpad0112, good morning. Thanks a lot for your willingness to help me. Please find the updated Excel sheet at the link below.
https://www.dropbox.com/scl/fi/mxxixvw4rq1t88r3tvsho/New-Microsoft-Excel-Worksheet.xlsx?dl=0&rlkey=fj078mygejeh1l6g38onb09el
Thank you very much. With kind regards
@blacktomato27 There is no column named "POLYMORPHISM" and I could not find such text in the Excel sheet. With the data given in Excel sheet 1, please try the following code:
Output from test1.xlsx is:
Dear cpad0112, good afternoon. Thanks for your kind and prompt reply to my request. I do not know why the step below does not work on my data, although it works on the example data I provided on Dropbox:
mutate_at(vars(-c("LOCUS":"ALLELE","chr","MAR","MAF":"TOPsegsite[A/B]")),~str_replace_all(., "NA", REF)) %>%
NA is not replaced by the REF allele. Do you think my file may have a formatting problem, i.e. hidden formats? Is there any way to remove such formatting problems? Anyway, thanks a lot for the help you have given me. With kind regards.
@blacktomato27 I cannot say anything about the file you have. If you don't mind, please join the Biostars Slack channel and share the file with me. My educated guess is the presence of white space or special characters in the column names.
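A rough sketch of how that guess could be checked and worked around in base R. The data frame `df` and its column names are hypothetical, made up to imitate a sheet with untidy headers. The sketch also covers a second possible cause, stated here as an assumption about the file: a real missing value (NA) is not the text "NA", so a pattern-based replacement of "NA" leaves real missing values untouched.

```r
# Hypothetical data imitating a sheet with stray spaces in headers and cells.
df <- data.frame(
  "REF " = c("T", "T"),
  " S1"  = c("NA", NA),    # the text "NA" and a real missing value
  "S2"   = c(" NA ", "C"), # the text "NA" padded with spaces
  check.names = FALSE, stringsAsFactors = FALSE
)

# Print the names in brackets so leading/trailing spaces become visible.
print(sprintf("[%s]", names(df)))

# Strip white space from the column names and from every character cell.
names(df) <- trimws(names(df))
df[] <- lapply(df, function(x) if (is.character(x)) trimws(x) else x)

# Replace both real NA and the text "NA" with the REF value of the same row.
sample_cols <- setdiff(names(df), "REF")
for (col in sample_cols) {
  missing <- is.na(df[[col]]) | df[[col]] == "NA"
  df[[col]][missing] <- df$REF[missing]
}
print(df)
```

If the bracketed names show extra spaces or unexpected characters, column ranges and names used in later steps will not match until the names are cleaned.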
How did you use gsub? The example data in the Dropbox doesn't reflect the examples in your question. Would you be able to edit your question so we could offer some insight?
Dear cpad0112, good morning.
Thanks a lot for your help and the valuable time you spent helping me. Your solution works perfectly, as I expected. Once again, thank you very much. With kind regards