Hi,
I have parsed my VCF files containing SNPs into the table below:
CHROMOSOME POSITION REF ALT SAMPLE
1 782112 G A LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1 1026918 C T LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1 1133283 C T LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1 1431511 G A LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1 1742395 C T LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1 1864994 G A LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1 1914766 C T LP6005334-DNA_H01_vs_LP6005333-DNA_H01
But I have duplicated mutations; for example, in this sample:
~$ grep 152280536 file.txt
1 152280536 T C LP6008334-DNA_C02_vs_LP6008335-DNA_G01
1 152280536 T C LP6008334-DNA_C02_vs_LP6008335-DNA_G01
I am not sure at which step of the data processing this happens, or how I could remove the duplicated mutations.
Any help, please?
There are a million ways to do this that are a simple Google search away:
"unix remove duplicated lines", e.g.
https://unix.stackexchange.com/questions/30173/how-to-remove-duplicate-lines-inside-a-text-file
Use bash uniq:
`sort | uniq` is not really necessary when `sort -u` is an option, I think.
Sorry, but some of these solutions destroyed my file
For example
by
R says that
or the command flags non duplicates too
Invest some time in understanding what the options you’ve been given actually do. Don’t just blindly copy from the web.
`sort` has a lot of capabilities when used well. You might wish to set sorting keys (columns to sort with) using the `-k M,N` option. You can sort by columns that will be the same among identical rows and then use the `-u` option to pick only unique lines.
If you read in the file as a `data.frame` or `data.table` into R (!), you can use `unique(my_dataframe)`, no need to sort, if I recall correctly (but it may take some time).
Which function of R returns the error message? It's somewhat difficult to believe it's related to the sorting and uniq'ing, though.
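For reference, the sort-key approach could look roughly like the sketch below. It assumes a whitespace-separated table with a single header line, like the one in the question; the temporary directory and the output name `file.dedup.txt` are made up for the demo. Two things that commonly "destroy" a file are guarded against here: the header is kept out of the sort, and the result is written to a new file rather than redirected onto the input.

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C   # byte-wise collation, so results do not depend on the locale

# Hypothetical demo input; in practice this would be your own parsed table.
workdir=$(mktemp -d)
in="$workdir/file.txt"
out="$workdir/file.dedup.txt"

cat > "$in" <<'EOF'
CHROMOSOME POSITION REF ALT SAMPLE
1 782112 G A LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1 152280536 T C LP6008334-DNA_C02_vs_LP6008335-DNA_G01
1 152280536 T C LP6008334-DNA_C02_vs_LP6008335-DNA_G01
1 1026918 C T LP6005334-DNA_H01_vs_LP6005333-DNA_H01
EOF

# Copy the header first so it is not sorted into the body.
head -n 1 "$in" > "$out"

# Sort the body on all five columns (position numerically) and keep one
# line per distinct key combination. Chromosome names are compared as
# text here, so "10" sorts before "2".
# The output goes to a NEW file: `sort ... file.txt > file.txt` would
# truncate the input before sort gets to read it.
tail -n +2 "$in" | sort -u -k1,1 -k2,2n -k3,3 -k4,4 -k5,5 >> "$out"

cat "$out"
```

Listing all five columns as keys matters: with `-u`, two lines count as duplicates when their keys compare equal, so sorting on chromosome and position alone would also drop distinct samples that share a site.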
You are right.
I was running the `dndscv` R package on the data after `unique` (or any of the other suggested commands for removing duplicates), and that returned an error.
I have also tried this presumably altered data with other Python packages such as OncodriveCLUSTL and OncodriveFML, and they returned errors as well.
I do not see the usage of `unique` in your example.
Does that mean the error you're reporting ("X mutations have a wrong reference base") is independent of any of the unique commands? Did you check the help of `dndscv()`?
Everything happened when I tried to remove duplicates; I ran these packages on the original file successfully.