Question: Removing duplicated mutations from a text file
F wrote, 4 weeks ago, from Iran:

Hi,

I have parsed my VCF files containing SNPs as below:

CHROMOSOME  POSITION    REF ALT SAMPLE
1   782112  G   A   LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1   1026918 C   T   LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1   1133283 C   T   LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1   1431511 G   A   LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1   1742395 C   T   LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1   1864994 G   A   LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1   1914766 C   T   LP6005334-DNA_H01_vs_LP6005333-DNA_H01

But I have duplicate mutations; for example, in this sample:

~$ grep 152280536 file.txt 
 1    152280536    T    C    LP6008334-DNA_C02_vs_LP6008335-DNA_G01
 1    152280536    T    C    LP6008334-DNA_C02_vs_LP6008335-DNA_G01

I am not sure at which step of data processing the duplicates were introduced, or how to remove them.

Any help would be appreciated.

mutation dndscv wgs vcf • 361 views
modified 4 weeks ago by zx8754 • written 4 weeks ago by F
There are a million ways to do this, all a simple Google search away:

"unix remove duplicated lines"

e.g.

https://unix.stackexchange.com/questions/30173/how-to-remove-duplicate-lines-inside-a-text-file
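One order-preserving approach from that thread is the classic awk idiom, which keeps the first occurrence of every line without sorting, so a header line stays on top. A minimal sketch (`demo.txt` is a hypothetical stand-in for your parsed file):

```shell
# Create a small demo file with one duplicated data line
printf 'CHROMOSOME\tPOSITION\tREF\tALT\tSAMPLE\n1\t152280536\tT\tC\tS1\n1\t152280536\tT\tC\tS1\n1\t782112\tG\tA\tS1\n' > demo.txt

# Keep the first occurrence of every whole line, preserving original order
# (unlike `sort | uniq`, this never moves the header into the data)
awk '!seen[$0]++' demo.txt > deduped.txt

cat deduped.txt
```

This deduplicates on the entire line, so rows that differ only in the sample column are all kept.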

written 4 weeks ago by jrj.healey
Use the Unix uniq tool (note that uniq only removes adjacent duplicate lines, which is why the input is sorted first):

sort -r file.txt | uniq > output.txt
written 4 weeks ago by ahaswer
sort | uniq is not really necessary when sort -u is an option, I think.

written 4 weeks ago by RamRS
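For whole-line deduplication the two forms do produce identical output; a quick sketch verifying the equivalence on a toy file:

```shell
# Toy input with duplicated lines
printf 'b\na\nb\na\n' > lines.txt

# Classic pipeline vs. the built-in -u flag
sort lines.txt | uniq > a.txt
sort -u lines.txt > b.txt

# The two results are identical
diff a.txt b.txt && echo identical
```

Either way, the whole file gets sorted, so a column-header line will be sorted in with the data rather than staying on line 1.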

Sorry, but some of these solutions destroyed my file.

For example

by

 sort -r file.txt | uniq > output.txt

R says that

 2178 (76%) mutations have a wrong reference base. Please confirm that you are not running data from a different assembly or species.

or the command flags non-duplicates too:

(py3venv) [fi1d18@cyan01 ~]$ awk '!a[$3]++' fil.txt
CHROMOSOME      POSITION        REF     ALT     SAMPLE
1       782112  G       A       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1       1026918 C       T       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1       2753944 A       G       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1       2856299 T       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
(py3venv) [fi1d18@cyan01 ~]$ awk '!a[$1]++' fil.txt
CHROMOSOME      POSITION        REF     ALT     SAMPLE
1       782112  G       A       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
10      611690  G       A       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
11      445671  C       T       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
12      243344  T       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
13      19025167        A       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
14      19075451        T       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
15      20014428        A       T       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
16      442771  G       A       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
17      45324   T       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
18      85012   G       A       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
19      709913  G       T       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
2       404136  T       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
20      215046  C       A       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
21      9651465 T       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
22      16873416        A       G       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
3       160445  G       T       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
4       207726  G       A       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
5       213846  C       T       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
6       300572  A       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
7       324430  G       A       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
8       187201  T       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
9       846272  A       G       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
X       419002  C       T       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
Y       5531500 G       A       LP6007600_vs_LP6007599
(py3venv) [fi1d18@cyan01 ~]$
written 4 weeks ago by F
Invest some time in understanding what the options you’ve been given actually do. Don’t just blindly copy from the web.

written 4 weeks ago by jrj.healey

sort has a lot of capabilities when used well. You might wish to set sort keys (the columns to sort on) using the -k M,N option. You can sort on the columns that are identical among duplicate rows and then use the -u option to keep only unique lines.

written 4 weeks ago by RamRS
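The key-based approach above can be sketched as follows, with the header set aside first so it survives the sort (`muts.txt` and the column choices are illustrative assumptions, not your actual file):

```shell
# Demo file: header + rows where columns 1-4 identify a mutation
printf 'CHROM\tPOS\tREF\tALT\tSAMPLE\n1\t100\tG\tA\tS1\n1\t100\tG\tA\tS1\n1\t200\tC\tT\tS2\n' > muts.txt

# Keep the header aside, then deduplicate the data rows on the
# mutation key (columns 1-4, with the position sorted numerically)
{ head -n 1 muts.txt; tail -n +2 muts.txt | sort -u -k1,1 -k2,2n -k3,3 -k4,4; } > muts_dedup.txt

cat muts_dedup.txt
```

Note that with explicit keys, -u compares only those key columns, so it would also collapse identical mutations coming from different samples; include the sample column in the keys if that is not what you want.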

If you read the file into R (!) as a data.frame or data.table, you can use unique(my_dataframe); no need to sort, if I recall correctly (though it may take some time).

Which R function returns the error message? It's somewhat difficult to believe it's related to the sorting and uniq'ing, though.

written 4 weeks ago by Friederike
You are right.

I was using the dndscv R package with data processed by unique (or any of the suggested deduplication commands), which returned this error

> dndsout = dndscv(mut)
[1] Loading the environment...
[2] Annotating the mutations...
    Note: 3800 mutations removed for exceeding the limit of mutations per gene per sample
Error in dndscv(mut) : 
  2178 (76%) mutations have a wrong reference base. Please confirm that you are not running data from a different assembly or species.

I have also tried this likely-mangled data with other Python packages, like OncodriveCLUSTL and OncodriveFML, which returned errors like

position = int(line['POSITION'])
ValueError: invalid literal for int() with base 10: 'POSITION'
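That ValueError is consistent with the header line having been sorted into the body of the file, so the parser hits the literal string 'POSITION' where it expects a number. A reproducible sketch of the symptom (toy file, hypothetical names):

```shell
# Toy file: header first, then data rows
printf 'CHROMOSOME\tPOSITION\n1\t200\n2\t100\n' > in.txt

# A whole-file sort moves the header line (LC_ALL=C for a deterministic order:
# digits sort before uppercase letters, so the header sinks below the data)
LC_ALL=C sort in.txt > sorted.txt

# Show which line the header ended up on
grep -n '^CHROMOSOME' sorted.txt
```

Checking the output of `grep -n '^CHROMOSOME'` on the real file would confirm whether the header moved.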
written 4 weeks ago by F

I do not see the usage of unique in your example.

Does that mean the error you're reporting ("X mutations have a wrong reference base") is independent of (any) of the unique commands? Did you check the help of dndscv()?

written 4 weeks ago by Friederike

Everything happened when I tried to remove duplicates; I ran these packages on the original file successfully.

written 4 weeks ago by F
Friederike wrote, 4 weeks ago, from the United States:

then:

  1. read in the original file into R
  2. use unique(df)
  3. use that dndscv function

and show the entire code and error messages here.

written 4 weeks ago by Friederike

Sorry, after using unique(df) the software worked; thank you.

written 4 weeks ago by F
zx8754 wrote, 4 weeks ago, from London:

The error message from dndscv is not about a uniqueness issue. Do not remove duplicates; dndscv keeps only one mutation per gene, regardless of the number of samples.

The error is about the genome assembly/species, i.e., the positions and ref/alt alleles do not match the dndscv database. Because it is not the same genome, dndscv is discovering more mutations than expected: "exceeding the limit of mutations per gene".

From the manual, ensure you are using the "GRCh37/hg19 human genome":

By default, dNdScv assumes that mutation data is mapped to the GRCh37/hg19 assembly of the human genome. Users interested in trying dNdScv on a different set of transcripts, a different assembly or a different species can follow this tutorial.

modified 4 weeks ago • written 4 weeks ago by zx8754
Hello. Just a small clarification regarding the sentence:

Do not remove duplicates, dndscv keeps only one mutation per gene, regardless of number of sample numbers

dNdScv does not keep only one mutation per gene. By default, the dndscv function keeps up to 3 mutations per gene per sample. This can be changed using the optional parameter "max_muts_per_gene_per_sample" (see: ?dndscv). The reason for this is to protect dndscv against inadequate annotation of clusters of mutations by the user. For example, imagine that a single kataegis event introduces 10 mutations in a gene in a given sample. This may look like high mutation recurrence in the gene, while in reality it is just a single event. Such events should not be annotated as independent mutations but as one complex mutation, and the default "max_muts_per_gene_per_sample=3" protects the method against this misannotation.

So, dndscv does treat each row in the mutation file as an independent mutation, and if you have duplicate entries in the mutation file, these mutations will be double-counted, which will affect the statistics of the model. That is why dndscv issues a warning when it detects multiple identical mutations. Using unique(df) is a good idea if the dataset contains duplicated rows.

written 27 days ago by Inigo Martincorena

Thanks a lot

I used unique(mutations)

Then

> dndsout = dndscv(mutations)
[1] Loading the environment...
[2] Annotating the mutations...
    Note: 6 mutations removed for exceeding the limit of mutations per gene per sample
[3] Estimating global rates...
[4] Running dNdSloc...
[5] Running dNdScv...
    Regression model for substitutions (theta = 3.23).
    Regression model for indels (theta = 0.123)
Warning messages:
1: In dndscv(total) :
  Mutations observed in contiguous sites within a sample. Please annotate or remove dinucleotide or complex substitutions for best results.
2: In dndscv(total) :
  Same mutations observed in different sampleIDs. Please verify that these are independent events and remove duplicates otherwise.
> 
>

I know I have duplicated mutations in the same samples; as I am too bad at coding, I just hope using unique in R removes the duplicates correctly.

However, I am not sure how to avoid this warning:

Mutations observed in contiguous sites within a sample. Please annotate or remove dinucleotide or complex substitutions for best results.
written 27 days ago by F
unique() will only remove lines that are absolutely identical. The warning message indicates that the putative duplicated entries are not entirely identical because they contain different sample ID entries.

I don't know what your data.frame looks like, but if you are sure you want to remove the presumed duplicates (1), you might be able to simply assign the sample ID to the rownames of the data.frame and subsequently ignore the sample ID column in the unique call:

> df <- data.frame(sample = c("A","B","C"), value = c(10,10,100), mutation = c("x","x","y") )
> df
  sample value mutation
1      A    10        x
2      B    10        x
3      C   100        y
> unique(df) ## won't remove anything because all lines are unique thanks to the sample ID
  sample value mutation
1      A    10        x
2      B    10        x
3      C   100        y
> row.names(df) <- df$sample
# now, ignore the sample ID column
> unique(df[, c("value","mutation")])
  value mutation
A    10        x
C   100        y

Note how this will leave you completely blind to which samples that mutation was found in!

If you want to somewhat keep track of that, this might be a way (sorry, I don't know of a data.frame way, although I'm sure there is one):

> library(data.table)
> library(magrittr)
> data.table(df) %>% .[, paste(sample, collapse = ","), mutation]
   mutation  V1
1:        x A,B
2:        y   C

(1) Are you sure that these are duplicates you want to remove? Maybe these mutations are simply fairly common and can be found in numerous samples and are totally legit!

modified 27 days ago • written 27 days ago by Friederike
Friederike is correct that unique(df) will only remove duplicated rows, which may be your problem (as you say); in that case, using unique(df) will have corrected it. The warning from dndscv is slightly confusing, since it is also triggered by duplicated rows (I should make that clearer).

Friederike is also correct that identical mutations, particularly driver mutations (e.g. BRAF V600E), can appear in multiple samples. The warning only serves to remind the user to make sure that every entry in the mutation table is an independent mutation. Two identical mutations in the cancers of two different patients are genuinely independent events and should not be collapsed into one. However, duplicated rows or multiple observations of the same mutation in phylogenetically-related areas of the same tumour should be counted only once. It is up to the user to ensure that all entries in the table are independent.

written 27 days ago by Inigo Martincorena

Sorry, you are mentioning exactly what makes me worried about removing duplicates, as I am not sure what I am removing.

This is the link to my whole data:

https://www.dropbox.com/s/nf0kuo8573zhnio/indel%20%281%29.txt?dl=0

I saw that after using unique I am removing a lot of data, though.

written 27 days ago by F
I've shown you all the commands to explore this issue...

> library(magrittr)
> library(data.table)
> indels <- fread("indels.txt", fill = TRUE)
> dim(indels)
[1] 24045     5

> dim(unique(indels))
[1] 22549     5 # there are 1496 true duplicates (= identical lines)

>  indels <- unique(indels)
> indels[, paste(sampleID, collapse = ","), by = c("chr","pos","ref","mut")]
       chr       pos ref mut                                     V1
    1:   1     56987  CA   C LP6005334_DNA_H01_vs_LP6005333_DNA_H01
    2:   1   4133415  AG   A LP6005334_DNA_H01_vs_LP6005333_DNA_H01
    3:   1   4611030   A  AC LP6005334_DNA_H01_vs_LP6005333_DNA_H01
    4:   1   4872820   G  GA LP6005334_DNA_H01_vs_LP6005333_DNA_H01
    5:   1   6092644   A  AG LP6005334_DNA_H01_vs_LP6005333_DNA_H01
   ---                                                             
22425:   X 128553587   A ATT LP6008334_DNA_C02_vs_LP6008335_DNA_G01
22426:   X 130518455   A  AT LP6008334_DNA_C02_vs_LP6008335_DNA_G01
22427:   X 132416170  CA   C LP6008334_DNA_C02_vs_LP6008335_DNA_G01
22428:   X 136291349  AT   A LP6008334_DNA_C02_vs_LP6008335_DNA_G01
22429:   X 145516886  AT   A LP6008334_DNA_C02_vs_LP6008335_DNA_G01
# --> there's 22429 left after this unique step ignoring the sample IDs, i.e. about 
# 120 mutations that are found in more than one sample

# to see which mutations are found in more than one sample, you can grep for the comma that
# we add in the paste command
> indels[, paste(sampleID, collapse = ","), by = c("chr","pos","ref","mut")] %>% .[ grepl(",", V1)]
     chr       pos ref                       mut
  1:   1 215653764  AT                         A
  2:   2  40893359  GT                         G
  3:   2  78324046 CAA                         C
  4:   2 119219369  TA                         T
  5:   7  14143149  CA                         C
 ---                                            
114:  13  54027680  GT                         G
115:   X 131481051   A                        AC
116:  11  49527629   T                        TA
117:  13  46417165   C CTGTGTGTGTGTGTGTGTGTGTGTG
118:  19  56581430   C                        CT
                                                                                V1
  1: LP6005334_DNA_H01_vs_LP6005333_DNA_H01,LP6005935_DNA_G02_vs_LP6005979_DNA_C01
  2: LP6005334_DNA_H01_vs_LP6005333_DNA_H01,LP6008334_DNA_D02_vs_LP6008335_DNA_H01
  3: LP6005334_DNA_H01_vs_LP6005333_DNA_H01,LP6005935_DNA_G02_vs_LP6005979_DNA_C01
  4: LP6005334_DNA_H01_vs_LP6005333_DNA_H01,LP6008460_DNA_G03_vs_LP6008340_DNA_C05
  5: LP6005334_DNA_H01_vs_LP6005333_DNA_H01,LP6008336_DNA_H01_vs_LP6008333_DNA_H01
 ---                                                                              
114: LP6008460_DNA_F02_vs_LP6008340_DNA_E05,LP6008334_DNA_C02_vs_LP6008335_DNA_G01
115: LP6008460_DNA_F02_vs_LP6008340_DNA_E05,LP6008334_DNA_C02_vs_LP6008335_DNA_G01
116: LP6008460_DNA_G03_vs_LP6008340_DNA_C05,LP6008334_DNA_C02_vs_LP6008335_DNA_G01
117: LP6008460_DNA_G03_vs_LP6008340_DNA_C05,LP6008334_DNA_C02_vs_LP6008335_DNA_G01
118: LP6008460_DNA_G03_vs_LP6008340_DNA_C05,LP6008334_DNA_C02_vs_LP6008335_DNA_G01

So, now you do know what you're losing. For example, the insertion of CTGTGTGTGTGTGTGTGTGTGTGTG on chr 13 is found in samples LP6008460_DNA_G03_vs_LP6008340_DNA_C05 and LP6008334_DNA_C02_vs_LP6008335_DNA_G01.

The first run of unique should definitely be kept to remove the true duplicated lines (of identical content); whether you want to keep or remove the duplicated mutations is really a decision you have to make based on what the final question is.

modified 27 days ago • written 27 days ago by Friederike

Thank you for bearing with me.

written 27 days ago by F

Sorry, but I am sure the dndscv error is caused by trying to remove duplicates, because the original file works well, no doubt.

written 4 weeks ago by F
Powered by Biostar version 2.3.0