Finding the largest possible combination of mutations
2
1
Entering edit mode
8.4 years ago
mkulecka ▴ 360

I have a large dataset, with about 81 tumors and 10 000 mutations. I would like to know largest possible number of shared mutations between tumors

        tum1 tum2 tum3 tum4
mut1    1    1    0    1
mut2    1    1    0    1
mut3    0    0    1    1
mut4    1    1    0    1

For instance in the example above the largest possible number of mutations is three between tumors 1, 2 and 4. The problem is I would like to know how to do this in computationally effective way - the question is non-trivial for larger datasets. I tried using powerset in python (https://docs.python.org/2/library/itertools.html#recipes) but it's not computationally effective.

next-gen programming algorithms • 1.8k views
ADD COMMENT
1
Entering edit mode

Hi,

I'm wondering if the possible values in your matrix are only 0 or 1 ? If so, have your tried to consider your matrix as a vector of binary numbers that can be bitwise compared among themselves ?

Can we have a bit more details on the way you perform this actually ?

ADD REPLY
0
Entering edit mode

I have created a powerset from mutations which appeared more than once and checked if a given combination is present in more than one sample - as you can imagine this is not computationally effective,

In this case, the possible values in matrix are only 0 and 1, that is correct. However, I have no idea how to implement what you are suggesting in python or R.

ADD REPLY
1
Entering edit mode
8.4 years ago

In R:

# produce a matrix of mutations (logical matrix),
# samples in columns, mutations in rows
mutations = matrix(rnorm(100)>1,ncol=10)

If we convert our matrix to a data frame, we can do things like this:

mutationsdf = data.frame(mutations)
Reduce('&',mutationsdf[,c(1,2,3)])

This last line will give us a logical vector of the mutations present in all three of the first three samples. Change the sample numbers as you see fit.

ADD COMMENT
0
Entering edit mode
8.4 years ago
Thibault D. ▴ 700

In Python (2 or 3), given the following table:

      tum1 tum2 tum3 tum4
mut1    1    1    0    1
mut2    1    1    0    1
mut3    0    0    1    1
mut4    1    1    0    1

We can write each sample as binary numbers: tum1 = 1101 (binary) = 13 (decimal). Therefore, we can use Bitwise Operators to compare those numbers:

# Original vectors
tum1 = int('1101', 2)
tum2 = int('1101', 2)
tum3 = int('0010', 2)
tum4 = int('1111', 2)
# Let's print tum1
tum1
> 13
#Bitwise comparison
bin(tum1 & tum2)[2:] 
> '1101'
bin(tum1 & tum3)[2:]
> '0001'
bin(tum1 & tum4)[2:]
> '1101'

A '0' at the position 'n' means the two conditions does not share the mutation 'n'. On the contrary, a '1' at the 'n'th position means the two conditions shares the mutation 'n'.

We have now the common mutations among the tumors. However, we would like to know which one has the highest number of common mutation. So let's consider the following Python command line. It sums our previous vectors:

sum(map(int, bin(tum1 & tum2)[2:]))
> 3
sum(map(int, bin(tum1 & tum3)[2:]))
> 0
sum(map(int, bin(tum1 & tum4)[2:]))
> 3

We can see that the couples (tum1, tum2) and (tum1, tum4) have 3 mutations in common, and that the couple (tum1, tum3) have none.

In terms of time, I have tried with two 100,000-values-long vectors, composed by random '0's and '1's, the comparison seemed instant to me. Reversing and joining the table was the most time-consuming step.

ADD COMMENT

Login before adding your answer.

Traffic: 2039 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6