Indel coding in python
1
1
Entering edit mode
21 months ago

I have 2 sequences with gaps:

15888 CATG----ACAGAGCGACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG
42690 CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACG---GGTGGCCCCGG


I would like to encode indels in binary form. Gap is marked with 0 and no gap is 1. I want to receive something like this:

15888 0 0 1
42690 1 1 0


indel coding biopython • 882 views
1
Entering edit mode

Did you try anything? The idea of online communities is typically not to be code factories but to support each other by providing advise, help and sharing experiences.

0
Entering edit mode

I don't expect to get the code ready. I am asking for advice on how I could start, because I am a beginner.

0
Entering edit mode

You should expect that the people answring you will be willing to put in a fraction of the amount of work answering you as you have demonstrated you have put into solving your own problem. How much effort do you think you have demonstrated here?

0
Entering edit mode

Is it always 2 sequences? If not, is one always considered the reference? - since your result will be subjective on the frame of reference.

0
Entering edit mode

First I want to do it for these 2 sequences and then expand it.

3
Entering edit mode

It's important to frame the larger question carefully, because a solution that works for 2 sequences may not easily generalise to more than 2, hence my questions.

What if, for example, the data looked like this:

15888 CATG----ACAGAGCGACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG
42690 CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACG---GGTGGCCCCGG
12345 CATG----ACAGA---ACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG


Is that first gap a deletion in 1 and 3, or an insertion in 2? What about the new deletion I've added which isn't present in any of the other sequences and now changes the numbering? There is no easy way to decide what the output should be without stating your frame of reference.

2
Entering edit mode

Or how would you notate:

15888 CATG----ACAGAGCGACCCGCG---ACGTTACAAACACTACGCGGGGTGGCCCCGG
42690 CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACG---GGTGGCCCCGG
12345 CATG----ACAGA---ACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG


That's two different gaps at the same coordinates.

This seems like one of those situations where rather than asking "How do I do X", you need to be able to articulate why are you are doing X, and ask "How do I do Y", where Y is your real analysis goal.

0
Entering edit mode

Ultimately, the code is to be for an unlimited number of sequences. For example: for 10 or 30 sequences. I generally treat insertion or deletion as indel.

0
Entering edit mode

What you're asking for is impossible without comparing to a specific reference.

2
Entering edit mode
21 months ago
zx8754 11k

As suggested, consider the the larger question. But, based on assumptions from your example, looks like an interesting problem.

Here is using R, not going to explain the code in detail. In short, find position of "-" in all strings, create groups based on consecutive positions, i.e.: make "indels" ranges, then compare those position for each string.

foo <- function(x){
xSplit <- strsplit(x, "")

# indels
indels <- unique(
unlist(
lapply(xSplit, function(i){
ix <- which(i == "-")
g <- c(0, cumsum(diff(ix) > 1))
split(ix, g)
}),  recursive = FALSE))
# sort by position
indels <- indels[order(sapply(indels, [, 1), lengths(indels))]
names(indels) <- paste0("indel_", seq(length(indels)))

# return binary
t(
sapply(xSplit, function(i){
sapply(indels, function(j){
!all(i[ j ] == "-")
})
}) * 1
)
}
# example data
x1 <- c("CATG----ACAGAGCGACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG",
"CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACG---GGTGGCCCCGG")
x2 <- c("CATG----ACAGAGCGACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG",
"CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACG---GGTGGCCCCGG",
"CATG----ACAGA---ACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG")

# test
foo(x1)
#      indel_1 indel_2 indel_3
# [1,]       0       0       1
# [2,]       1       1       0

foo(x2)
#      indel_1 indel_2 indel_3 indel_4
# [1,]       0       1       0       1
# [2,]       1       1       1       0
# [3,]       0       0       0       1

0
Entering edit mode

I think for sequence 1 in x2, it should be 0, 1, 0, 1, and sequence 3 in x2 it should be 0, 0, 0, 1, if I understand the author correctly.

0
Entering edit mode

Thank you. Forgot to sort by position, now fixed.

0
Entering edit mode

Thank you very much.