Indel coding in python
1
1
Entering edit mode
3.7 years ago
adrian18_07 ▴ 10

I have 2 sequences with gaps:

15888 CATG----ACAGAGCGACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG
42690 CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACG---GGTGGCCCCGG

I would like to encode indels in binary form. Gap is marked with 0 and no gap is 1. I want to receive something like this:

15888 0 0 1
42690 1 1 0

Thanks for any answer.

indel coding biopython • 1.8k views
ADD COMMENT
1
Entering edit mode

Did you try anything? The idea of online communities is typically not to be code factories but to support each other by providing advise, help and sharing experiences.

ADD REPLY
0
Entering edit mode

I don't expect to get the code ready. I am asking for advice on how I could start, because I am a beginner.

ADD REPLY
0
Entering edit mode

You should expect that the people answring you will be willing to put in a fraction of the amount of work answering you as you have demonstrated you have put into solving your own problem. How much effort do you think you have demonstrated here?

ADD REPLY
0
Entering edit mode

Is it always 2 sequences? If not, is one always considered the reference? - since your result will be subjective on the frame of reference.

ADD REPLY
0
Entering edit mode

First I want to do it for these 2 sequences and then expand it.

ADD REPLY
3
Entering edit mode

It's important to frame the larger question carefully, because a solution that works for 2 sequences may not easily generalise to more than 2, hence my questions.

What if, for example, the data looked like this:

15888 CATG----ACAGAGCGACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG
42690 CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACG---GGTGGCCCCGG
12345 CATG----ACAGA---ACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG

Is that first gap a deletion in 1 and 3, or an insertion in 2? What about the new deletion I've added which isn't present in any of the other sequences and now changes the numbering? There is no easy way to decide what the output should be without stating your frame of reference.

ADD REPLY
2
Entering edit mode

Or how would you notate:

15888 CATG----ACAGAGCGACCCGCG---ACGTTACAAACACTACGCGGGGTGGCCCCGG
42690 CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACG---GGTGGCCCCGG
12345 CATG----ACAGA---ACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG

That's two different gaps at the same coordinates.

This seems like one of those situations where rather than asking "How do I do X", you need to be able to articulate why are you are doing X, and ask "How do I do Y", where Y is your real analysis goal.

ADD REPLY
0
Entering edit mode

Ultimately, the code is to be for an unlimited number of sequences. For example: for 10 or 30 sequences. I generally treat insertion or deletion as indel.

ADD REPLY
0
Entering edit mode

What you're asking for is impossible without comparing to a specific reference.

ADD REPLY
2
Entering edit mode
3.7 years ago
zx8754 11k

As suggested, consider the the larger question. But, based on assumptions from your example, looks like an interesting problem.

Here is using R, not going to explain the code in detail. In short, find position of "-" in all strings, create groups based on consecutive positions, i.e.: make "indels" ranges, then compare those position for each string.

foo <- function(x){
  xSplit <- strsplit(x, "")

  # indels
  indels <- unique(
    unlist(
      lapply(xSplit, function(i){ 
        ix <- which(i == "-")
        g <- c(0, cumsum(diff(ix) > 1))
        split(ix, g)
      }),  recursive = FALSE))
  # sort by position
  indels <- indels[order(sapply(indels, `[`, 1), lengths(indels))]
  names(indels) <- paste0("indel_", seq(length(indels)))

  # return binary
  t(
    sapply(xSplit, function(i){
      sapply(indels, function(j){
        !all(i[ j ] == "-")
      })
    }) * 1
  )
}
# example data
x1 <- c("CATG----ACAGAGCGACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG",
        "CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACG---GGTGGCCCCGG")
x2 <- c("CATG----ACAGAGCGACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG",
        "CATGCCCGACAGAGCGACCCGCGAACACGTTACAAACACTACG---GGTGGCCCCGG",
        "CATG----ACAGA---ACCCGCG--CACGTTACAAACACTACGCGGGGTGGCCCCGG")

# test
foo(x1)
#      indel_1 indel_2 indel_3
# [1,]       0       0       1
# [2,]       1       1       0

foo(x2)
#      indel_1 indel_2 indel_3 indel_4
# [1,]       0       1       0       1
# [2,]       1       1       1       0
# [3,]       0       0       0       1
ADD COMMENT
0
Entering edit mode

I think for sequence 1 in x2, it should be 0, 1, 0, 1, and sequence 3 in x2 it should be 0, 0, 0, 1, if I understand the author correctly.

ADD REPLY
0
Entering edit mode

Thank you. Forgot to sort by position, now fixed.

ADD REPLY
0
Entering edit mode

Thank you very much.

ADD REPLY

Login before adding your answer.

Traffic: 3555 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6