Question

Assembling a sequence from the pairs of letters (X,Y)

1

5.9 years ago

Bogdan ★ 1.4k

Dear all,

Please could you advise on an algorithm that could solve a relatively easy problem (I have just started recalling and reviewing very old informatics classes on graph theory, recursion and dynamic programming).

The computational problem is: given a sequence of pairs of type (X, Y), for example:

(A, B)
(C, D)
(B, E)
(Z, T)
(W, A)
(G, T)
(Z, I)

What is the optimal strategy to connect these pairs of letters into a sequence, such as:

W --> A, A --> B, B --> E

Thank you very much,

Bogdan

R sequence • 1.4k views

Updated 5.9 years ago by zx8754 11k • written 5.9 years ago by Bogdan ★ 1.4k

score 1 · Answer 1 · 2018-06-17

1

5.9 years ago

d-cameron ★ 2.9k

what is the optimal strategy to connect these pairs of letters into a sequence

Typically, assemblers look for an Eulerian path, a Hamiltonian cycle, or a variation of one of those approaches.
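For the example pairs in the question, a single Eulerian path does not exist (Z has two outgoing pairs and T has two incoming ones), so a simpler variation may be enough: treat each pair (X, Y) as a directed edge X --> Y and follow edges from every letter that never appears as a Y. The following is only a minimal Python sketch under that assumption; the function name `maximal_paths` is made up for illustration, and a component that forms a pure cycle has no starting letter and is skipped.

```python
from collections import defaultdict

def maximal_paths(pairs):
    """List every path that starts at a source letter and cannot be extended."""
    successors = defaultdict(list)
    has_incoming = set()
    for x, y in pairs:
        successors[x].append(y)
        has_incoming.add(y)

    # Source letters appear as X but never as Y (first-seen order is kept).
    first_letters = dict.fromkeys(x for x, _ in pairs)
    sources = [x for x in first_letters if x not in has_incoming]

    paths = []

    def extend(path):
        # Skip letters already on the path so that cycles cannot loop forever.
        nexts = [n for n in successors.get(path[-1], []) if n not in path]
        if not nexts:
            paths.append(path)
            return
        for n in nexts:
            extend(path + [n])

    for s in sources:
        extend([s])
    return paths

pairs = [("A", "B"), ("C", "D"), ("B", "E"), ("Z", "T"),
         ("W", "A"), ("G", "T"), ("Z", "I")]
paths = maximal_paths(pairs)
longest = max(paths, key=len)
print(" --> ".join(longest))  # W --> A --> B --> E
```

Taking the longest path is just one possible meaning of "optimal"; the full `paths` list also contains the shorter chains (C --> D, Z --> T, Z --> I, G --> T).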

0

Thank you, Daniel. Yes, I would think that the problem can be restated in terms of finding an Eulerian path in a directed graph.
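For reference, when the pairs do admit an Eulerian path (every pair used exactly once), Hierholzer's algorithm finds it in linear time. The sketch below is a plain-Python illustration, not taken from any package; the function name `eulerian_path` is chosen here for illustration, and it returns `None` when no such path exists, as happens with the full example list from the question.

```python
from collections import defaultdict

def eulerian_path(pairs):
    """Return a letter sequence using every pair exactly once, or None."""
    if not pairs:
        return []
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for x, y in pairs:
        out_edges[x].append(y)
        balance[x] += 1
        balance[y] -= 1

    # At most one start (balance +1) and one end (balance -1) are allowed.
    starts = [n for n, b in balance.items() if b == 1]
    ends = [n for n, b in balance.items() if b == -1]
    if any(abs(b) > 1 for b in balance.values()) or len(starts) > 1 or len(ends) > 1:
        return None
    start = starts[0] if starts else pairs[0][0]

    # Hierholzer: walk unused edges, emit a letter once it has none left.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    # Unused edges remain if the graph is not connected.
    return path if len(path) == len(pairs) + 1 else None

print(eulerian_path([("A", "B"), ("B", "E"), ("W", "A")]))  # ['W', 'A', 'B', 'E']
```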

I would think that some packages (Cytoscape, igraph, etc.) have functions to compute Eulerian or Hamiltonian paths, cycles, or cliques. Which package (in R, Python, etc.) would you recommend for such graph calculations? Thank you!

Reply • 5.9 years ago by Bogdan ★ 1.4k

0

Dear Daniel,

As I do not have your email address, may I post here a question about biomaRt and StructuralVariantAnnotation? I am using the piece of R code below to annotate the structural variants from DELLY. It was inspired by the StructuralVariantAnnotation package that you have written:

https://github.com/PapenfussLab/gridss/blob/master/example/somatic-fusion-gene-candidates.R

However, the coordinates on chr21 are not annotated properly: if I input the breakpoint coordinates "chr21:10813930-10813931", it returns gene annotations such as "SMIM11B,U2AF1L5,LOC102724652,CRYAA,U2AF1,CBS".

I can send you the full R code, if you wish. Could you please let me know whether there is a way to fix it with biomaRt? Thank you very much!

-- bogdan

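The piece of R code that I am using is the following:

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(org.Hs.eg.db)
library(dplyr)

# gr: GRanges of breakpoint coordinates, defined earlier in the script (not shown)
# Gene-level ranges, and their overlaps with the breakpoints
gns <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
hits <- as.data.frame(findOverlaps(gr, gns, ignore.strand=TRUE))

# Look up the gene symbol and strand for every overlapping gene
hits$SYMBOL <- biomaRt::select(org.Hs.eg.db, gns[hits$subjectHits]$gene_id, "SYMBOL")$SYMBOL
hits$gene_strand <- as.character(strand(gns[hits$subjectHits]))
# Collapse to one row per breakpoint
hits <- hits %>%
  group_by(queryHits) %>%
  summarise(SYMBOL=paste(SYMBOL, collapse=","), gene_strand=paste0(gene_strand, collapse=""))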

Reply • 5.8 years ago by Bogdan ★ 1.4k

0

gns <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)

genes() on hg19 returns some really strange results. You'll want to use transcripts() instead, as some genes are annotated with two transcripts 100+ Mb apart, so genes() matches almost the whole chromosome for those.

The clinical pipelines I'm involved in use transcripts()-based code, but I never got around to rewriting the GRIDSS example code. Sorry about that.
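To illustrate why gene-level ranges can over-match, here is a toy Python sketch. It assumes that a gene-level range runs from the start of the gene's first transcript to the end of its last one. The gene names and coordinates are hypothetical, chosen only to mimic two transcripts more than 100 Mb apart; they are not real annotations.

```python
# Hypothetical annotation: GENE_X has two transcripts more than 100 Mb apart.
transcripts = {
    "GENE_X": [(5_000_000, 5_050_000), (120_000_000, 120_040_000)],
    "GENE_Y": [(60_000_000, 60_020_000)],
}

def overlaps(start, end, pos):
    """True if the position falls inside the closed interval [start, end]."""
    return start <= pos <= end

def hits_by_gene_span(pos):
    """Match against one range per gene (first start to last end)."""
    hits = []
    for gene, txs in transcripts.items():
        span_start = min(s for s, _ in txs)
        span_end = max(e for _, e in txs)
        if overlaps(span_start, span_end, pos):
            hits.append(gene)
    return sorted(hits)

def hits_by_transcript(pos):
    """Match against each transcript range separately."""
    return sorted(
        gene for gene, txs in transcripts.items()
        if any(overlaps(s, e, pos) for s, e in txs)
    )

pos = 60_010_000  # hypothetical breakpoint position
print(hits_by_gene_span(pos))   # ['GENE_X', 'GENE_Y']
print(hits_by_transcript(pos))  # ['GENE_Y']
```

The gene-level match reports GENE_X even though neither of its transcripts is anywhere near the breakpoint, which is the kind of spurious annotation described above.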

Reply • 5.8 years ago by d-cameron ★ 2.9k

0

Thank you. Yes, using transcripts() instead of genes() is a very good suggestion. Thanks!

Reply • 5.8 years ago by Bogdan ★ 1.4k