Question: Assembling a sequence from the pairs of letters (X,Y)
1
gravatar for Bogdan
18 months ago by
Bogdan890
Palo Alto, CA, USA
Bogdan890 wrote:

Dear all,

Please could you advise on an algorithm that could solve a relatively easy problem (I have just started recalling and reviewing very old informatics classes on graph theory, recursion and dynamic programming).

the computational problem is : considering a sequence of pairs of type (X, Y),for example:

(A, B)
(C, D)
(B, E)
(Z, T)
(W, A)
(G, T)
(Z, I)

What is the optimal strategy to connect these pairs of letters into a sequence :

 W -- > A,  A -->B,  B --> E.

Thank you very much,

Bogdan

sequence R • 382 views
ADD COMMENTlink modified 18 months ago by zx87548.8k • written 18 months ago by Bogdan890
1
gravatar for d-cameron
18 months ago by
d-cameron2.1k
Australia
d-cameron2.1k wrote:

what is the optimal strategy to connect these pairs of letters into a sequence

Typically assembler are looking for a Eulerian path, Hamiltonian cycle, or a variation of one of those approaches.

ADD COMMENTlink modified 18 months ago • written 18 months ago by d-cameron2.1k

Thank you Daniel. Yes, I would think that the problem can be re-stated in terms of finding an Eulerian path in a directed graph ;

I would think that some packages (Cytoscape, igraph, etc) may have the functions to compute Eulerian, Hamiltonian path, cycles or cliques. Which package (in R, or Python, etc) would you recommend for distinct calculations on the graphs ? Thank you !

ADD REPLYlink written 18 months ago by Bogdan890

Dear Daniel,

as I do not have your email address, please may I post here a question about BioMart and StructuralVariantAnnotation. I am using the piece of R code below, that was inspired by the package StructuralVariantAnnotation that you have written in order to annotate the Structural Variants from DELLY. :

https://github.com/PapenfussLab/gridss/blob/master/example/somatic-fusion-gene-candidates.R;

However, the coordinates on chr21 are not annotated properly, in the sense that : shall I input the following coordinates for a breakpoint "chr21:10813930-10813931", it gives me the gene annotations such as "SMIM11B,U2AF1L5,LOC102724652,CRYAA,U2AF1,CBS"

I can send you the full code in R, if you wish. Would you please let me know --- is there a way to fix it with biomart ? thank you very much !

-- bogdan

<h6>############### the piece of R code that I am using is the following :</h6>

*

gns <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
hits <- as.data.frame(findOverlaps(gr, gns, ignore.strand=TRUE))

hits$SYMBOL <- biomaRt::select(org.Hs.eg.db, gns[hits$subjectHits]$gene_id, "SYMBOL")$SYMBOL
hits$gene_strand <- as.character(strand(gns[hits$subjectHits]))
hits <- hits %>%
  group_by(queryHits) %>%
  summarise(SYMBOL=paste(SYMBOL, collapse=","), gene_strand=paste0(gene_strand, collapse=""))
    *
ADD REPLYlink modified 17 months ago • written 17 months ago by Bogdan890

gns <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)

genes(hg19) returns some really strange results. You'll want to use transcripts() instead as some genes are annotated have two transcripts 100+Mb apart so genes() matches almost the whole chromosome for those.

The clinical pipelines I'm involved in uses transcript() based code but I never got around rewriting the GRIDSS example code. Sorry about that.

ADD REPLYlink written 17 months ago by d-cameron2.1k

Thank you : yes, using transcripts() instead of genes() is a very good suggestion. Thanks !

ADD REPLYlink written 17 months ago by Bogdan890
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1702 users visited in the last hour