Question: Assembling a sequence from the pairs of letters (X,Y)
1
2.6 years ago by
Bogdan1.1k
Palo Alto, CA, USA
Bogdan1.1k wrote:

Dear all,

Please could you advise on an algorithm that could solve a relatively easy problem (I have just started recalling and reviewing very old informatics classes on graph theory, recursion and dynamic programming).

the computational problem is : considering a sequence of pairs of type `(X, Y)`,for example:

``````(A, B)
(C, D)
(B, E)
(Z, T)
(W, A)
(G, T)
(Z, I)
``````

What is the optimal strategy to connect these pairs of letters into a sequence :

`````` W -- > A,  A -->B,  B --> E.
``````

Thank you very much,

Bogdan

sequence R • 536 views
ADD COMMENTlink
modified 2.6 years ago by zx87549.9k • written 2.6 years ago by Bogdan1.1k
1
2.6 years ago by
d-cameron2.3k
Australia
d-cameron2.3k wrote:

what is the optimal strategy to connect these pairs of letters into a sequence

Typically assembler are looking for a Eulerian path, Hamiltonian cycle, or a variation of one of those approaches.

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by d-cameron2.3k

Thank you Daniel. Yes, I would think that the problem can be re-stated in terms of finding an Eulerian path in a directed graph ;

I would think that some packages (Cytoscape, igraph, etc) may have the functions to compute Eulerian, Hamiltonian path, cycles or cliques. Which package (in R, or Python, etc) would you recommend for distinct calculations on the graphs ? Thank you !

ADD REPLYlink written 2.6 years ago by Bogdan1.1k

Dear Daniel,

as I do not have your email address, please may I post here a question about BioMart and StructuralVariantAnnotation. I am using the piece of R code below, that was inspired by the package StructuralVariantAnnotation that you have written in order to annotate the Structural Variants from DELLY. :

However, the coordinates on chr21 are not annotated properly, in the sense that : shall I input the following coordinates for a breakpoint "chr21:10813930-10813931", it gives me the gene annotations such as "SMIM11B,U2AF1L5,LOC102724652,CRYAA,U2AF1,CBS"

I can send you the full code in R, if you wish. Would you please let me know --- is there a way to fix it with biomart ? thank you very much !

-- bogdan

<h6>############### the piece of R code that I am using is the following :</h6>

*

``````gns <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
hits <- as.data.frame(findOverlaps(gr, gns, ignore.strand=TRUE))

hits\$SYMBOL <- biomaRt::select(org.Hs.eg.db, gns[hits\$subjectHits]\$gene_id, "SYMBOL")\$SYMBOL
hits\$gene_strand <- as.character(strand(gns[hits\$subjectHits]))
hits <- hits %>%
group_by(queryHits) %>%
summarise(SYMBOL=paste(SYMBOL, collapse=","), gene_strand=paste0(gene_strand, collapse=""))
``````
*
ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Bogdan1.1k

gns <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)

genes(hg19) returns some really strange results. You'll want to use transcripts() instead as some genes are annotated have two transcripts 100+Mb apart so genes() matches almost the whole chromosome for those.

The clinical pipelines I'm involved in uses transcript() based code but I never got around rewriting the GRIDSS example code. Sorry about that.

ADD REPLYlink written 2.6 years ago by d-cameron2.3k

Thank you : yes, using transcripts() instead of genes() is a very good suggestion. Thanks !

ADD REPLYlink written 2.6 years ago by Bogdan1.1k
Please log in to add an answer.

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1504 users visited in the last hour
_