Question: Assembling a sequence from the pairs of letters (X,Y)
1
gravatar for Bogdan
9 months ago by
Bogdan720
Palo Alto, CA, USA
Bogdan720 wrote:

Dear all,

Please could you advise on an algorithm that could solve a relatively easy problem (I have just started recalling and reviewing very old informatics classes on graph theory, recursion and dynamic programming).

the computational problem is : considering a sequence of pairs of type (X, Y),for example:

(A, B)
(C, D)
(B, E)
(Z, T)
(W, A)
(G, T)
(Z, I)

What is the optimal strategy to connect these pairs of letters into a sequence :

 W -- > A,  A -->B,  B --> E.

Thank you very much,

Bogdan

sequence R • 272 views
ADD COMMENTlink modified 9 months ago by zx87546.8k • written 9 months ago by Bogdan720
1
gravatar for d-cameron
9 months ago by
d-cameron2.0k
Australia
d-cameron2.0k wrote:

what is the optimal strategy to connect these pairs of letters into a sequence

Typically assembler are looking for a Eulerian path, Hamiltonian cycle, or a variation of one of those approaches.

ADD COMMENTlink modified 9 months ago • written 9 months ago by d-cameron2.0k

Thank you Daniel. Yes, I would think that the problem can be re-stated in terms of finding an Eulerian path in a directed graph ;

I would think that some packages (Cytoscape, igraph, etc) may have the functions to compute Eulerian, Hamiltonian path, cycles or cliques. Which package (in R, or Python, etc) would you recommend for distinct calculations on the graphs ? Thank you !

ADD REPLYlink written 9 months ago by Bogdan720

Dear Daniel,

as I do not have your email address, please may I post here a question about BioMart and StructuralVariantAnnotation. I am using the piece of R code below, that was inspired by the package StructuralVariantAnnotation that you have written in order to annotate the Structural Variants from DELLY. :

https://github.com/PapenfussLab/gridss/blob/master/example/somatic-fusion-gene-candidates.R;

However, the coordinates on chr21 are not annotated properly, in the sense that : shall I input the following coordinates for a breakpoint "chr21:10813930-10813931", it gives me the gene annotations such as "SMIM11B,U2AF1L5,LOC102724652,CRYAA,U2AF1,CBS"

I can send you the full code in R, if you wish. Would you please let me know --- is there a way to fix it with biomart ? thank you very much !

-- bogdan

<h6>############### the piece of R code that I am using is the following :</h6>

*

gns <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
hits <- as.data.frame(findOverlaps(gr, gns, ignore.strand=TRUE))

hits$SYMBOL <- biomaRt::select(org.Hs.eg.db, gns[hits$subjectHits]$gene_id, "SYMBOL")$SYMBOL
hits$gene_strand <- as.character(strand(gns[hits$subjectHits]))
hits <- hits %>%
  group_by(queryHits) %>%
  summarise(SYMBOL=paste(SYMBOL, collapse=","), gene_strand=paste0(gene_strand, collapse=""))
    *
ADD REPLYlink modified 8 months ago • written 8 months ago by Bogdan720

gns <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)

genes(hg19) returns some really strange results. You'll want to use transcripts() instead as some genes are annotated have two transcripts 100+Mb apart so genes() matches almost the whole chromosome for those.

The clinical pipelines I'm involved in uses transcript() based code but I never got around rewriting the GRIDSS example code. Sorry about that.

ADD REPLYlink written 8 months ago by d-cameron2.0k

Thank you : yes, using transcripts() instead of genes() is a very good suggestion. Thanks !

ADD REPLYlink written 8 months ago by Bogdan720
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1880 users visited in the last hour