Tool: CBioInfCpp.h, a C++ lib containing some functions for bioinformatics
17
2
4.0 years ago

Dear Sirs.

Though I am not a professional programmer, bioinformatics is a very interesting interdisciplinary field for me.

As I see it, Python is the "standard language" in this field.

But when I solved problems at Rosalind (rosalind.info), I used C++. As a result, a "lib of some functions" was born.

The lib contains 3 groups of functions. The first group is input/output functions (for reading and writing vectors, matrices, and graphs from/to a file with a single command, as in Python).
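
As a rough illustration of the "single command" idea, here is a minimal sketch of reading an integer matrix from a stream. The name ReadMatrix and the one-row-per-line format are assumptions made for this example; they are not taken from CBioInfCpp.h.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not the CBioInfCpp API): reads a whitespace-separated
// integer matrix, one row per line, from any input stream.
std::vector<std::vector<int>> ReadMatrix(std::istream& in) {
    std::vector<std::vector<int>> matrix;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream row_stream(line);
        std::vector<int> row;
        int value;
        while (row_stream >> value) row.push_back(value);
        if (!row.empty()) matrix.push_back(row);  // skip blank lines
    }
    return matrix;
}
```

With a file, the call would look like `std::ifstream file("matrix.txt"); auto m = ReadMatrix(file);` (the file name is only a placeholder).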

The second group is "Working with strings". It contains functions ranging from computing GC content, edit distance, etc. to finding all mutated strings in a given one.
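
To make the string group more concrete, here is a minimal sketch of two such computations: GC content and the classic (Levenshtein) edit distance. The names GcContent and EditDistance are hypothetical and the code is written for this post only; the lib's own functions may have different names and signatures.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Fraction of G/C characters in a DNA string (0.0 for an empty string).
double GcContent(const std::string& dna) {
    if (dna.empty()) return 0.0;
    std::size_t gc = 0;
    for (char c : dna)
        if (c == 'G' || c == 'C' || c == 'g' || c == 'c') ++gc;
    return static_cast<double>(gc) / dna.size();
}

// Levenshtein edit distance using two rolling rows of the DP table.
std::size_t EditDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t substitution = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            curr[j] = std::min({substitution, prev[j] + 1, curr[j - 1] + 1});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
```

Only two rows are kept because each cell depends on the previous row and the current one, so memory stays linear in the length of the second string.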

The third is "Working with graphs". A data structure called "Adjacency vector" is proposed. In the general case, vertices may be labelled with negative integers, and graphs may have multiple loops and edges. Some functions, such as Eulerian cycle and path finding, topological sorting, etc., are implemented.
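
A minimal sketch of how topological sorting could work over a flat edge list. It assumes the graph is stored as a std::vector<int> of consecutive (from, to) pairs; this layout is an assumption for the example and may differ from the lib's actual "Adjacency vector". The name TopologicalOrder is hypothetical too. Maps are used so that negative vertex labels and multiple edges are handled.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <vector>

// Assumed layout: edges[2*k] -> edges[2*k+1] is the k-th directed edge.
// Returns a topological order (Kahn's algorithm), or an empty vector if the
// graph contains a cycle.
std::vector<int> TopologicalOrder(const std::vector<int>& edges) {
    std::map<int, std::vector<int>> successors;
    std::map<int, int> in_degree;
    for (std::size_t k = 0; k + 1 < edges.size(); k += 2) {
        successors[edges[k]].push_back(edges[k + 1]);
        in_degree[edges[k]];  // make sure the source vertex is registered
        ++in_degree[edges[k + 1]];
    }
    std::queue<int> ready;
    for (const auto& entry : in_degree)
        if (entry.second == 0) ready.push(entry.first);
    std::vector<int> order;
    while (!ready.empty()) {
        int v = ready.front();
        ready.pop();
        order.push_back(v);
        for (int next : successors[v])
            if (--in_degree[next] == 0) ready.push(next);
    }
    if (order.size() != in_degree.size()) order.clear();  // cycle detected
    return order;
}
```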

Might it be useful for some tasks?

By the way, which algorithmic functions and problems should be included, or maybe solved, here?

I understand that this lib lacks a great many features. For example, it is not yet able to work with bioinformatics databases, and I cannot implement that by myself alone.

Freely distributed source code and info are here: https://drive.google.com/open?id=1FQwsQm2kG_nTO45ab0yj52xtp6_B4IB2

(This is a link to a directory, not to a file; it contains the source code file and readme files.)

My profile at Rosalind: http://rosalind.info/users/chernouhov/

Best regards, Chernouhov Sergey

Modified on 03 May 2019:

But:

• GitHub is a new experience for me, so I probably make some mistakes there.
• Why is GitHub the only trusted place? We may be free, but only there?

I DO declare that I DO NOT clearly understand everything about GitHub, so nowadays I use it only as file hosting, since it is such a popular place.

Best regards, Chernouhov Sergey

Tool C++ tool c++ • 2.8k views
3

Hi, maybe you're interested in contributing to the c++ SeqAn library?

https://www.seqan.de/

0

Hi.

Thanks. It's a great idea. Why not?

By the way, do you use the SeqAn library, or do you participate in its development?

0

But: - GitHub is a new experience for me, so I probably make some mistakes there. - Why is GitHub the only trusted place? We may be free, but only there?

I DO declare that I DO NOT clearly understand everything about GitHub, so nowadays I use it only as file hosting, since it is such a popular place.

2

It's not the only trusted place, it's just become the most common/most well known. You can also see the source code directly, versus having to trust a file blindly.

Bitbucket, SourceForge, GitLab etc. are all still used, just to varying and lesser extents.

0

Well, we DID talk about GitHub but we DID NOT talk about the lib itself. Nowadays it is hosted on GitHub. But that is not its key feature, just as hosting is not the key feature of any item, good or bad, is it?

2

There is no need to post the same comment multiple times (I know these threads can get a little disorganised over time, but once is enough).

I'm not sure I understand your question? There's nothing more to say regarding the lib or github as far as I can see? You've uploaded it in a nice, visible place. If people want to use it, they'll use it.

0

GitHub is not just a file hosting site, it hosts and helps manage Git projects. By Git projects, I mean Git repositories with issues, Pull Requests, etc. Git repositories are essentially version-controlled code directories allowing for concurrent development and change tracking along with a host of other amazing features. If you're new to Git, you should definitely learn it as it will better your approach to software development.

2

I would consider putting the code on Github, rather than distributing it as a google link. People are often wary of downloading code from behind random links without first being able to inspect the source.

0

Hi. Thanks. It is a good idea and I plan to do it a little later (as I haven't used Github yet).

But I must confess: since the lib CBioInfCpp currently consists of only one header file (as free source code), is it really so bad to use Google Drive too? There are also 2 files, PDF and RTF, that contain the same description of the CBioInfCpp functions in different formats. One may use either of them, depending on the preferred format.

3

Well, look at it this way: I haven't clicked on your link yet, even though I trust you, because I don't know if it will take me to a page, or will start a download immediately. If it starts a download immediately, I don't know if I'm getting a zip file, naked source code, or something masquerading as either.

If you want to contribute to projects, or have people contribute to improving your code, github (and its friends) is absolutely the way to go. To get started with github you need only three commands really: git pull, git commit, git push. Everything else is a bonus ;)

There are plenty of good youtube tutorials etc to get you going.

0

I'll look into it.

0

Sure, but it's hard to tell that from the link alone, so people are unlikely to click it.

1

I also don't think Google Drive is appropriate. Software in the days of Google Code, SourceForge etc. was (and still in some cases is) far more poorly documented, opaque, and unversioned. As a developer I think you'll enjoy GitHub very much.

1

What jrj.healey said was the first thought to cross my mind. I'm not clicking on a google drive link. I really want to look at the code, the code structure and a README before I decide if something is worth a download.

0

As I see it, why not try to use some tool, at least CBioInfCpp.h?

Maybe there are some interesting problems for strings, graphs, etc.?

And why not use it for input/output when solving other tasks?

0

Please don't add answers unless you are responding to the opening post. This is just a comment so I have moved it.

That said, I don't really understand your comment - what are you asking?

As I said before, you've already uploaded your code, if people find it, and want to use it, they will - there's nothing more to be done...

0

It is my trouble with the language, I see.

I mean there may be some problems to solve, and it is interesting for me to solve such problems, whether using this lib or not.

1
3.9 years ago

But: - GitHub is a new experience for me, so I probably make some mistakes there. - Why is GitHub the only trusted place? We may be free, but only there?

I DO declare that I DO NOT clearly understand everything about GitHub, so nowadays I use it only as file hosting, since it is such a popular place.

0

GitHub is now owned by Microsoft. I'd prefer it not to be owned by a big tech company, but that's life. Alternatives include GitLab. Nice one for making your (first?) steps into git, I doubt you'll regret it.

1
2.9 years ago

22.04.2020

• Modified function GenerateAlphabet for a single string.
• Added the group of functions MakeSubgraphSetOfVertices to generate a subgraph of a given graph (set by an Adjacency vector) from a chosen set/unordered_set of vertices.
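
A minimal sketch of the subgraph idea above, under two assumptions that are mine and not confirmed by the lib: the graph is a flat list of (from, to) pairs, and the result is the induced subgraph (every edge whose both endpoints are among the chosen vertices). The name InducedSubgraph is hypothetical.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Assumed layout: flat list of (from, to) pairs. Keeps only the edges whose
// both endpoints belong to the chosen vertex set (the induced subgraph).
std::vector<int> InducedSubgraph(const std::vector<int>& edges,
                                 const std::unordered_set<int>& keep) {
    std::vector<int> result;
    for (std::size_t k = 0; k + 1 < edges.size(); k += 2) {
        if (keep.count(edges[k]) && keep.count(edges[k + 1])) {
            result.push_back(edges[k]);
            result.push_back(edges[k + 1]);
        }
    }
    return result;
}
```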
0
3.8 years ago

23/06/2019 update:

• The "FindIn" group of functions has been updated.
• Functions PairVectorCout and PairVectorFout have been updated.
• The "GraphCout" and "GraphFout" groups of functions have been added. So now one may "cout/fout" a graph that is set by an Adjacency vector to screen/to file line by line: one edge per line.
• Function "StrToCircular" has been added for finding the circular string of minimal length of the given one.
• The "MaxFlowGraph" group of functions has been added to help find the maximal flow, the paths of the maximal flow network, and the max-flow min-cut in a graph.
• A data structure "Adjacency map" (a modification of the "Adjacency vector" data structure for storing graphs) has been added. The Adjacency map allows quicker access to an edge's weight, but it can't work with multiple edges.
• Function TandemRepeatsFinding has been added. It is intended for finding tandem repeats in a given string, which may be useful for solving problems related to microsatellite instability etc.
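
The trade-off mentioned for the "Adjacency map" in the list above can be illustrated with a minimal sketch. It assumes a std::map keyed by the (from, to) pair and an input given as flat (from, to, weight) triples; both the layout and the name ToAdjacencyMap are assumptions for this example, not the lib's definitions.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using AdjacencyMap = std::map<std::pair<int, int>, int>;

// Assumed input layout: flat list of (from, to, weight) triples.
// A later parallel edge overwrites an earlier one, which is why a structure
// keyed by (from, to) cannot represent multiple edges.
AdjacencyMap ToAdjacencyMap(const std::vector<int>& weighted_edges) {
    AdjacencyMap result;
    for (std::size_t k = 0; k + 2 < weighted_edges.size(); k += 3) {
        const std::pair<int, int> key(weighted_edges[k], weighted_edges[k + 1]);
        result[key] = weighted_edges[k + 2];
    }
    return result;
}
```

Looking up a weight is then a single keyed access, for example `m.at(std::make_pair(0, 1))`, instead of a scan over the whole edge list.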
0
3.7 years ago

14.07.2019 update:

• Function CIGAR1 has been added.
• The "GraphCout" and "GraphFout" groups of functions have been updated (so now one may "cout/fout" a graph that is set by either an Adjacency vector or an Adjacency map to screen/to file line by line: one edge per line).
• Function EditDistA, an extended version of the function EditDist, has been added (it returns not only the value of the edit distance between 2 strings but also one possible version of the alignment itself).
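
The idea of returning an alignment together with the edit distance can be sketched as a traceback through the full DP table. This is a generic textbook approach written for this post; the struct Alignment and the function AlignByEditDistance are hypothetical names, and EditDistA itself may work differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Alignment {
    std::size_t distance;
    std::string top;     // first string, with '-' marking gaps
    std::string bottom;  // second string, with '-' marking gaps
};

// Fills the edit-distance table, then walks back from the bottom-right cell
// to recover one optimal alignment (others may exist).
Alignment AlignByEditDistance(const std::string& a, const std::string& b) {
    const std::size_t n = a.size(), m = b.size();
    std::vector<std::vector<std::size_t>> d(n + 1, std::vector<std::size_t>(m + 1));
    for (std::size_t i = 0; i <= n; ++i) d[i][0] = i;
    for (std::size_t j = 0; j <= m; ++j) d[0][j] = j;
    for (std::size_t i = 1; i <= n; ++i)
        for (std::size_t j = 1; j <= m; ++j) {
            std::size_t diagonal = d[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            d[i][j] = std::min({diagonal, d[i - 1][j] + 1, d[i][j - 1] + 1});
        }
    Alignment out{d[n][m], "", ""};
    std::size_t i = n, j = m;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            d[i][j] == d[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)) {
            out.top += a[--i];   // match or substitution
            out.bottom += b[--j];
        } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) {
            out.top += a[--i];   // character of a aligned to a gap
            out.bottom += '-';
        } else {
            out.top += '-';      // character of b aligned to a gap
            out.bottom += b[--j];
        }
    }
    std::reverse(out.top.begin(), out.top.end());
    std::reverse(out.bottom.begin(), out.bottom.end());
    return out;
}
```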
0
3.6 years ago

09.08.2019 update:

• The "NBPaths" group of functions (for finding maximal non-branching paths in a graph, weighted or not, directed or not) has been added.
• Functions ConsStringQ1 and ConsStringQ2, for building a consensus string from a given collection of strings according to their quality, have been added. Note that, because little data was available for testing, errors may be found here (please notify me if you find any).
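
For readers unfamiliar with maximal non-branching paths, here is a minimal sketch of the standard approach for a directed, unweighted graph: start at every vertex that is not "1-in-1-out", extend each outgoing edge through 1-in-1-out vertices, then collect isolated cycles. The flat (from, to) pair layout and the name MaximalNonBranchingPaths are assumptions; the NBPaths functions also cover weighted and undirected graphs, which this sketch does not.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Assumed layout: flat list of directed (from, to) pairs.
std::vector<std::vector<int>> MaximalNonBranchingPaths(const std::vector<int>& edges) {
    std::map<int, std::vector<int>> out;
    std::map<int, int> in_degree;
    for (std::size_t k = 0; k + 1 < edges.size(); k += 2) {
        out[edges[k]].push_back(edges[k + 1]);
        out[edges[k + 1]];        // register the target vertex
        in_degree[edges[k]];      // register the source vertex
        ++in_degree[edges[k + 1]];
    }
    auto one_in_one_out = [&](int v) {
        return in_degree[v] == 1 && out[v].size() == 1;
    };
    std::vector<std::vector<int>> paths;
    std::set<int> used;  // 1-in-1-out vertices already placed on some path
    for (const auto& entry : out) {
        const int v = entry.first;
        if (one_in_one_out(v)) continue;
        for (int w : entry.second) {
            std::vector<int> path{v, w};
            while (one_in_one_out(w)) {
                used.insert(w);
                w = out[w][0];
                path.push_back(w);
            }
            paths.push_back(path);
        }
    }
    // Remaining 1-in-1-out vertices lie on isolated cycles.
    for (const auto& entry : out) {
        const int v = entry.first;
        if (!one_in_one_out(v) || used.count(v)) continue;
        std::vector<int> cycle{v};
        used.insert(v);
        for (int w = out[v][0]; w != v; w = out[w][0]) {
            cycle.push_back(w);
            used.insert(w);
        }
        cycle.push_back(v);
        paths.push_back(cycle);
    }
    return paths;
}
```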
0
3.5 years ago

31.08.2019 update:

• Function GenRandomUWGraph, which generates a random unweighted graph (as its "Adjacency vector"), has been added.
• A group of functions has been added for finding the collection of vertices of each strongly connected component of a directed graph and of each connected component of an undirected graph.
• A group of functions for counting edge multiplicity in a graph that is set by an Adjacency vector has been added.
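
As an illustration of the connected-components item above (undirected case only), here is a minimal sketch using an iterative depth-first search. The flat (u, v) pair layout and the name ConnectedComponents are assumptions for this example; strongly connected components of directed graphs need a different algorithm and are not covered here.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Assumed layout: flat list of (u, v) pairs describing undirected edges.
// Returns the sorted vertex list of every connected component.
std::vector<std::vector<int>> ConnectedComponents(const std::vector<int>& edges) {
    std::map<int, std::vector<int>> neighbours;
    for (std::size_t k = 0; k + 1 < edges.size(); k += 2) {
        neighbours[edges[k]].push_back(edges[k + 1]);
        neighbours[edges[k + 1]].push_back(edges[k]);
    }
    std::set<int> visited;
    std::vector<std::vector<int>> components;
    for (const auto& entry : neighbours) {
        if (visited.count(entry.first)) continue;
        std::vector<int> component;
        std::vector<int> stack{entry.first};
        visited.insert(entry.first);
        while (!stack.empty()) {
            const int v = stack.back();
            stack.pop_back();
            component.push_back(v);
            for (int w : neighbours[v])
                if (visited.insert(w).second) stack.push_back(w);  // first visit
        }
        std::sort(component.begin(), component.end());
        components.push_back(component);
    }
    return components;
}
```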
0
3.5 years ago

19.10.2019:

• Updated the GraphCout and GraphFout groups of functions to deal with mega-maps.

0
3.4 years ago

03.11.2019

• The Num group of functions has been updated.
• Function ScoreStringMatrix, which counts the score (i.e. the total number of mismatches) over a vector of strings, has been added.
• Function GPPM, which generates a position probability matrix (PPM), has been added. Note that pseudocounts may be used (the formula (Ns+z)/(N+2*z) is implemented).
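
A minimal sketch of building a PPM with the pseudocount formula quoted above, (Ns+z)/(N+2*z), where Ns is the count of a letter in a column, N the number of strings and z the pseudocount. The DNA alphabet "ACGT", the row order, and the names Ppm and BuildPpm are assumptions for this example and not the signature of GPPM.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

using Ppm = std::array<std::vector<double>, 4>;  // rows: A, C, G, T (assumed)

// ppm[r][j] = (Ns + z) / (N + 2*z), the formula as quoted in the post, where
// Ns counts letter "ACGT"[r] in column j over N input strings.
Ppm BuildPpm(const std::vector<std::string>& motifs, double z) {
    const std::string alphabet = "ACGT";
    const std::size_t width = motifs.empty() ? 0 : motifs[0].size();
    Ppm ppm;
    for (auto& row : ppm) row.assign(width, 0.0);
    for (const std::string& s : motifs)
        for (std::size_t j = 0; j < width && j < s.size(); ++j) {
            const std::size_t r = alphabet.find(s[j]);
            if (r != std::string::npos) ppm[r][j] += 1.0;  // raw count Ns
        }
    const double n = static_cast<double>(motifs.size());
    for (auto& row : ppm)
        for (double& cell : row) cell = (cell + z) / (n + 2.0 * z);
    return ppm;
}
```

One observation, offered with the caveat that it applies to the four-letter alphabet assumed in this sketch: with the denominator N + 2*z, a column sums to exactly 1 only when z = 0.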
0
3.3 years ago

26.11.2019

• For the functions ConsStringQ1 and ConsStringQ2 (intended for finding a consensus string; quality may or may not be taken into consideration) the default method is now set to 1.
• Function JoinOverlapStrings for joining overlapping strings has been added (quality may or may not be taken into consideration). So if we need to join the collection 0->ACGT, 1->TGTA, 1->TT, 10->TT, 11->TCA without any additional info, we should set NoQuality = true, Aggregate = false, and the result is: 0->ATGTA, 10->TTC.
• Function ProfileProbableMer has been added to find all most probable j-mers in a given string according to a given position probability matrix (PPM).
• Function CycleToPath has been added.
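
The "most probable j-mers under a PPM" item above can be sketched as a sliding window that multiplies the per-column probabilities. The alphabet "ACGT", the row order, and the names Ppm and MostProbableMers are assumptions for this example, not the interface of ProfileProbableMer.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

using Ppm = std::array<std::vector<double>, 4>;  // rows: A, C, G, T (assumed)

// Returns every window of length j (j = number of PPM columns) whose
// probability under the PPM equals the maximum. All rows are assumed to have
// the same length; a j-mer occurring twice in the text is reported twice.
std::vector<std::string> MostProbableMers(const std::string& text, const Ppm& ppm) {
    const std::string alphabet = "ACGT";
    const std::size_t j = ppm[0].size();
    std::vector<std::string> best;
    if (j == 0 || text.size() < j) return best;
    double best_p = -1.0;
    for (std::size_t start = 0; start + j <= text.size(); ++start) {
        double p = 1.0;
        for (std::size_t k = 0; k < j; ++k) {
            const std::size_t r = alphabet.find(text[start + k]);
            p *= (r == std::string::npos) ? 0.0 : ppm[r][k];
        }
        if (p > best_p) {
            best_p = p;
            best.assign(1, text.substr(start, j));
        } else if (p == best_p) {
            best.push_back(text.substr(start, j));
        }
    }
    return best;
}
```

Ties are detected with exact floating-point equality, which is a simplification; a tolerance may be more appropriate for real data.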
0
3.2 years ago

11.01.2020

• Added the groups of functions UPGMA_UndirectedGraph and NeighborJoiningUndirectedGraph for generating a tree (as an undirected graph) from a given distance matrix.
0
3.1 years ago

05.03.2020:

• Added experimental functions for finding all cycles in a graph (Circles_in_Graph) and for finding all paths between any two vertices in a directed graph (AllPathsDGraph).

06.03.2020:

• Added function SubGraphsInscribed to solve a particular case of the problem of finding, in some graph A, all subgraphs that are isomorphic to a given graph B (only "inscribed" subgraphs can be found). The function may also be used to check whether 2 graphs are isomorphic. This function can work with: directed or undirected graphs; graphs that have more than one connected component/strongly connected component; graphs that contain multiple edges.

"Inscribed" means here that (1) the subgraph is "glued" to other parts of A only by edges connected to those of its vertices that are the begin/end vertices of some max-length non-branching path of this subgraph, and/or (2) graph A may have some other connected components. E.g. for graph B = {0->2, 10->2, 2->3, 3->4, 4->5, 4->6} we will find only A1 = {0->2, 1->2, 2->3, 3->4, 4->5, 4->6} as an inscribed isomorphic subgraph of A = {0->2, 7->1, 1->2, 2->3, 3->4, 4->5, 4->6}. But if we add edge 3->8 to A (in this case A = {3->8, 0->2, 7->1, 1->2, 2->3, 3->4, 4->5, 4->6}), we cannot find any inscribed subgraph of A that is isomorphic to B.

A preprint (in Russian) on this approach to solving the (sub)graph isomorphism problem is here: dx.doi.org/10.24108/preprints-3111977

0
3.0 years ago

29.03.2020

• Added functions MedianString and GenerateAlphabet.
0
2.9 years ago

Some notes on finding isomorphic (sub)graphs (examples and timing estimates): On (sub)graph isomorphism

0
2.7 years ago

10.07.2020

• Functions SuffixTreeMake (to build a suffix tree from a string) and CoutSuffixTree & FoutSuffixTree (to output a suffix tree to screen or to file) have been added. The suffix tree is stored in the vector of integers Tree, every edge as a quartet of integers: the number of the start vertex of the edge, the number of the end vertex of the edge, the starting position of the substring in the base string, and the length of this substring.
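
To show how such a quartet encoding can be read back, here is a minimal sketch that turns the flat vector into labelled edges. It assumes 0-based starting positions, which the post does not state; the names SuffixTreeEdge and DecodeSuffixTree are hypothetical, and building the tree itself is not shown.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct SuffixTreeEdge {
    int from;
    int to;
    std::string label;  // substring of the base string carried by the edge
};

// Each edge is four consecutive integers: start vertex, end vertex,
// starting position in the base string (assumed 0-based), substring length.
std::vector<SuffixTreeEdge> DecodeSuffixTree(const std::vector<int>& tree,
                                             const std::string& text) {
    std::vector<SuffixTreeEdge> edges;
    for (std::size_t k = 0; k + 3 < tree.size(); k += 4) {
        if (tree[k + 2] < 0 || tree[k + 3] < 0) continue;  // malformed quartet
        const std::size_t start = static_cast<std::size_t>(tree[k + 2]);
        const std::size_t length = static_cast<std::size_t>(tree[k + 3]);
        if (start > text.size()) continue;                 // out of range
        edges.push_back(SuffixTreeEdge{tree[k], tree[k + 1], text.substr(start, length)});
    }
    return edges;
}
```

For the string "ab$", a tree with root 0 and three leaves could be written as {0,1,0,3, 0,2,1,2, 0,3,2,1}, which decodes to the edge labels "ab$", "b$" and "$".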
0
2.3 years ago

05.12.2020

• The extended experimental version of the function SubGraphsInscribed has been added. This extension/modification works with all edges of the input graphs instead of working with non-branching paths. If InscribedOnly == false, the function finds all (not only inscribed) subgraphs of unweighted graph A that are isomorphic to unweighted graph B. If InscribedOnly == true, the function looks for "inscribed" ones only.

Note 1. Running time depends strongly on the input data. If A and B have many similar segments, it will run for a very, very long time; otherwise it is much faster. For example, if they have a cycle with one edge having multiplicity = 3, etc.

Note 2. For undirected graphs the function will work much slower.

Here are test results for 05/12/2020: https://github.com/chernouhov/CBioInfCpp-0-/tree/master/TestsIsomorphicSubGraphsFinding

In particular, it found

• for a directed graph B (15 vertices, 20 edges) - 4536 isomorphic subgraphs in directed graph A (250-350), ~ 3 sec,

• for a directed graph B (25-35) - 82546 isomorphic subgraphs in directed graph A (2500-3500), ~ 1 min 40 sec,

• for an undirected graph B (15 vertices, 20 edges) - 69572 isomorphic subgraphs in undirected graph A (250-350), ~ 5 min.

0
2.2 years ago

12.01.2021

• Function SubGraphsInscribedM, an experimental version of the function SubGraphsInscribed, has been added. SubGraphsInscribedM can also find subgraphs of a given graph A that are isomorphic to a given template graph B; what is new is that the vertices of these graphs may have marks. It may be useful for chemistry, as one may associate an atom with a vertex (in case a molecule is set by a graph).

13.01.2021

0
2.1 years ago

Use in practice

Function SubGraphsInscribed is used by https://graphonline.ru/ (https://github.com/UnickSoft/GraphOffline, https://github.com/UnickSoft/graphonline) for solving the (sub)graph isomorphism problem.