Categorizing recorded information in an unfamiliar format?
13 months ago
zainabi8077 ▴ 20

I have a large quantity of captured data (perhaps hundreds of thousands of records) that I need to break down so that I can both classify it and generate "typical" data myself. Let me elaborate...

If I have the following data strings:

142T339G1P112S
164T797F5A498S
144T989B9B223T
155T928X9Z554T

... You may begin to deduce the following:

The fourth, eighth, tenth, and fourteenth characters may always be alphabetic, while the rest are numeric.
The first character may always be '1'.
The fourth character may always be the letter 'T'.
The fourteenth character may be confined to just 'S' or 'T'.
And so on...

Some of these "rules" may evaporate when additional samples of real data are obtained; if you see a 15 character long string, you have proof that the first "rule" is erroneous. However, if you have a sufficiently large sample of strings that are exactly 14 characters long, you can begin to assume that "all strings are 14 characters long" and assign a numerical figure to your degree of confidence (with an appropriate set of assumptions based on the fact that you're seeing a suitably random set of all possible captured data). As you might expect, a person can accomplish a lot of this classification by sight, but I'm not aware of any libraries or methods that would enable a machine to do it.

Given a collection of captured data (much more complicated than the above...), is there a library I can use in my code to do this kind of categorization for me, identifying "rules" with a specified degree of confidence?

At a guess, Python or Java (or perhaps Perl or R) are the "common" languages most likely to offer these kinds of tools, and perhaps some bioinformatics libraries might as well. I don't care which language I have to use; I just need to tackle the problem in whatever way I can.

Any kind of pointer would be very helpful. As you can probably tell, I'm having trouble describing the problem accurately, and there may be a set of relevant terms I could type into Google to send me in the right direction.

python • 446 views
13 months ago
Darked89 4.6k

For simplicity's sake, let's assume all your input strings have a fixed length. You can split them into individual characters and feed the whole set to the Python polars library. With some luck and some tweaking of the pl.read_csv options, you will get fairly trivial answers as to whether the character at a given position is a number or a letter. Then you can get some summaries: unique / repeated values in the whole data frame or in each of the columns. Finding correlations between columns (all vs. all) or fitting HMM models on a set of this size is another game.
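As a minimal sketch of that approach (building the frame in memory rather than via pl.read_csv, and with made-up pos_* column names; the example strings are taken from the question):

import polars as pl

raw = [
    "142T339G1P112S",
    "164T797F5A498S",
    "144T989B9B223T",
    "155T928X9Z554T",
]

# One column per character position, one row per input string.
width = len(raw[0])
df = pl.DataFrame({f"pos_{i + 1}": [s[i] for s in raw] for i in range(width)})

# Distinct values seen at each position; a single value or a small set hints
# at a "rule" for that column.
for name in df.columns:
    print(name, sorted(df[name].unique().to_list()))

From there, value_counts on each column (or a group_by) gives the frequency of every character at a given position, which maps directly onto the per-position "rules" and confidence figures described in the question.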

python polars API
