^([A-Z][A-Z][0-9].*?) .*(chromosome) ([0-9]+).*
Apologies, I think I misunderstood your post the first time around. You don't have 300 lines, you have 300 organisms (each with many lines).
Here, there are two approaches I can think of ...
First, I think this issue can be considered a bioinformatics problem, which is in part why it is easily soluble.
1. Obtain a list of the unique identifiers (UIDs) found in ALL your genome fasta files, using $1 from a regex like the one in my original answer above. For instance: CM10030304.1, CM42994.1, and so on.
2. Go to a large database like RefSeq. For every UID in your list, pull the linked records from nuccore, using whatever method of programmatic access you like (eFetch, for instance).
3. Process the output from 2. so that you end up with a hashtable-like object in which the keys are the UIDs and the values are the linked records. For instance, in python3 a dict comprehension would work.
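A rough python3 sketch of the three steps so far, under a few assumptions: the header lines start with ">", the eFetch call goes through the NCBI E-utilities URL with GenBank flat-file output, and `parsed_records` is a hypothetical mapping you would build yourself by parsing whatever eFetch returns. The function names, the example headers, and the organism name are all made up for illustration.

```python
import re
from urllib.parse import urlencode

# Group 1 corresponds to $1 in the regex from the original answer.
UID_REGEX = re.compile(r"^([A-Z][A-Z][0-9].*?) ")


def collect_uids(header_lines):
    """Step 1: unique identifiers, in first-seen order."""
    uids = []
    for line in header_lines:
        # Assumption: fasta headers start with ">", which the regex does not expect.
        match = UID_REGEX.match(line.lstrip(">"))
        if match and match.group(1) not in uids:
            uids.append(match.group(1))
    return uids


def efetch_url(uids):
    """Step 2: URL asking NCBI eFetch for the GenBank records of these UIDs."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    params = {"db": "nuccore", "id": ",".join(uids), "rettype": "gb", "retmode": "text"}
    return base + "?" + urlencode(params)


def build_attribute_dict(uids, parsed_records):
    """Step 3: UID -> list of attribute strings from its record.

    parsed_records is a hypothetical {UID: {field: value}} mapping that you
    would build by parsing whatever eFetch returned.
    """
    return {uid: [str(value) for value in parsed_records[uid].values()]
            for uid in uids if uid in parsed_records}


# Made-up headers and record fields, for illustration only.
headers = [
    ">CM10030304.1 Examplus organismus chromosome 10",
    "ACGTACGT",
    ">CM42994.1 Examplus organismus chromosome 2",
]
uids = collect_uids(headers)
parsed_records = {
    "CM10030304.1": {"organism": "Examplus organismus"},
    "CM42994.1": {"organism": "Examplus organismus"},
}
allAttributeDict = build_attribute_dict(uids, parsed_records)
```

The sketch only builds the eFetch URL and never sends the request, so you can plug in whichever client and rate limiting you prefer. Which record fields are worth keeping as attributes depends on what actually shows up in your headers.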
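Finally, simply loop over all lines in all 300 genome fasta files. For instance, in python3 we could write something like:

    import re
    uidRegex = re.compile(r"([A-Z][A-Z][0-9].*?) ")  # group 1 is $1 from the regex above
    # fastas: open file handles; keep only lines that carry a UID, minus any leading ">"
    allLines = [l.strip().lstrip(">") for fasta in fastas for l in fasta if uidRegex.match(l.lstrip(">"))]
    for ls in allLines:
        nuccoreUID = uidRegex.match(ls).group(1)
        shrinkingLine = ls
        for nuccoreAttribute in allAttributeDict[nuccoreUID]:
            shrinkingLine = shrinkingLine.replace(nuccoreAttribute, "")  # drop what nuccore already explains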
Let me know if you have further questions, but this should more or less take care of everything.
shrinkingLine should now be a string containing ONLY the parts of your line that were hardest to define matches for, as you note in your original post (i.e., LG10, chromosome some such, scaffold what-have-you, etc.). At this point, your computer should be holding everything in memory that you need to easily do what you propose in your post.
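To make that concrete, here is a small worked example of the shrinking step on a single made-up header. The helper name `shrink_line`, the header text, and the attribute list are all hypothetical. Removing the longest attributes first is my own precaution, not something the approach requires.

```python
import re


def shrink_line(line, attributes):
    """Remove every known attribute string from line, then tidy whitespace."""
    shrinking_line = line
    # Longest first, so a short attribute cannot break up a longer one.
    for attribute in sorted(attributes, key=len, reverse=True):
        shrinking_line = shrinking_line.replace(attribute, "")
    return " ".join(shrinking_line.split())


# Made-up header and attributes, for illustration only.
header = ("CM10030304.1 Examplus organismus isolate X1 chromosome 10, "
          "whole genome shotgun sequence")
known = ["CM10030304.1", "Examplus organismus", "whole genome shotgun sequence"]

leftover = shrink_line(header, known)

# The leftover is short enough for a simple pattern, like the tail of the regex above.
chromosome = re.search(r"(chromosome) ([0-9]+)", leftover)
```

Here `leftover` comes out as `isolate X1 chromosome 10,`, and `chromosome.group(2)` is `10`.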
To close, I think there are other approaches to this problem that could be used, for instance one based on information entropy. I would not call those bioinformatics approaches, though (they are more akin to statistical learning).