Hello all,
I am fairly new to programming, so this is probably a very simple question. I have two files with ~100 sequences each. Each file contains the alignment of a different gene. Every sequence in file1 comes from the exact same organism as one sequence in file2. However, the sequences are not in the same order in the two files!
What I am trying to do is combine the two files so that sequence2 is tacked on to the end of sequence1. (I am not super concerned with formatting the sequence header, I know how to do that with regexps.)
The headers for two matching sequences (one from file1, one from file2) are shown below.
>tr|A0A097ATJ0|A0A097ATJ0_THEKI CO dehydrogenase/acetyl-CoA synthase complex, beta subunit OS=Thermoanaerobacter kivui GN=acsB PE=4 SV=1
>tr|A0A097ATM9|A0A097ATM9_THEKI Carbon-monoxide dehydrogenase catalytic subunit acsA OS=Thermoanaerobacter kivui GN=acsA PE=4 SV=1
Does anyone have any recommendations on how to even approach this? All of the things I have tried so far have been very convoluted.
Thank you all in advance.
Disclaimer: the following content is for the morbidly curious. Viewer discretion is advised.
My approach up to this point was to create some sort of table in Python whose first column holds the organism name, parsed from the "OS=(organism_name)" part of the header. Then I was thinking of loading the two sequences for that organism into two separate columns. Finally, I was going to write everything out to one file. I don't know how to do a lot of this, but I could learn.
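For what it's worth, the table idea above could be sketched roughly like this in plain Python, with a dictionary keyed by organism name standing in for the table. This is only a minimal sketch under some assumptions that are not confirmed anywhere in this thread: each organism name occurs exactly once per file, the "OS=" field is followed by another two-letter field such as "GN=" or by the end of the header, and the function names and the command-line file arguments are made up for illustration.

```python
import re
import sys

# Assumption: the organism name runs from "OS=" up to the next
# two-letter field (GN=, PE=, SV=, ...) or to the end of the header.
OS_PATTERN = re.compile(r"OS=(.+?)(?=\s+[A-Z]{2}=|$)")


def read_fasta(lines):
    """Return a list of (header, sequence) pairs from FASTA-formatted lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def organism(header):
    """Extract the organism name from the OS= field of a header."""
    match = OS_PATTERN.search(header)
    if match is None:
        raise ValueError("no OS= field in header: " + header)
    return match.group(1).strip()


def concatenate(records1, records2):
    """Append each file2 sequence to the file1 sequence of the same organism.

    Assumption: one sequence per organism per file. Organisms missing
    from file2 are skipped. The result keeps the order of file1.
    """
    by_organism = {organism(h): seq for h, seq in records2}
    merged = []
    for header, seq in records1:
        name = organism(header)
        if name in by_organism:
            merged.append((name, seq + by_organism[name]))
    return merged


# Hypothetical usage: python combine.py file1.fasta file2.fasta combined.fasta
if __name__ == "__main__" and len(sys.argv) == 4:
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        combined = concatenate(read_fasta(f1), read_fasta(f2))
    with open(sys.argv[3], "w") as out:
        for name, seq in combined:
            out.write(">" + name + "\n" + seq + "\n")
```

The output headers here are just the organism names, and each combined sequence is written on a single line. If two strains share the same "OS=" name, the dictionary would silently keep only the last one, so it would be worth checking for duplicates first.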
Thank you, Pierre. Since I am fairly new to Unix, would you mind explaining the gist of what is going on in this script?