Combining two sequence files
8.3 years ago
cdarwin • 0

Hello all,

I am fairly new to programming, so this is probably a very simple question. I have two files with ~100 sequences each. Each file contains a separate (aligned) gene. A given sequence in file1 comes from the exact same organism as a sequence in file2. However, they are not in order!

What I am trying to do is combine the two files so that sequence2 is tacked on to the end of sequence1. (I am not super concerned with formatting the sequence header, I know how to do that with regexps.)

The headers for two matching sequences (one from file1, one from file2) are shown below.

>tr|A0A097ATJ0|A0A097ATJ0_THEKI CO dehydrogenase/acetyl-CoA synthase complex, beta subunit OS=Thermoanaerobacter kivui GN=acsB PE=4 SV=1
>tr|A0A097ATM9|A0A097ATM9_THEKI Carbon-monoxide dehydrogenase catalytic subunit acsA OS=Thermoanaerobacter kivui GN=acsA PE=4 SV=1

Does anyone have any recommendations on how to even approach this? All of the things I have tried so far have been very convoluted.

Thank you all in advance.

Disclaimer: the following content is for the morbidly curious; viewer discretion is advised.

My approach up to this point was to create some sort of table in Python, with the first column filled with the "OS=(organism_name)" value parsed from each header. I was then thinking of loading the two sequences for that organism into two separate columns, and finally writing everything out to one file. I don't know how to do a lot of this, but could learn.
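
For reference, a minimal sketch of that table idea in Python. The function names (`read_fasta`, `organism`, `combine`) are made up for illustration, and the sketch assumes each organism appears once per file and that the `OS=` value is followed by another two-letter field such as `GN=` or `PE=`, as in the headers shown above.

```python
import re
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def organism(header):
    """Return the text after 'OS=' up to the next ' XX=' field, or None."""
    match = re.search(r"OS=(.+?)(?=\s+[A-Z]{2}=|$)", header)
    return match.group(1).strip() if match else None


def combine(lines1, lines2):
    """Build {organism: [seq_file1, seq_file2]} and join the two columns."""
    table = defaultdict(lambda: [None, None])
    for column, lines in enumerate((lines1, lines2)):
        for header, seq in read_fasta(lines):
            org = organism(header)
            if org is None:
                continue  # header without an OS= field: skipped (assumption)
            # A repeated organism within one file overwrites the earlier entry.
            table[org][column] = seq
    # Keep only organisms found in both files; sequence2 follows sequence1.
    return {
        org: first + second
        for org, (first, second) in table.items()
        if first is not None and second is not None
    }
```

With two open files this could be called as `combine(open("file1.fa"), open("file2.fa"))` (file names are placeholders), then each entry written back out as `>organism` followed by the joined sequence.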

sequence fasta • 1.6k views
8.3 years ago
cat sequence1.fa sequence2.fa |\
awk -f linearize.awk |\
awk '{O=index($0,"OS=");G=index($0,"GN=");printf("%s\t%s\n",substr($0,O+3,(G-O)-3),$0);}' |\
LC_ALL=C sort -k1,1 | cut -f 2- |\
tr "\t" "\n" > out.fa

linearize.awk :

Thank you, Pierre. Since I am sort of new to unix, would you mind explaining the basic gist of what is going on in this script?

  • concatenate both sequence files
  • linearize: put each record (header and sequence) on a single tab-separated line
  • use awk: search for "OS=" and "GN=", and prepend a new column containing the substring between those two words (the organism name)
  • sort on the first column
  • remove the first column
  • convert back to fasta by converting 'tab' to 'return'
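
If Python is easier to read than awk, the same steps can be sketched roughly as below. This is an illustration only, not the missing `linearize.awk`; the function names are hypothetical, and the key extraction assumes every header contains both "OS=" and "GN=".

```python
def linearize(lines):
    """Collapse each FASTA record into one 'header<TAB>sequence' row."""
    rows, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                rows.append(header + "\t" + "".join(chunks))
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        rows.append(header + "\t" + "".join(chunks))
    return rows


def organism_key(row):
    """Text between 'OS=' and 'GN=', like the awk index()/substr() step."""
    start = row.find("OS=")
    end = row.find("GN=")
    if start == -1 or end == -1 or end < start:
        return ""  # rows without both fields get an empty key (assumption)
    return row[start + 3:end]


def sort_by_organism(lines):
    """lines: the lines of both FASTA files, one file after the other."""
    rows = linearize(lines)                          # linearize
    keyed = [(organism_key(r), r) for r in rows]     # prepend the key column
    keyed.sort(key=lambda pair: pair[0])             # sort on the first column
    out = []
    for _key, row in keyed:                          # drop the key column
        header, seq = row.split("\t", 1)
        out.extend([header, seq])                    # tab back to newline
    return out
```

Python's sort is stable, so for one organism the record from the first file stays ahead of the record from the second. As with the pipeline, matching records end up next to each other rather than merged into a single sequence.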