Question

Combining two sequence files

8.3 years ago

cdarwin • 0

Hello all,

I am fairly new to programming, so this is probably a very simple question. I have two files with ~100 sequences each. Each file contains a separate (aligned) gene. A given sequence in file1 comes from the exact same organism as a sequence in file2. However, they are not in order!

What I am trying to do is combine the two files so that sequence2 is tacked on to the end of sequence1. (I am not super concerned with formatting the sequence header, I know how to do that with regexps.)

The headers for two matching sequences (one from file1, one from file2) are shown below.

>tr|A0A097ATJ0|A0A097ATJ0_THEKI CO dehydrogenase/acetyl-CoA synthase complex, beta subunit OS=Thermoanaerobacter kivui GN=acsB PE=4 SV=1
>tr|A0A097ATM9|A0A097ATM9_THEKI Carbon-monoxide dehydrogenase catalytic subunit acsA OS=Thermoanaerobacter kivui GN=acsA PE=4 SV=1

Does anyone have any recommendations on how to even approach this? All of the things I have tried so far have been very convoluted.

Thank you all in advance.

Disclaimer: the following content is for the morbidly curious; viewer discretion is advised.

My approach up to this point was to create some sort of table in Python whose first column holds the organism name, parsed from the "OS=(organism_name)" field of each header. I was then going to load the two sequences for that organism into two separate columns and, finally, write everything out to one file. I don't know how to do a lot of this, but could learn.
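The table idea described above maps fairly naturally onto Python dictionaries keyed by organism name. The following is only a minimal sketch of that idea, not a tested solution for the real files: the function names (`read_fasta`, `organism`, `combine`) are made up for illustration, and it assumes UniProt-style headers in which the organism follows "OS=" and runs up to the next two-letter tag such as "GN=", with exactly one sequence per organism in each file.

```python
import re

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Assumption: the organism name is the text after "OS=" up to the next
# " XX=" style tag (for example " GN=") or the end of the header.
OS_PATTERN = re.compile(r"OS=(.+?)(?=\s+[A-Z]{2}=|$)")

def organism(header):
    """Extract the organism name from a UniProt-style header."""
    match = OS_PATTERN.search(header)
    if match is None:
        raise ValueError(f"no OS= field in header: {header!r}")
    return match.group(1)

def combine(fasta1_lines, fasta2_lines):
    """Return {organism: seq1 + seq2} for organisms found in both inputs.

    Assumption: each organism occurs at most once per file; a repeated
    organism would silently overwrite the earlier sequence.
    """
    first = {organism(h): s for h, s in read_fasta(fasta1_lines)}
    second = {organism(h): s for h, s in read_fasta(fasta2_lines)}
    return {org: first[org] + second[org] for org in first if org in second}
```

Passing two open file handles (say, for file1 and file2) to `combine` and writing each organism as a ">" header line followed by its joined sequence would give the combined file. Organisms present in only one file are dropped here, which may or may not be what is wanted.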

sequence fasta • 1.6k views

updated 21 months ago by Ram 43k • written 8.3 years ago by cdarwin • 0

Ram · Answer 1 · 2016-01-17

8.3 years ago

Pierre Lindenbaum 161k

cat sequence1.fa sequence2.fa |\
awk -f linearize.awk |\
awk '{O=index($0,"OS=");G=index($0,"GN=");printf("%s\t%s\n",substr($0,O+3,(G-O)-3),$0);}' |\
LC_ALL=C sort -k1,1 | cut -f 2- |\
tr "\t" "\n" > out.fa

linearize.awk :

updated 4.3 years ago by Ram 43k • written 8.3 years ago by Pierre Lindenbaum 161k

Thank you, Pierre. Since I am fairly new to Unix, would you mind explaining the basic gist of what is going on in this script?

updated 4.3 years ago by Ram 43k • written 8.3 years ago by cdarwin • 0

concatenate both sequence files
linearize: put each record on a single line (header, tab, sequence)
use awk: search for "OS=" and "GN=", and prepend a new column with the substring between those two words
sort on the first column
remove the first column
convert back to fasta by converting 'tab' to 'return'
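
For readers more at home in Python, the steps above can be sketched roughly as follows. This is an illustrative translation only, not the linearize.awk script itself; the function names are made up for this example, and it assumes every header contains "OS=" followed later by "GN=".

```python
def linearize(lines):
    """Collapse each FASTA record into one (header, sequence) pair."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            records.append([line, ""])
        elif records and line:
            records[-1][1] += line
    return [tuple(record) for record in records]

def organism_key(header):
    """Return the text between "OS=" and "GN=", like the awk index/substr step."""
    start = header.index("OS=") + 3
    end = header.index("GN=", start)
    return header[start:end]

def sort_by_organism(lines1, lines2):
    """Mirror the pipeline: concatenate, linearize, key, sort, drop key, re-expand."""
    records = linearize(list(lines1) + list(lines2))        # cat + linearize
    keyed = [(organism_key(h), h, s) for h, s in records]   # prepend key column
    keyed.sort(key=lambda row: row[0])                      # sort on first column
    # Drop the key column and write header and sequence on separate lines again.
    return "".join(f"{h}\n{s}\n" for _, h, s in keyed)
```

As with the shell pipeline, the result places the two records for each organism next to each other rather than joining them into one sequence. Python's sort is stable, so for a given organism the record from the first input stays ahead of the one from the second.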

updated 4.3 years ago by Ram 43k • written 8.3 years ago by Pierre Lindenbaum 161k