Question: Combining two sequence files
2.5 years ago by
United States
cdarwin wrote:

Hello all, 

I am fairly new to programming, so this is probably a very simple question. I have two files with ~100 sequences each. Each file contains a separate (aligned) gene. A given sequence in file1 comes from the exact same organism as a sequence in file2. However, they are not in order! 

What I am trying to do is combine the two files so that sequence2 is tacked on to the end of sequence1. (I am not super concerned with formatting the sequence header, I know how to do that with regexps.)

The headers for two matching sequences (one from file1, one from file2) are shown below. 

>tr|A0A097ATJ0|A0A097ATJ0_THEKI CO dehydrogenase/acetyl-CoA synthase complex, beta subunit OS=Thermoanaerobacter kivui GN=acsB PE=4 SV=1

>tr|A0A097ATM9|A0A097ATM9_THEKI Carbon-monoxide dehydrogenase catalytic subunit acsA OS=Thermoanaerobacter kivui GN=acsA PE=4 SV=1


Does anyone have any recommendations on how to even approach this? All of the things I have tried so far have been very convoluted. 

Thank you all in advance. 


Disclaimer: the following content is for the morbidly curious; viewer discretion is advised.

My approach up to this point was to build some sort of table in Python whose first column holds the "OS=(organism_name)" value parsed from each header. I was then thinking of loading the two sequences for that organism into two separate columns, and finally writing everything out to one file. I don't know how to do a lot of this, but could learn.

2.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum wrote:
cat sequence1.fa sequence2.fa |\
awk -f linearize.awk |\
awk '{O=index($0,"OS=");G=index($0,"GN=");printf("%s\t%s\n",substr($0,O+3,(G-O)-3),$0);}' |\
LC_ALL=C sort -k1,1 | cut -f 2- |\
tr "\t" "\n" > out.fa



linearize.awk :


Thank you, Pierre. Since I am sort of new to unix, would you mind explaining the basic gist of what is going on in this script? 


written 2.5 years ago by cdarwin

* concatenate both sequence files

* linearize (put each record's header and its sequence on one tab-separated line)

* use awk: search for "OS=" and "GN=", and prepend a new column holding the substring between those two words

* sort on first column

* remove first column

* convert back to fasta by converting 'tab' to 'return'

written 2.5 years ago by Pierre Lindenbaum
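
For readers who would rather stay in Python, which the question mentions, the steps listed in the reply above can be sketched roughly as follows. This is an illustration only and not part of the original answer: the function names `read_fasta`, `organism`, and `combine` are hypothetical, and the sketch assumes that every header carries both an `OS=` and a `GN=` field and that each organism appears exactly once in each file. The shell pipeline appears to only place matching records next to each other, whereas the last step here also joins the two sequences.

```python
import re
from collections import defaultdict


def read_fasta(lines):
    """Linearize FASTA: yield (header, sequence) with the sequence on one line."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def organism(header):
    """Return the text between 'OS=' and 'GN=' (assumes both fields are present)."""
    match = re.search(r"OS=(.*?)\s+GN=", header)
    if match is None:
        raise ValueError(f"no OS=/GN= fields in header: {header!r}")
    return match.group(1)


def combine(lines1, lines2):
    """Append each file2 sequence to the file1 sequence from the same organism."""
    by_org = defaultdict(list)
    # File 1 is read first, so its sequence always comes first in the pair.
    for lines in (lines1, lines2):
        for header, seq in read_fasta(lines):
            by_org[organism(header)].append(seq)
    # Sort by organism name, then concatenate the sequences collected for it.
    # Assumption: exactly one record per organism per file; this is not checked.
    return [(org, "".join(seqs)) for org, seqs in sorted(by_org.items())]
```

`combine` accepts any iterable of lines, so two open file handles (for example for `sequence1.fa` and `sequence2.fa`) can be passed directly, and each returned `(organism, sequence)` pair can be written out as a `>organism` header line followed by the joined sequence. Because the genes are aligned, gap characters are kept as they are. An organism present in only one file would come out with a single, unpaired sequence, so checking the number of sequences per organism before writing is worthwhile.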