Question: Combining two sequence files
2.8 years ago by
cdarwin0
United States
cdarwin0 wrote:

Hello all, 

I am fairly new to programming, so this is probably a very simple question. I have two files with ~100 sequences each. Each file contains a separate (aligned) gene. A given sequence in file1 comes from the exact same organism as a sequence in file2. However, they are not in order! 

What I am trying to do is combine the two files so that sequence2 is tacked on to the end of sequence1. (I am not super concerned with formatting the sequence header, I know how to do that with regexps.)

The headers for two matching sequences (one from file1, one from file2) are shown below. 

>tr|A0A097ATJ0|A0A097ATJ0_THEKI CO dehydrogenase/acetyl-CoA synthase complex, beta subunit OS=Thermoanaerobacter kivui GN=acsB PE=4 SV=1

>tr|A0A097ATM9|A0A097ATM9_THEKI Carbon-monoxide dehydrogenase catalytic subunit acsA OS=Thermoanaerobacter kivui GN=acsA PE=4 SV=1

 

Does anyone have any recommendations on how to even approach this? All of the things I have tried so far have been very convoluted. 

Thank you all in advance. 

 

Disclaimer: the following is for the morbidly curious; viewer discretion is advised.

My approach up to this point was to build some sort of table in Python, with the first column holding the organism name parsed from the "OS=(organism_name)" field of each header. I was then thinking of loading that organism's two sequences into two separate columns and, finally, writing everything out to one file. I don't know how to do a lot of this, but I could learn.
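The table idea described above could be sketched roughly as follows. This is only a minimal sketch under some assumptions: every header has an "OS=" field that runs up to the next two-letter "XX=" field, each organism appears once per file, and sequences fit in memory. The function names (read_fasta, organism, combine) and the file names in the usage comment are hypothetical.

```python
import re

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def organism(header):
    """Return the text after 'OS=' up to the next 'XX=' field (or end of line)."""
    match = re.search(r"OS=(.+?)(?=\s+[A-Z]{2}=|$)", header)
    if match is None:
        raise ValueError("no OS= field in header: " + header)
    return match.group(1).strip()

def combine(path1, path2, out_path):
    """Append each file2 sequence to the file1 sequence from the same organism."""
    # Assumption: one sequence per organism in file2; later duplicates would win.
    second = {organism(h): seq for h, seq in read_fasta(path2)}
    with open(out_path, "w") as out:
        for header, seq in read_fasta(path1):
            name = organism(header)
            if name in second:  # organisms missing from file2 are skipped
                out.write(">" + name + "\n" + seq + second[name] + "\n")

# Hypothetical usage: combine("file1.fa", "file2.fa", "combined.fa")
```

The dictionary plays the role of the table: the organism name is the key, so the order of records in the second file does not matter.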

2.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum114k wrote:
cat sequence1.fa sequence2.fa |\
awk -f linearize.awk |\
awk '{O=index($0,"OS=");G=index($0,"GN=");printf("%s\t%s\n",substr($0,O+3,(G-O)-3),$0);}' |\
LC_ALL=C sort -k1,1 | cut -f 2- |\
tr "\t" "\n" > out.fa

 

 

linearize.awk : https://gist.github.com/lindenb/2c0d4e11fd8a96d4c345


Thank you, Pierre. Since I am sort of new to unix, would you mind explaining the basic gist of what is going on in this script? 

 

written 2.8 years ago by cdarwin

* concatenate both sequence files

* linearize: put each record (header and sequence) on a single tab-separated line

* use awk to search for "OS=" and "GN=" and prepend a new column containing the substring between those two words (the organism name)

* sort on the first column

* remove the first column

* convert back to FASTA by converting each tab to a newline

written 2.8 years ago by Pierre Lindenbaum
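For readers more comfortable with Python than awk, the same steps can be mirrored roughly as below. This is a sketch rather than a translation of the actual scripts: the function names are hypothetical, and it assumes every header contains both "OS=" and "GN=" (Python's str.index raises ValueError if one is missing). Like the pipeline, it only places records from the same organism next to each other; it does not join their sequences.

```python
def linearize(lines):
    """Turn FASTA lines into one (header, sequence) tuple per record."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def sort_by_organism(lines):
    """Key each record on the text between 'OS=' and 'GN=', sort, then drop the key."""
    keyed = []
    for header, seq in linearize(lines):
        start = header.index("OS=") + 3
        end = header.index("GN=")
        keyed.append((header[start:end], header, seq))  # prepend the key column
    keyed.sort(key=lambda row: row[0])                  # sort on the first column
    out = []
    for _key, header, seq in keyed:                     # remove the key, back to FASTA lines
        out.extend([header, seq])
    return out

# Hypothetical usage:
# with open("sequence1.fa") as a, open("sequence2.fa") as b:
#     print("\n".join(sort_by_organism(list(a) + list(b))))
```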