Command For Merging/Joining
4
0
Entering edit mode
7.3 years ago
akhalili26 • 0

Hi, My protein of interest is a multiple domain protein, and I would like to prepare a "structure based multiple sequence alignment" of it (around 1000 sequences).

Since the available structures of my protein are in different conformational states, I had to make multiple sequence alignments for EACH DOMAIN SEPARATELY. Now I have 6 high quality alignments, corresponding to the six domains of my protein.

The question is that: How can I merge/attach/join these files together? Notes:

• I am new to programming!
• All files are in Fasta format (also i have them in Clustal format)
• The length of ALL sequences in each file is the same (but the length of sequences in different files are different. For example in File 1, the length of all sequences is 120, and in file 2 all of them are 78)
• Sequences are arranged base on the "ID" in these files (i.e. the sequence IDs in the files are the same & are in the same order)
• I've used cat command, but it just added sequences at the end of file!
• Example files are presented below (I have shown only the first 4 sequences. there are around 1000 sequences). they are not real sequences.

Cheers, Ali

#

#

File 1:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
MLGWIAKKII
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
MLGWIAKKII
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
MLGWIAKKII
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
MLGWIAKKII


File 2:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
--GWIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
--GWIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
--GWIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
--GWIAKKI-AAAA


File 3:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
--GGYYYKLIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
--GGYYYKLIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
--GGYYYKLIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
--GGYYYKLIAKKI-AAAA


File 4:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
--------HHWSGYYYKLIAKKI-AAAA--D
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
--------HHWSGYYYKLIAKKI-AAAA--D
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
--------HHWSGYYYKLIAKKI-AAAA--D
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
--------HHWSGYYYKLIAKKI-AAAA--D


File 5:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
AEE
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
EEE
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
EEE
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
EEE


File 6:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
A
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
A
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
A
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
A


So the final file should look like this: Result file:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DAEEA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DEEEA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DEEEA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DEEEA

multiple-alignment • 5.1k views
1
Entering edit mode

0
Entering edit mode

1
Entering edit mode

That looks like CLUSTAL format, or something, but is not at all fasta.

2
Entering edit mode

It's called FASTA multiple alignment format; similar to FASTA but gap characters ("-") are allowed - http://www.bioperl.org/wiki/FASTA_multiple_alignment_format.

0
Entering edit mode

do you want to merge the sequences together or files? basically, it is the same ID?

0
Entering edit mode

I didn't get your question but if you are trying to join files side by side instead of end to end, you can try "join" in UNIX. Check here: http://www.albany.edu/~ig4895/join.htm

0
Entering edit mode

Does all of your 1000 sequences have their name ending with "SecA"?

4
Entering edit mode
7.3 years ago
pld 4.9k

Here is a python script that will concat your alignments, assuming each alignment has the same id in the fasta header

import sys
import re
def __main__():
infile  = sys.argv[1]
outfile = sys.argv[2]

with open(infile, 'rb') as fi:
seqs = [x.split('\n')[:2] for x in seqs]

merge = dict()

#index 1 is seq
for x in seqs:
key = re.search('(?<=gi\|)([0-9]*)(?=\|)', x[0]).group(0)
try:
merge[key] = merge[key] + x[1]

except KeyError:
merge[key] = x[1]

#write results to file
with open(outfile, 'w') as fi:
for x in merge.keys():
fi.write('>' + x + '\n')
fi.write(merge[x] + '\n')

__main__()


To put your individual files into a single file:

cat f1.fasta, f2.fasta, ... > bigfile.fasta


You would then run the above script with:

python myscript.py bigfile.fasta outputfile.fasta


Just remember to adjust the file names here to whatever you're working with.

0
Entering edit mode

Sorry for the errors in the code, I verified the regex but typed it into the post, hence the errors. Everything works now.

1
Entering edit mode
7.3 years ago

If I understand correctly, you have six files (one for each domain) and each file having 1000 sequences.

1) Excluding the clustalw file for the first domain, run the following command for all the other five files. Make sure you have backed up your files so if something goes wrong, your original files remain unchanged.

sed -i ';s/.* //' File_name_2

sed -i ';s/.* //' File_name_3

........

sed -i ';s/.* //' File_name_6

2) Then use join command in unix and concatenate the first file with the second , third .. sixth file.

join File_name_1 File_name_2 ... File_name_6 > Mega_file

Hope it works.

0
Entering edit mode

I'm new to scripting! I got this when running sed -i ';s/.* //' domain2_cut_clustal

sed: 1: "domain2_cut_clustal": extra characters at the end of d command

0
Entering edit mode

There should be a gap between "'" and domain_cut_clustal

0
Entering edit mode

Thanks for your help. It does the job of combining the files. BUT, When I tried to use the final "Mega_file", i realized that it is not a Clustal file! since in the middle of the file some of the lines don't meet the Clustal's standard. (i. e. some of the lines have less than 60 sequence symbols. This is because, for example, the last lines of file 2, which are in the middle of the "Mega_file", have 24 sequence symbols (which i believe will be interpreted as the end of file!))

any suggestion for turning this "Mega file" into a Fasta file or Clustal file?

1
Entering edit mode
7.3 years ago
Neilfws 49k

Essentially, what you want to do regardless of programming language is:

1. Parse a set of "fasta-like" files (they are FASTA multiple alignment format which for all intents and purposes is fasta)
2. Extract the ID, description and sequence
3. For each unique ID + description, concatenate the associated sequences
4. Print results

So here is a quick, dirty and ugly solution which employs BioPerl SeqIO. I'm assuming a Linux-like system with a command line. BioPerl installation on Ubuntu-based Linux is as simple as sudo apt-get install bioperl.

First, assuming that your sequence files are the only files in their directory and they end with the suffix ".fa", concatenate them into one file:

cat *.fa > allseqs.fa


Now the BioPerl; save this as merge_seqs.pl:

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my %seqs; # a hash to hold the results
my $file = shift or die("Usage = merge_seqs.pl <fasta file>\n"); my$seqio = Bio::SeqIO->new(-file => $file, -format => "fasta"); while(my$seq = $seqio->next_seq) {$seqs{$seq->display_id." ".$seq->description} .= $seq->seq; } # now print foreach my$key (keys %seqs) {
print ">$key\n$seqs{$key}\n"; }  and run like so (output to the file merged.fa): perl merge_seqs.pl allseqs.fa > merged.fa  ADD COMMENT 0 Entering edit mode 7.3 years ago ole.tange ★ 4.0k This ought to do it: unix$ cat file* | perl -pe 's/\r\n?/\n/' | perl -ne 'chomp;if(/^>/) { $id =$_ } else { $content{$id}.=$_} END{print map { "$id\n$content{$id}\n" } keys %content}'  > file.all


It will work even if the order of the sequences inside the files are flipped around or if the lengths are different inside the files. It requires all files to be small enough to fit in memory.

Edit:

Modified to accept Apple \r and Microsoft \r\n line separation, too.

0
Entering edit mode

I ran your script. It took the last sequence of the last file and repeated it over 3000 times in the "file.all" file!

0
Entering edit mode

I have re-tested it on the 6 files you give, and I still get the correct result. Is there something very different hiding in your other files? Can you reproduce my results by running it on your 6 example files?

1
Entering edit mode

Were these files ever saved on Windows? That might be the issue.

0
Entering edit mode

That might just explain the results. Edited answer to support both Windows and Mac. Does it work now?