Question: Command For Merging/Joining
0
gravatar for akhalili26
5.5 years ago by
akhalili260
akhalili260 wrote:

Hi, My protein of interest is a multiple domain protein, and I would like to prepare a "structure based multiple sequence alignment" of it (around 1000 sequences).

Since the available structures of my protein are in different conformational states, I had to make multiple sequence alignments for EACH DOMAIN SEPARATELY. Now I have 6 high quality alignments, corresponding to the six domains of my protein.

The question is that: How can I merge/attach/join these files together? Notes:

  • I am new to programming!
  • All files are in Fasta format (also i have them in Clustal format)
  • The length of ALL sequences in each file is the same (but the length of sequences in different files are different. For example in File 1, the length of all sequences is 120, and in file 2 all of them are 78)
  • Sequences are arranged base on the "ID" in these files (i.e. the sequence IDs in the files are the same & are in the same order)
  • I've used cat command, but it just added sequences at the end of file!
  • Example files are presented below (I have shown only the first 4 sequences. there are around 1000 sequences). they are not real sequences.

I appreciate your help regarding to this matter, and please let me know if you need any more information.

Cheers, Ali

#

#

File 1:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
MLGWIAKKII
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
MLGWIAKKII
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
MLGWIAKKII
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
MLGWIAKKII

File 2:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
--GWIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
--GWIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
--GWIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
--GWIAKKI-AAAA

File 3:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
--GGYYYKLIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
--GGYYYKLIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
--GGYYYKLIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
--GGYYYKLIAKKI-AAAA

File 4:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
--------HHWSGYYYKLIAKKI-AAAA--D
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
--------HHWSGYYYKLIAKKI-AAAA--D
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
--------HHWSGYYYKLIAKKI-AAAA--D
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
--------HHWSGYYYKLIAKKI-AAAA--D

File 5:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
AEE
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
EEE
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
EEE
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
EEE

File 6:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
A
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
A
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
A
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
A

So the final file should look like this: Result file:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DAEEA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DEEEA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DEEEA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DEEEA
multiple-alignment • 4.3k views
ADD COMMENTlink modified 5.5 years ago by User000270 • written 5.5 years ago by akhalili260
1

show us a few lines of your fasta files please...

ADD REPLYlink written 5.5 years ago by Pierre Lindenbaum123k

Updated my question. please let me know if you need any more information.

ADD REPLYlink written 5.5 years ago by akhalili260
1

That looks like CLUSTAL format, or something, but is not at all fasta.

ADD REPLYlink written 5.5 years ago by pld4.8k
2

It's called FASTA multiple alignment format; similar to FASTA but gap characters ("-") are allowed - http://www.bioperl.org/wiki/FASTA_multiple_alignment_format.

ADD REPLYlink written 5.5 years ago by Neilfws48k

do you want to merge the sequences together or files? basically, it is the same ID?

ADD REPLYlink written 5.5 years ago by User000270

I didn't get your question but if you are trying to join files side by side instead of end to end, you can try "join" in UNIX. Check here: http://www.albany.edu/~ig4895/join.htm

ADD REPLYlink written 5.5 years ago by Ashutosh Pandey11k

Does all of your 1000 sequences have their name ending with "SecA"?

ADD REPLYlink written 5.5 years ago by Ashutosh Pandey11k
4
gravatar for pld
5.5 years ago by
pld4.8k
United States
pld4.8k wrote:

Here is a python script that will concat your alignments, assuming each alignment has the same id in the fasta header

import sys
import re
def __main__():
    infile  = sys.argv[1]
    outfile = sys.argv[2]

    with open(infile, 'rb') as fi:
        seqs = fi.read().split('>')[1:]
        seqs = [x.split('\n')[:2] for x in seqs]

    merge = dict()

    #index 0 is header
    #index 1 is seq
    for x in seqs:
        key = re.search('(?<=gi\|)([0-9]*)(?=\|)', x[0]).group(0)
        try:
            merge[key] = merge[key] + x[1]

        except KeyError:
            merge[key] = x[1]


    #write results to file
    with open(outfile, 'w') as fi:
       for x in merge.keys():
           fi.write('>' + x + '\n')
           fi.write(merge[x] + '\n')

__main__()

To put your individual files into a single file:

cat f1.fasta, f2.fasta, ... > bigfile.fasta

You would then run the above script with:

python myscript.py bigfile.fasta outputfile.fasta

Just remember to adjust the file names here to whatever you're working with.

ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by pld4.8k

Sorry for the errors in the code, I verified the regex but typed it into the post, hence the errors. Everything works now.

ADD REPLYlink written 5.5 years ago by pld4.8k
1
gravatar for Ashutosh Pandey
5.5 years ago by
Philadelphia
Ashutosh Pandey11k wrote:

If I understand correctly, you have six files (one for each domain) and each file having 1000 sequences.

1) Excluding the clustalw file for the first domain, run the following command for all the other five files. Make sure you have backed up your files so if something goes wrong, your original files remain unchanged.

sed -i ';s/.* //' File_name_2

sed -i ';s/.* //' File_name_3

........

sed -i ';s/.* //' File_name_6

2) Then use join command in unix and concatenate the first file with the second , third .. sixth file.

join File_name_1 File_name_2 ... File_name_6 > Mega_file

Hope it works.

ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by Ashutosh Pandey11k

I'm new to scripting! I got this when running sed -i ';s/.* //' domain2_cut_clustal

sed: 1: "domain2_cut_clustal": extra characters at the end of d command

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by akhalili260

There should be a gap between "'" and domain_cut_clustal

ADD REPLYlink written 5.5 years ago by Ashutosh Pandey11k

Thanks for your help. It does the job of combining the files. BUT, When I tried to use the final "Mega_file", i realized that it is not a Clustal file! since in the middle of the file some of the lines don't meet the Clustal's standard. (i. e. some of the lines have less than 60 sequence symbols. This is because, for example, the last lines of file 2, which are in the middle of the "Mega_file", have 24 sequence symbols (which i believe will be interpreted as the end of file!))

any suggestion for turning this "Mega file" into a Fasta file or Clustal file?

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by akhalili260
1
gravatar for Neilfws
5.5 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

Essentially, what you want to do regardless of programming language is:

  1. Parse a set of "fasta-like" files (they are FASTA multiple alignment format which for all intents and purposes is fasta)
  2. Extract the ID, description and sequence
  3. For each unique ID + description, concatenate the associated sequences
  4. Print results

So here is a quick, dirty and ugly solution which employs BioPerl SeqIO. I'm assuming a Linux-like system with a command line. BioPerl installation on Ubuntu-based Linux is as simple as sudo apt-get install bioperl.

First, assuming that your sequence files are the only files in their directory and they end with the suffix ".fa", concatenate them into one file:

cat *.fa > allseqs.fa

Now the BioPerl; save this as merge_seqs.pl:

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my %seqs; # a hash to hold the results
my $file = shift or die("Usage = merge_seqs.pl <fasta file>\n");
my $seqio = Bio::SeqIO->new(-file => $file, -format => "fasta");

while(my $seq = $seqio->next_seq) { 
  $seqs{$seq->display_id." ".$seq->description} .= $seq->seq;
}
# now print
foreach my $key (keys %seqs) { 
  print ">$key\n$seqs{$key}\n";
}

and run like so (output to the file merged.fa):

perl merge_seqs.pl allseqs.fa > merged.fa
ADD COMMENTlink written 5.5 years ago by Neilfws48k
0
gravatar for ole.tange
5.5 years ago by
ole.tange3.5k
Denmark
ole.tange3.5k wrote:

This ought to do it:

unix$ cat file* | perl -pe 's/\r\n?/\n/' | perl -ne 'chomp;if(/^>/) { $id = $_ } else { $content{$id}.=$_} END{print map { "$id\n$content{$id}\n" } keys %content}'  > file.all

It will work even if the order of the sequences inside the files are flipped around or if the lengths are different inside the files. It requires all files to be small enough to fit in memory.

Edit:

Modified to accept Apple \r and Microsoft \r\n line separation, too.

ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by ole.tange3.5k

I ran your script. It took the last sequence of the last file and repeated it over 3000 times in the "file.all" file!

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by akhalili260

I have re-tested it on the 6 files you give, and I still get the correct result. Is there something very different hiding in your other files? Can you reproduce my results by running it on your 6 example files?

ADD REPLYlink written 5.5 years ago by ole.tange3.5k
1

Were these files ever saved on Windows? That might be the issue.

ADD REPLYlink written 5.5 years ago by pld4.8k

That might just explain the results. Edited answer to support both Windows and Mac. Does it work now?

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by ole.tange3.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 579 users visited in the last hour