Merging FASTA sequences into a single FASTA sequence in a large file
2.8 years ago
K ▴ 10

Hi,

I have a large file containing more than 10 contigs of different lengths.

Input:

seq_1 AAGGGTTTAGAAAAAAACCAAACAAACAATCGAAACGAAATAGAAAAAGAAAAAGGGAAGGGGTTAAGTTC

seq_2 AAGGGTTTAGAAAAAAACCAAACAAACAATCGAAATGAAATAGAAAAAGAAAAAGGGAAGGGGTTAAGTTC

seq_3 TTCATATAAAAATTGATATAGAATCTTTGAAAAAGACCTTTCTTCCTAAGAAAGAAAAGGCTTACTGTCTT

seq_4 TTCATATAAAAATTGATATAGAATCTTTGAAAAAGCCCTTTCTTCCTAAGAAAGAAAAGGCTTACTGTCTT

seq_5 TTCATATAAAAATTGATATAGAATCTTTGAAAAAGCCCTTTCTTCCTAAGAAAGAAAAGGCTTACTGTCTT

seq_6 AAGGGTTTAGAAAAAAACCAAACAAACAATCGAAACGAAATAGAAAAAGAAAAAGGGAAGGGGTTAAGTTC

seq_7 AAGGGTTTAGAAAAAAACCAAACAAACAATCGAAATGAAATAGAAAAAGAAAAAGGGAAGGGGTTAAGTTC

seq_8 TTCATATAAAAATTGATATAGAATCTTTGAAAAAGACCTTTCTTCCTAAGAAAGAAAAGGCTTACTGTCTT

seq_9 TTCATATAAAAATTGATATAGAATCTTTGAAAAAGCCCTTTCTTCCTAAGAAAGAAAAGGCTTACTGTCTT

seq_10 TTCATATAAAAATTGATATAGAATCTTTGAAAAAGCCCTTTCTTCCTAAGAAAGAAAAGGCTTACTGTCTT

I would like to merge some of the FASTA sequences based on their IDs.

Output:

seq_1_seq_2 AAGGGTTTAGAAAAAAACCAAACAAACAATCGAAACGAAATAGAAAAAGAAAAAGGGAAGGGGTTAAGTTCAAGGGTTTAGAAAAAAACCAAACAAACAATCGAAATGAAATAGAAAAAGAAAAAGGGAAGGGGTTAAGTTC

seq_3 TTCATATAAAAATTGATATAGAATCTTTGAAAAAGACCTTTCTTCCTAAGAAAGAAAAGGCTTACTGTCTT

seq_4 TTCATATAAAAATTGATATAGAATCTTTGAAAAAGCCCTTTCTTCCTAAGAAAGAAAAGGCTTACTGTCTT

seq_5 TTCATATAAAAATTGATATAGAATCTTTGAAAAAGCCCTTTCTTCCTAAGAAAGAAAAGGCTTACTGTCTT

seq_6_seq_7 AAGGGTTTAGAAAAAAACCAAACAAACAATCGAAACGAAATAGAAAAAGAAAAAGGGAAGGGGTTAAGTTCAAGGGTTTAGAAAAAAACCAAACAAACAATCGAAATGAAATAGAAAAAGAAAAAGGGAAGGGGTTAAGTTC

seq_8_seq_9_seq_10 TTCATATAAAAATTGATATAGAATCTTTGAAAAAGACCTTTCTTCCTAAGAAAGAAAAGGCTTACTGTCTTTTCATATAAAAATTGATATAGAATCTTTGAAAAAGCCCTTTCTTCCTAAGAAAGAAAAGGCTTACTGTCTTTTCATATAAAAATTGATATAGAATCTTTGAAAAAGCCCTTTCTTCCTAAGAAAGAAAAGGCTTACTGTCTT

Thank you

fasta sequences • 717 views

What have you tried? Are you hoping someone will just give you a complete script?

2.8 years ago
Michael 54k

Here is a Perl script I use to join sequences from multiple FASTA files by their sequence ID. Records that share an ID are concatenated in the order in which the files are given.

Usage:

combineFasta.pl -out outputfile.fasta -in file1.fasta file2.fasta .... <more files>

Perl code:

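#!/usr/bin/env perl

use strict;
use warnings;
use Getopt::Long;
use Bio::SeqIO;

my @inputs = ();
my $out;
GetOptions ("out=s" => \$out, "in=s{1,}" => \@inputs)
  or die "Usage: $0 -out outputfile.fasta -in file1.fasta [file2.fasta ...]\n";
die "Usage: $0 -out outputfile.fasta -in file1.fasta [file2.fasta ...]\n"
  unless defined $out && @inputs;

my %h = ();      # sequence id => Bio::Seq object holding the joined sequence
my @order = ();  # sequence ids in order of first appearance

foreach my $f (@inputs) {
  print $f,"\n";
  my $in = Bio::SeqIO->new(-file => $f);
  while (my $seq = $in->next_seq) {
    my $gid = $seq->display_id;
    if (! ref $h{$gid}) {
       # first record with this id: keep it as the base record
       $h{$gid} = $seq;
       push @order, $gid;
    } else {
      # id seen before: append this sequence to the stored one
      $h{$gid}->seq ($h{$gid}->seq . $seq->seq);
    }
  }
}
print "finished reading input \n";

# write the joined records in the order their ids were first seen
my $writer = Bio::SeqIO->new(-format => 'fasta', -file => ">$out");
foreach my $gid (@order) {
   $writer->write_seq($h{$gid});
}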
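
The script above joins records that share an ID across several files. The question asks for something slightly different: merging differently named records within one file. A minimal sketch of that variant, written in Python with the standard library only (not BioPerl), might look like the following. It assumes standard FASTA with `>` headers, that the records fit in memory, and that the groups of IDs to merge are supplied explicitly, since the question does not say how they are chosen. The names `read_fasta` and `merge_groups` and the toy sequences are made up for this illustration.

```python
import io


def read_fasta(handle):
    """Yield (id, sequence) tuples from a FASTA-formatted text handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The id is the first whitespace-delimited word of the header.
            seq_id, chunks = (line[1:].split() or [""])[0], []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def merge_groups(records, groups):
    """Concatenate the records named in each group; pass the rest through.

    records: list of (id, sequence) tuples.
    groups:  list of id lists, e.g. [["seq_1", "seq_2"]] (assumed input).
    A merged record is emitted where its first member appears in the input.
    Raises KeyError if a group names an id that is not in records.
    """
    group_of = {seq_id: tuple(g) for g in groups for seq_id in g}
    seqs = dict(records)
    emitted = set()
    merged = []
    for seq_id, seq in records:
        group = group_of.get(seq_id)
        if group is None:
            merged.append((seq_id, seq))
        elif group not in emitted:
            emitted.add(group)
            merged.append(("_".join(group), "".join(seqs[m] for m in group)))
    return merged


if __name__ == "__main__":
    # Toy input, not the sequences from the question.
    demo = io.StringIO(">seq_1\nAAGG\n>seq_2\nTTCA\n>seq_3\nGGCC\n")
    records = list(read_fasta(demo))
    for seq_id, seq in merge_groups(records, [["seq_1", "seq_2"]]):
        print(f">{seq_id}\n{seq}")
```

For a file too large to hold in memory, the same grouping idea would need an index or a second pass over the file rather than the in-memory dictionary used here.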