Question

Interleaving Records From More Than One Fasta File

4

Entering edit mode

12.6 years ago

aindap ▴ 120

I am trying to run a model in PAML that requires me to combine records from multiple fasta files. I have 6 FASTA files, each with the same number of records. What I want to do is interleave the records into a single file such that my result file has:

>record1_fileOne
AAA
>record1_fileTwo
AAA
>record1_fileThree
AAA
>record1_fileFour
AAA
>record1_fileFive
AAA
>record1_fileSix
AAA
>record2>fileOne
GGG

I wrote the code below which just concatenates the fasta records, not interleaving them. I think there is probably some sort of trick I can use using python itertools that I'm just not seeing. Can anyone point me in the right direction?

I found this script that interleaves 2 fasta files, but I need to extend it to N fasta files:

def read_fasta(fh):
    """ generator for reading a fasta record: taken from [http://stackoverflow.com/a/7655072/1735942][2] """
    name, seq = None, []
    for line in fh:
        line = line.rstrip()
        if line.startswith(">"):
            if name: yield (name, ''.join(seq))
                name, seq = line, []
            else:
                seq.append(line)
        if name: yield (name, ''.join(seq))




fastafiles=args[0:]
filehandles=list(itertools.imap(open, fastafiles)) #list of filehandles for fasta files 

for fh in filehandles:
    for id,seq in read_fasta(fh):
        print id
        print seq

python paml fasta • 8.7k views

ADD COMMENT • link updated 12.6 years ago by brentp 24k • written 12.6 years ago by aindap ▴ 120

Ram · Answer 1 · 2013-03-24

linearize each fasta file and put them into paste. An example with 3 files:

paste <(awk '/^>/ { printf("\n%s_file1\t",$0);next;} {printf("%s",$0);} END {printf("\n");}' file1.fasta ) <(awk '/^>/ { printf("\n%s_file2\t",$0);next;} {printf("%s",$0);} END {printf("\n");}' file2.fasta )  <(awk '/^>/ { printf("\n%s_file3\t",$0);next;} {printf("%s",$0);} END {printf("\n");}' file3.fasta ) | tr "\t" "\n"

Ram · Answer 2 · 2013-03-27

this is a nice chance to use some of python's itertools machinery. First the code:

from itertools import groupby, izip, chain
import sys

def fasta_iter(fasta_name):
    fh = open(fasta_name)
    faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))
    for header in faiter:
        header = header.next()[1:].strip()
        yield header, "".join(s.strip() for s in faiter.next())

# interleave the fastas with izip and chain the results. this is all lazy
fastas = chain.from_iterable(izip(*[fasta_iter(fa) for fa in sys.argv[1:]]))

for header, seq in fastas:
    print ">", header
    print seq

the fasta_iter just takes a fasta file name, and creates a generator that yields the header and the sequence.

Note that it's using the groupby function and splitting into new groups based on the apearance of ">" as the first element in the line.

Once it finds the header group by the presence of ">", it knows that anything in the next group (without ">") is the sequence and it can just merge lines from that group.

it uses that to create a lazy list of iterables with:

[fasta_iter(fa) for fa in sys.argv[1:]]

that only reads from each file on demand. Since you want to interleave the records, you call izip, which is the lazy version of python's zip:

>>> zip("abc", "123")
[('a', '1'), ('b', '2'), ('c', '3')]

and it can handle as many arguments as you want.

Then we use itertools.chain.from_iterable so that we can use only a single loop to iterate over the files -- otherwise, with zip, we'd need an extra loop.

Ram · Answer 3 · 2013-03-25

3

Entering edit mode

12.6 years ago

Alex Reynolds 36k

Here's a way to do it without using external libraries, which also demonstrates how to interleave an arbitrary number of input files:

#!/usr/bin/env perl   

use strict;
use warnings;

#
# let's say your files are named file-1.fa through file-6.fa...
#

#
# open file handles for each of the six files, storing a reference to the 
# file handle in a list
#
my $fhRef;
foreach my $idx (1..6) {
  my $sourceFn = "file-$idx.fa"; 
  local *FILE;
  open (FILE, "< $sourceFn") or die "Could not open $sourceFn\n";
  $fhRef->[$idx] = *FILE;
}

#
# interleave data and print to standard output until we reach EOF on 
# the last file
#
# NOTE: this assumes sequence data are on one line, i.e., that the 
# header and sequence lines alternate on separate lines
#
my $fhsOpen = 0;
while ($fhsOpen == 0) {
  foreach my $idx (1..6) {
    my $fh = $fhRef->[$idx];
    my $header = <$fh>;
    my $sequence = <$fh>;
    print STDOUT "$header$sequence";
    if ((eof($fh)) && ($idx == 6)) {
      $fhsOpen = 1;
    }
  }
}

#
# close handles, for general file I/O hygiene
#
foreach my $idx(1..6) {
  close $fhRef->[$idx];
}

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 12.6 years ago by Alex Reynolds 36k

3

Entering edit mode

This solution is problematic, for several reasons. Programatically, these include:

The three-argument open should be used
Lexically-scoped file handles should be used

Additionally, how would your script handle the following valid fasta record?

>YAL069W-1.334 Putative promoter sequence
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA
CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC
CACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATAC
TGTTCTTCTACCCACCATATTGAAACGCTAACAA

It can't. You can easily fix the first two; and when using lexically-scoped variables for files handles, the files automatically close when those variables fall out of scope or the program ends. The third issue's solved by initializing a local copy of Perl's record separator to '>': the fasta record separator. For example, the following does this and interleaves the fasta records:

use strict;
use warnings;

local $/ = '>';
my %files = map { open my $fh, '<', $_ or die $!; $_ => $fh } @ARGV;

while (1) {
    for (@ARGV) {
        defined( my $record = readline $files{$_} ) or exit;
        next if $record eq $/;
        chomp $record;
        $record =~ s/\s+\z//;
        print $/ . "$record\n";
    }
}

Usage:

perl script.pl fastaFile [fastaFiles ... ] [>outputFile]

I applaud(!) you giving the OP a non-module option, but it's also beneficial to become familiarized with and use the well-developed bio modules which live for such tasks.

ADD REPLY • link updated 5.9 years ago by Ram 45k • written 12.6 years ago by Kenosis ★ 1.3k

1

Entering edit mode

I was offering a solution that works with the input as described, though you are correct that it would not handle multiline sequence data without extra work. I like your use of map to set up the handle table, though. Much nicer. I was sloppy with the open call so I added a die message in case the file can't be opened.

ADD REPLY • link 12.6 years ago by Alex Reynolds 36k

Ram · Answer 4 · 2013-03-24

2

Entering edit mode

12.6 years ago

Kenosis ★ 1.3k

Here's a Perl option that uses Bio::SeqIO:

use strict;
use warnings;
use Bio::SeqIO;

my @fastaFiles = qw(file1.fasta file2.fasta file3.fasta file4.fasta file5.fasta file6.fasta);
my %SeqIO_Objects;

my $outFasta = Bio::SeqIO->new( -file => '>outFile.fasta', -format => 'Fasta' );

$SeqIO_Objects{$_} = Bio::SeqIO->new( -file => $_, -format => 'Fasta' )
  for @fastaFiles;

while (1) {
    for my $fastaFile (@fastaFiles) {
        $outFasta->write_seq( $SeqIO_Objects{$fastaFile}->next_seq() or exit );
    }
}

This creates a hash of Bio::SeqIO objects, and reads one fasta record from each object within the while, and then writes those records to outFile.fasta. It exits the while loop when a read fails (no more records).

Here's a Python option that uses Biopython:

from Bio import SeqIO

fastaFiles = [
    'file1.fasta',
    'file2.fasta',
    'file3.fasta',
    'file4.fasta',
    'file5.fasta',
    'file6.fasta',
    ]
allRecords = [list(SeqIO.parse(open(fastaFile, 'rU'), 'fasta'))
              for fastaFile in fastaFiles]

outFasta = open('outFile.fasta', 'w')

for i in range(0, len(allRecords[0])):
    for fastaRecord in allRecords:
        SeqIO.write(fastaRecord[i], outFasta, 'fasta')

outFasta.close()

This uses a list comprehension to create a list of lists, where each list is the set of Fasta records from each Fasta file. It then iterates through these lists to interleave the files' records, and writes those out to outFile.fasta.

Hope this helps!

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 12.6 years ago by Kenosis ★ 1.3k

1

Entering edit mode

this solution is needlessly reading all of the fasta files into memory.

ADD REPLY • link 12.6 years ago by brentp 24k

0

Entering edit mode

Yes, it is, as I was unable to locate BioPython's equivalent to BioPerl's fasta next() method.

ADD REPLY • link 12.6 years ago by Kenosis ★ 1.3k

0

Entering edit mode

I'd bet my fasta that SeqIO.parse() returns an iterable.

ADD REPLY • link 12.6 years ago by brentp 24k

0

Entering edit mode

The documentation says yes: "The main function is Bio.SeqIO.parse() which takes a file handle and format name, and returns a SeqRecord iterator." (cf.http://biopython.org/wiki/SeqIO)

ADD REPLY • link 12.6 years ago by Alex Reynolds 36k

0

Entering edit mode

Appreciate this, and have provided an updated script in a reply.

ADD REPLY • link 12.6 years ago by Kenosis ★ 1.3k

0

Entering edit mode

You'd win that bet and I appreciate your comment. Have literally written only a few Python scripts, so am on a learning curve. Would now just do the following with the iterable:

import sys
from Bio import SeqIO

gens = [SeqIO.parse(open(fastaFile, 'rU'), 'fasta') for fastaFile in
        sys.argv[1:]]

while True:
    for i in range(0, len(gens)):
        try:
            inRec = gens[i].next()
            print inRec.id + '\n' + inRec.seq
        except:
            exit()

Usage:

python script.py file1.fasta file2.fasta [...] [>fileOut.fasta]

ADD REPLY • link updated 5.9 years ago by Ram 45k • written 12.6 years ago by Kenosis ★ 1.3k

0

Entering edit mode

Here, and in your comments, while (1) {} is superfluous in the Perl code. BioPerl will crash if it gets no files before you get to the while loop, and in your example in the comments to Alex's answer, open will die if gets no files before you get to the while loop. Why do you think it's necessary? The only reason I can think of is you're trying to check if you got any files, but you will never get to that point in that situation.

ADD REPLY • link 12.6 years ago by SES 8.6k

0

Entering edit mode

Niether have to do with files, but rather file lines, as both Perl scripts read sets of six lines at a time from the files, and there's no way to know how many sets there are, so the while (1) { ... is used to keep reading those sets and is exited when no more sets can be read.

ADD REPLY • link 12.6 years ago by Kenosis ★ 1.3k

0

Entering edit mode

When would a for (...) {} loop alone not do that for you? Also, I think you are reading/writing one line from one file at a time, not "six lines at a time from all the files" as you say.

ADD REPLY • link 12.6 years ago by SES 8.6k

0

Entering edit mode

What would be the iterable in that for loop such that you know the number of lines in the fasta files? Also, the script is reading sets of six lines at a time--not six lines at a time.

ADD REPLY • link 12.6 years ago by Kenosis ★ 1.3k

0

Entering edit mode

You are right, I stand corrected :). On a multi-fasta, while(1) is needed. Nice work.