Question: Merge Files (Adding File Names To Sequence Names)
2
gravatar for biolab
5.4 years ago by
biolab1.1k
biolab1.1k wrote:

Dear all,

I have many sequence files. An example is shown below.

FILE1:  Homo sapiens
>Act1
tgtgcatgcacatgactgaccaca
>Tub1
gttgcatgcatgcatgacgtacactg
...............


FILE2: C elegans
>Act1
tgcatgtgactgactgac
>Tub1
catgcatgactgcatgactgactg
.................


FILE3: A thaliana
>Act1
actgactgtacgtgactgactactgac
>Tub1
tgtctgatgcatgcatgactgactgtgacactg
.................

I need to merge all files and add species name to sequence name, the output is like below.

>Act1-Homo sapiens
tgtgcatgcacatgactgaccaca
>Tub1-Homo sapiens
gttgcatgcatgcatgacgtacactg
>Act1- C elegans
tgcatgtgactgactgac
>Tub1- C elegans
catgcatgactgcatgactgactg
>Act1-A thaliana
actgactgtacgtgactgactactgac
>Tub1-A thaliana
tgtctgatgcatgcatgactgactgtgacactg

Would anyone please inform me a perl, awk or shell code to finish this? Thank you very much!!

perl awk • 5.6k views
ADD COMMENTlink modified 3.6 years ago by Biostar ♦♦ 20 • written 5.4 years ago by biolab1.1k

Be aware that if you keep the space in the species name, it will generate a break in the FASTA header. So for example in the first line, Act1-Homo will be interpreted as the sequence ID and sapiens as a sequence description.

ADD REPLYlink written 5.4 years ago by Neilfws48k

Yes, it should be considered.

ADD REPLYlink written 5.4 years ago by biolab1.1k
4
gravatar for Pierre Lindenbaum
5.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:
awk '/^>/ {printf("%s-%s\n",$0,FILENAME);next;} {print;}' "Homo sapiens" "C elegans" "A thaliana"
ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by Pierre Lindenbaum120k
4

Hi Pierre, here is a shorter version (I like to code golf): awk '{print /^>/ ? $0"-"FILENAME : $0}' *. If the line starts with >, print the line $0 + the FILENAME, else print the line. I replaced the list of files by a star, to be sure to adapt to the situation where many files have to be treated.

ADD REPLYlink written 5.4 years ago by Frédéric Mahé2.9k
1

Thank you Frederic :-)

ADD REPLYlink written 5.4 years ago by Pierre Lindenbaum120k

Thank you both. The awk commands are really cool.

ADD REPLYlink written 5.4 years ago by biolab1.1k

Hi Pierre, Thank you greatly for your answer. However, the above is an example only. I have many files. I wrote a perl script (I am a true beginner) to do this job.

#!/usr/bin/perl
use strict;
use warnings;

my @files = @ARGV;

foreach my $filename (@files){
    $a = shift $filename;
    while (my $line = <>){
    chomp $line;
    if ($line =~/>/){
        print "$line" . '-' . "$a\n";
        } else {
        print "$line\n";}
    }
}

My command is $ perl merge.pl *.fa
The error messag is Not an ARRAY reference at merge.pl line 8. Could you please check the above script and correct it? Thank you very much!

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by biolab1.1k
1

Correcting your code:

#!/usr/bin/perl
use strict;
use warnings;

my @files = @ARGV;

foreach my $filename (@files){
    open FH, "$filename" or die;
    while (my $line = <FH>){
    chomp $line;
    if ($line =~/>/) {
        print "$line" . '-' . "$filename\n";
    } else {
        print "$line\n";}
    }
    close FH;
}

Check that $filename contains your file name and need to be used to read the file content as FH. After each cycle, you need to close it.

ADD REPLYlink written 5.4 years ago by JC7.8k
1

Well done! In case you might be interested, you could do the following:

#!/usr/bin/perl
use strict;
use warnings;

my @files = @ARGV;

foreach my $filename (@files) {
    open my $fh, '<', $filename or die $!;
    while (<$fh>) {
        chomp;
        print /^>/ ? "$_-$filename\n" : "$_\n";
    }
}

This uses the three-argument form of open, including lexical file handles. Note that close $fh; is missing. Perl will automatically close the file when a new handle is assignned to $fh. Also, the above takes advantage of Perl's default scalar $_ when chompping and matching. Finally, the if-then-else is replaced by the ternary operator.

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by Kenosis1.2k
3
gravatar for Kenosis
5.4 years ago by
Kenosis1.2k
Kenosis1.2k wrote:

Here's another Perl option:

use strict;
use warnings;

for my $file ( grep $_ ne $0, <*> ) {
    open my $fh, '<', $file or die $!;

    while (<$fh>) {
        chomp;
        s/^>.+\K/-$file/;
        print "$_\n";
    }
}

Usage: perl script.pl [>outFile]

The last, optional parameter directs output to a file.

Drop the script into the directory where only your fasta files live that you want to combine. The script will read all the file names in that directory--filtering out its own name--iterate through all, and append the current file name it's reading to the end of each fasta header line of the file.

It's problematic just sending file names on the command line to such a script, since you'd need to enclose each name within quotes, as Pierre Lindenbaum did with his example, because of the spaces. This solution bypasses the need to send those names to the script.

How does this work? The script uses a glob (<*>) to get a listing of all files in the current directory, then greps each against the script's name, so it's not going to read in the script. As each file line is read, the input record separator (usually \n) is chompped off, and a substitution--which looks for a > at the start of a line--is used. The substitution matches the complete header line, but the \K says "\Keep all to the left," so a hyphen and the file name is added to that line. Finally, that line is printed. (Note there is no close $fh;. Perl will automatically close the opened file when a new file handle is assigned to $fh.)

The following Python script produces the same results:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import re

for fileName in os.listdir('.'):
    if fileName == os.path.basename(__file__) or os.path.isdir(fileName):
        break

    inFile = open(fileName, 'r')
    for line in inFile:
        print re.sub(r'^(>.+)', r'\1-%s' % fileName, line.rstrip('\n'))

Usage: python script.py [>outFile]

Hope this helps!

ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by Kenosis1.2k

Hi Kenosis and JC, your solutions are really helpful, and make me better understand perl. Thank you very much!

ADD REPLYlink written 5.4 years ago by biolab1.1k

You're most welcome, biolab!

ADD REPLYlink written 5.4 years ago by Kenosis1.2k
2
gravatar for Malcolm.Cook
5.4 years ago by
Malcolm.Cook1.0k
kansas, usa
Malcolm.Cook1.0k wrote:

here's a perl one-liner to call from the command line that you can use with your multiple fasta files:

perl -lpe '$_.="-$ARGV" if m/^>/' *.fa
ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by Malcolm.Cook1.0k

Hi Malcolm, thank you very much!

ADD REPLYlink written 5.4 years ago by biolab1.1k

Simple and very useful!! Thank you!!

ADD REPLYlink written 4.5 years ago by Prakki Rama2.2k
1
gravatar for voorloopnul
5.4 years ago by
voorloopnul20
voorloopnul20 wrote:

Hello biolab, I don't know if you are a beginner in Perl or in programming. If you are starting at programming too, I would suggest you to try python, as it is closer to english. Here is a longer version that is easier to read and understand:

#!/usr/bin/python
import os

def merge(file_name):
    data = []
    with open(file_name) as fh:
        for line in fh.readlines():
            line = line.replace("\n", '')
            if line.startswith('>'):
                new_header = (line+" -"+file_name)
                data.append(new_header+"\n")
            else:
                data.append(line+"\n")

    with open('output','a') as fh:
        fh.write(''.join(data))

for file_name in os.listdir('.'):
    if file_name.endswith(".fa"):
        merge(file_name)
ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by voorloopnul20
1

Thanks, tarzvf, I am new in programming. Recently I spend much time on perl. I can learn Python in the meantime, which i believe is very useful. Actually I mostly focus on lab work, but the increasing sequencing technology urges me to begin leraning some R, shell and perl programming. I really find these bioinformatics tools powerful and useful. Thank you for your suggestions.

ADD REPLYlink written 5.4 years ago by biolab1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1462 users visited in the last hour