Question: Replace Sequence IDs in many text files (newick format) - Change script
0
gravatar for ibasan
20 months ago by
ibasan30
ibasan30 wrote:

Dear all, i found this great script to replace sequence identifiers in a text file (newick format) with strings from a tab delimited file in this post

use strict;
use warnings;
my $treeFile = pop;
my %taxonomy = map { /(\S+)\s+(.+)/; $1 => $2 } <>;
push @ARGV, $treeFile;
while ( my $line = <> ) {
    $line =~ s/\b$_\b/$taxonomy{$_}/g for keys %taxonomy;
    print $line;
}

It works perfect for one text file (newick) but i have thousonds (<20000) of such text files (newick) and one tab delimited file. How can i change this script for all of my text files? Any help would be great!

newick tree replace perl • 861 views
ADD COMMENTlink modified 17 months ago by Biostar ♦♦ 20 • written 20 months ago by ibasan30
5
gravatar for jrj.healey
20 months ago by
jrj.healey8.6k
United Kingdom
jrj.healey8.6k wrote:

You don't need to modify the perl script to achieve this, just call the script in a loop in the Shell (if your script works the way I assume it does).

Something like:

for file in *.tree ; do perl scriptname.pl tabfile.tsv "${file}" ; done

I don't know if your script 'modifies-in-place' (I'm no perl-aficionado). If not, just redirect the STDOUT to make a new file:

for file in *.tree ; do perl scriptname.pl tabfile.tsv "${file}" > "${file%.*}"_modified.tree ; done

Would give you a new file with _modified.tree appended to the end, for every input tree you had:

tree1.tree -> tree1_modified.tree etc etc...

ADD COMMENTlink modified 20 months ago • written 20 months ago by jrj.healey8.6k
1

ibasan : If you have access to a cluster you could use this script to fire those jobs in parallel. Adjust do perl line to include relevant job scheduler commands (warning: test with a small set of files first before you fire off 20000 incorrect jobs).

ADD REPLYlink modified 20 months ago • written 20 months ago by genomax58k

Hi genomax2, atm sridhar56 script is running. I tried your version like this:

!/usr/bin/perl
for file in *.txt ; do perl script.pl tabfile.csv "${file}" > "${file%.*}"_modified.tree;
done

But this did not work. Can you please explain what i have to do?

ADD REPLYlink modified 20 months ago • written 20 months ago by ibasan30

You either need to run that on the terminal using the line,

for file in *.txt ; do perl script.pl tabfile.csv "${file}" > "${file%.*}"_modified.tree;done

Or copy the script into a shell file, Ex: tree.sh,

#!/bin/bash
 for file in *.txt ; do perl script.pl tabfile.csv "${file}" > "${file%.*}"_modified.tree;done

and run on the terminal,

sh tree.sh
ADD REPLYlink written 20 months ago by sridhar56100

I tested it with 1000 files and it works! Thanks for helping me again!

ADD REPLYlink written 20 months ago by ibasan30
1
gravatar for sridhar56
20 months ago by
sridhar56100
sridhar56100 wrote:

So, to make your script work for the n number of file you have, these are the things to do:

  1. Read all the files from a directory and modify them
  2. Replace the ID's as done above
  3. Save the files to an array
  4. Print them out as individual files

Below is my code, I modified to fit your situation, you will have to modify it accordingly with your input files

ADD COMMENTlink modified 20 months ago • written 20 months ago by sridhar56100
#!/usr/bin/perl
use strict;
use warnings;
use Cwd;                      #Takes in the current path of the directory
use FileHandle;
my $cwd = getcwd();              #Takes in the current directory of files
my $opendir = "$cwd";               #Obtains the path of your current directory
my $treeFile = pop;
my %taxonomy = map { /(\S+)\s+(.+)/; $1 => $2 } <>;
push @ARGV, $treeFile;
my @file_name;
my @file;
my @directory = grep /tre$/, readdir DIR;  #assuming all your tree files have the tre as file extension
opendir (DIR, $opendir) or die $!;              #Opens the current directory, the path of all your files
my $scalar = scalar(@directory);                #Counting the number of files and printing it in the next line
for (my $i=0; $i<scalar(@directory); $i++){
my $name = $directory[$i];
print "The file read in: $name\n";
push @file_name,$name;
my $tmp = "$opendir/$directory[$i]";
open(FILE, $tmp) or die "Can't open $tmp\n";              #Reading the file and storing them in a string in the rep_line array
while ( <FILE> ) {
    $line = $_;
    $line =~ s/\b$_\b/$taxonomy{$_}/g for keys %taxonomy;
    #print $line;
    push @file,$line;
    }
}


mkdir "$cwd/Individual_tree_files"; 
for (my $i=0;$i<scalar(@file);$i++){
printer($file_name[$i],$file[$i]);                  #Calls the subroutine for printing to new files
}


sub printer { 
my @print_array = @_;
my $print_name = substr $print_array[0],1,-1;                           #Use a substring to grab the first part of the file name, modify accordingly
$print_name=~ s/\//_/g;
print "$print_name\n";
my $tree= "\.tre";
my $name_for_folder = "Individual_tree_files/";
my $name_for_file = "$name_for_folder$print_name$tree";                        #Creating the name for every output file
print "$name_for_file\n";
open(my $fh, '>',$name_for_file) or die "Could not open file '$name_for_file' $!";
 print $fh "$print_array[1]";                                                          #Printing the output file into the folder
 print "File for $print_name copied to Individual_tree_files\n";
 close $fh;
}
ADD REPLYlink modified 20 months ago by genomax58k • written 20 months ago by sridhar56100

Tried to format it, was unable to, if you copy paste the code to a text file, save it as perl file and then run the script, it should work

ADD REPLYlink written 20 months ago by sridhar56100
1

Highlight text you want to format, click on the 101010 button.

ADD REPLYlink written 20 months ago by genomax58k

Shall do that in the future!

ADD REPLYlink written 20 months ago by sridhar56100

Thanks a lot sridhar56 for helping me! Do i use the script as follow:

perl script.pl taxonomyfile.csv

in the directory with all my treefiles? I added my $line = $_; in the esction with variables And now its calculating...

ADD REPLYlink modified 20 months ago • written 20 months ago by ibasan30

The problem is, $line does not have a my before it at line 23. To make the script function well for your files, it would be best to copy all the taxonomy ID's within the script and assign them to the hash %taxonomy. Then map them to your tree files. That will work.

To keep it easy, if you have access to linux, try the solution suggested by,@jrj.healey. His solution is simpler than mine

ADD REPLYlink written 20 months ago by sridhar56100
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 849 users visited in the last hour