eliminate redundant lines in perl
8.2 years ago
cabraham03 ▴ 30

Hi, I'm trying to write a script to extract the unique taxonomic assignments from a txt file produced by a QIIME analysis:

results.txt file:

k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Nocardiaceae;Other    4.48159186143e-07
k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Nocardiaceae;g__  1.34447755843e-06
k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Nocardiaceae;g__Rhodococcus   4.48159186143e-07
k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Nocardioidaceae;g__   6.72238779214e-06
k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Nocardioidaceae;g__Kribbella  4.48159186143e-07
k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Propionibacteriaceae;g__  2.24079593071e-06
k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Pseudonocardiaceae;g__Pseudonocardia 2.24079593071e-06
k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Pseudonocardiaceae;g__Pseudonocardia 2.24079593071e-06

What I want is to extract the family or genus assignment (for example, f__Nocardioidaceae) as a non-redundant list. In this case the result would be:

For family:

Nocardiaceae
Nocardioidaceae
Propionibacteriaceae
Pseudonocardiaceae

I have been using this code:

#!/usr/bin/perl -w
use strict;

while ( <>) {
    $line = $_;
    chomp($line);

    if ($line=~ m/^#/g) {
        next;
    }

    elsif ($family) {
        my @fam= ($line=~ m/f__[\W]?(.*)[\W]?;g/g);

        foreach(@fam){
            if ($_=~ m/^$/g) {
                next;
            }

            else {
                my @uniq_list = uniq(@fam);
                print  "$_\n";
            }
        }
    }

    else {
        print "ERROR\n";
        exit;
    }
}

# Second option using a subroutine:

sub uniq {
    my %seen;
    grep !$seen{$_}++, @_;
}

#and then......

else {
    my @uniq_list = uniq(@fam);
    print  OUTFILE "$_\n";
}
}

Neither of them works!

Thanks so much !!!

perl qiime 16s metagenomics • 2.1k views

Hi, please use code formatting for code. Please be specific: what is it that doesn't work? Do you get an error message, no output, or something else? Also, if possible, reduce the posted script to the essentials: remove getopt and the code that opens the input and output files. In your case, a script that reads from stdin and prints to stdout is entirely sufficient.

Further, do you need the list in any specific order?

8.2 years ago
mastal511 ★ 2.1k

Here are a couple of hints.

In Perl, hash keys are unique strings. If you create a hash and store the extracted family terms as its keys, then the keys give you the non-redundant list of family terms found in your input file.

In Perl regular expressions, \W matches a non-word character, so I think what you really want is \w, which matches a word character.
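
As a minimal sketch of both hints, the family names can be captured with \w+ and stored as hash keys. The sample lines are taken from the question; the sub name unique_families is made up for illustration and is not part of anyone's script in this thread.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns the distinct family names found in the
# given lines, in order of first appearance.
sub unique_families {
    my @lines = @_;
    my (%seen, @families);
    for my $line (@lines) {
        next if $line =~ /^#/;              # skip comment lines
        # \w+ matches word characters (letters, digits, underscore);
        # \W would match the opposite, i.e. non-word characters.
        next unless $line =~ /f__(\w+)/;
        # A hash key can exist only once, so %seen filters duplicates.
        push @families, $1 unless $seen{$1}++;
    }
    return @families;
}

my @sample = (
    "k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Nocardiaceae;Other\t4.48159186143e-07",
    "k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Nocardiaceae;g__Rhodococcus\t4.48159186143e-07",
    "k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Nocardioidaceae;g__Kribbella\t4.48159186143e-07",
);

print "$_\n" for unique_families(@sample);
```

With these three sample lines, the sketch prints Nocardiaceae and Nocardioidaceae once each. To process a real file, the same sub could be fed the lines read from <> instead of the inline sample.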

8.2 years ago
Michael 54k

Here is an _untested_ reduced perl script derived from yours, that does what it is supposed to:

#!/usr/bin/env perl
use strict;
use warnings;

my %uniq = ();
while (<>) {
  next if (/^#/);
  # your regex was more or less OK, except that it wouldn't parse the first line of input
  # are you expecting non-word characters (\W) at the family boundary?
  # [^;]+ stops at the first ';' after f__, so deeper ranks are never captured
  next unless /f__\W?([^;]+)\W?;/;
  $uniq{$1} = $1;
}
print join "\n", keys %uniq;
8.2 years ago
cabraham03 ▴ 30

Sorry, I have only used this site a few times and I don't know how to format the code properly. What I want, and still don't know how to make work, is the following:

What I get with the script is a list like this (the code works well up to this point):

Hyphomicrobiaceae
Hyphomicrobiaceae
Methylobacteriaceae
Methylobacteriaceae
Methylocystaceae
Methylocystaceae
Phyllobacteriaceae
Phyllobacteriaceae
Rhizobiaceae
Rhizobiaceae
Rhodobiaceae
Xanthobacteraceae
Xanthobacteraceae
Hyphomonadaceae
Hyphomonadaceae
Rhodobacteraceae
Rhodobacteraceae
Rhodobacteraceae
Acetobacteraceae
Acetobacteraceae
Rhodospirillaceae
Rhodospirillaceae

But what I want is to eliminate the redundant entries and keep just one of each:

Hyphomicrobiaceae
Methylobacteriaceae
Methylocystaceae
Xanthobacteraceae
Hyphomonadaceae
Rhodospirillaceae
Acetobacteraceae

I tried what you advised, but it still prints all of them instead of just one of each.

this is the full script that I'm using:

Thanks so much to all of you!


Why don't you just use my script then? It is nice that your script has so many options and so much documentation, but it tries to do everything at once, which makes it a bit too large, at least for me to edit. If you want it to output only the unique entries, feel free to integrate my code.

8.2 years ago
James Ashmore ★ 3.4k

If you feel like trying a simpler, albeit limited, method, you could use built-in Unix commands for this task:

cut -d ";" -f 5 test.txt | sort | uniq | sed 's/f__//g'
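
One limitation of this pipeline is that it takes the fifth ";"-separated field, so it relies on the family sitting at that position on every line. A rough Perl equivalent that selects the field by its f__ prefix instead might look like the sketch below. The sub name families_by_prefix and the inline sample data are illustrative assumptions, not part of the original answer.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: splits each lineage on ";" and keeps the field that
# starts with "f__", wherever it sits in the lineage string.
sub families_by_prefix {
    my %seen;
    for my $line (@_) {
        my ($lineage) = split ' ', $line;      # drop the abundance column
        next unless defined $lineage;          # skip blank lines
        for my $field (split /;/, $lineage) {
            next unless $field =~ /^f__(.+)/;  # a bare "f__" has no name: skipped
            $seen{$1} = 1;
        }
    }
    my @sorted = sort keys %seen;              # sorted, like sort | uniq
    return @sorted;
}

my @sample = (
    "k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Pseudonocardiaceae;g__Pseudonocardia 2.24079593071e-06",
    "k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Nocardiaceae;g__  1.34447755843e-06",
    "k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Pseudonocardiaceae;g__Pseudonocardia 2.24079593071e-06",
);

print "$_\n" for families_by_prefix(@sample);
```

For the three sample lines this prints Nocardiaceae and Pseudonocardiaceae, each once, in sorted order.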
8.2 years ago
mastal511 ★ 2.1k
use strict;
use warnings;

# declare hash
my %fam;

# read through input file one line at a time
# extract names and store as hash keys
while (my $line = <>) {
    # note the double underscore in the QIIME prefix "f__"
    if ($line =~ m/f__([A-Z][a-z]+);/) {
        $fam{$1}++;
    }
}

# after reading through the whole input file
# sort hash keys and print results, which should be unique
foreach my $family (sort keys %fam) {
    print $family, "\n";
}