Question

eliminate redundant lines in perl


8.2 years ago

cabraham03 ▴ 30

Hi, I'm trying to write a script to extract unique taxonomic assignments from a text file produced by a QIIME analysis:

results.txt file:

k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Nocardiaceae;Other    4.48159186143e-07
k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Nocardiaceae;g__  1.34447755843e-06
k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Nocardiaceae;g__Rhodococcus   4.48159186143e-07
k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Nocardioidaceae;g__   6.72238779214e-06
k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Nocardioidaceae;g__Kribbella  4.48159186143e-07
k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Propionibacteriaceae;g__  2.24079593071e-06
k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Pseudonocardiaceae;g__Pseudonocardia 2.24079593071e-06
k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Pseudonocardiaceae;g__Pseudonocardia 2.24079593071e-06

What I want is to extract the family or genus assignments (for example, f__Nocardioidaceae) as a non-redundant list. In this case the result would be:

for family :

Nocardiaceae
Nocardioidaceae
Propionibacteriaceae
Pseudonocardiaceae

I have been using this code:

#!/usr/bin/perl -w
use strict;

while ( <>) {
    $line = $_;
    chomp($line);

    if ($line=~ m/^#/g) {
        next;
    }

    elsif ($family) {
        my @fam= ($line=~ m/f__[\W]?(.*)[\W]?;g/g);

        foreach(@fam){
            if ($_=~ m/^$/g) {
                next;
            }

            else {
                my @uniq_list = uniq(@fam);
                print  "$_\n";
            }
        }
    }

    else {
        print "ERROR\n";
        exit;
    }
}

# Second option using a subroutine:

sub uniq {
    my %seen;
    grep !$seen{$_}++, @_;
}

#and then......

else {
    my @uniq_list = uniq(@fam);
    print  OUTFILE "$_\n";
}
}

Neither of them works!

Thanks so much!

perl qiime 16s metagenomics • 2.1k views

ADD COMMENT • link updated 8.2 years ago by mastal511 ★ 2.1k • written 8.2 years ago by cabraham03 ▴ 30


Hi, please use code formatting for code. Please be specific: what is it that doesn't work? An error message, no output, ...? Also, if possible, try to reduce the posted script to the essentials: remove getopt and the opening of input and output files. In your case, a script that reads from stdin and prints to stdout is totally sufficient.

Furthermore, do you need the list in any specific order?

ADD REPLY • link 8.2 years ago by Michael 54k

score 1 · Answer 1 · 2016-03-05

Here are a couple of hints.

In Perl, hash keys have to be unique string identifiers. If you create a hash and use the extracted family terms as its keys, you will end up with the non-redundant list of family terms found in your input file.

In Perl regular expressions, \W matches a non-word character, so I think what you really want is \w.
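
A minimal sketch of both hints together, assuming the input lines look like the ones in the question. The sub name `unique_families` and the shortened sample lines are made up for illustration:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns the distinct family names found in the given lines.
sub unique_families {
    my @lines = @_;
    my %families;                      # hash keys are unique by construction
    for my $line (@lines) {
        next if $line =~ /^#/;         # skip comment lines
        # \w+ matches word characters (letters, digits, underscore);
        # \W would match the opposite, i.e. non-word characters
        if ($line =~ /f__(\w+);/) {
            $families{$1} = 1;         # storing the same key again keeps one entry
        }
    }
    return sort keys %families;        # sorted, because hash order is not defined
}

# Made-up, shortened sample lines (assumption: same layout as results.txt)
my @sample = (
    "k__Bacteria;o__Actinomycetales;f__Nocardiaceae;g__\t1.3e-06",
    "k__Bacteria;o__Actinomycetales;f__Nocardiaceae;g__Rhodococcus\t4.4e-07",
    "k__Bacteria;o__Actinomycetales;f__Nocardioidaceae;g__Kribbella\t4.4e-07",
);
print "$_\n" for unique_families(@sample);
```

Each family name is printed once, however many input lines mention it.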

score 1 · Answer 2 · 2016-03-05

Here is an _untested_, reduced Perl script derived from yours that should do what it is supposed to:

#!/usr/bin/env perl
use strict;
use warnings;

my %uniq = ();
while (<>) {
  next if (/^#/);
  # your regex was more or less OK, except that it wouldn't parse the first line of input
  # are you expecting non-word characters (\W) at the family boundary?
  $uniq{$1} = $1 if /f__\W?(.+)\W?;/;
}
print join("\n", keys %uniq), "\n";

Ram · Answer 3 · 2016-03-06

Sorry, I have only used this site a few times and I don't know how to format the code properly. What I want, and still don't know how to make work, is the following:

What I get with the script is a list like the one below (the code works well up to this point):

Hyphomicrobiaceae
Hyphomicrobiaceae
Methylobacteriaceae
Methylobacteriaceae
Methylocystaceae
Methylocystaceae
Phyllobacteriaceae
Phyllobacteriaceae
Rhizobiaceae
Rhizobiaceae
Rhodobiaceae
Xanthobacteraceae
Xanthobacteraceae
Hyphomonadaceae
Hyphomonadaceae
Rhodobacteraceae
Rhodobacteraceae
Rhodobacteraceae
Acetobacteraceae
Acetobacteraceae
Rhodospirillaceae
Rhodospirillaceae

But what I want is to eliminate the redundant entries and keep just one of each:

Hyphomicrobiaceae
Methylobacteriaceae
Methylocystaceae
Xanthobacteraceae
Hyphomonadaceae
Rhodospirillaceae
Acetobacteraceae

I tried what you advised, but it still prints all of them instead of one of each.

This is the full script that I'm using:

Thanks so much to all of you!
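
Judging from the script in the question, one likely cause is that `uniq` is called inside the loop on the matches from a single line, so it never sees the names from other lines. Below is a hedged sketch that collects the names across all lines first and removes duplicates once at the end, keeping first-seen order. `collect_families` is a hypothetical name, the `[^;\s]+` pattern is an assumption about the input, and the sample lines are made up:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Same idea as the uniq sub in the question: keep the first occurrence of each item.
sub uniq {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Hypothetical helper: gather every family name from all lines, duplicates included.
sub collect_families {
    my @names;
    for my $line (@_) {
        next if $line =~ /^#/;                        # skip comment lines
        push @names, $1 if $line =~ /f__([^;\s]+)/;   # skips empty "f__;" entries
    }
    return @names;
}

# Made-up input standing in for the QIIME table.
my @lines = map { "k__Bacteria;f__$_;g__\t1e-06" }
    qw(Hyphomicrobiaceae Hyphomicrobiaceae Rhodobiaceae Hyphomicrobiaceae);

# Deduplicate once, after the whole input has been read.
print "$_\n" for uniq(collect_families(@lines));
```

Because `grep` keeps items in their original order, each family appears once, in the order it was first seen. Use `sort` on the result if alphabetical order is preferred.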

score 0 · Answer 4 · 2016-03-06


8.2 years ago

James Ashmore ★ 3.4k

If you felt like trying a simpler, albeit limited, method, you could use the built-in Unix commands for this task:

cut -d ";" -f 5 test.txt | sort | uniq | sed 's/f__//g'
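
For comparison, a rough Perl sketch of what that pipeline does, which also shows why it is limited: it assumes the family is always the fifth `;`-separated field. The sub name `families_by_field` is hypothetical, the sample lines use made-up placeholder ranks, and the handling of short lines is a simplification rather than an exact copy of `cut`:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper, roughly mirroring:
#   cut -d ";" -f 5 | sort | uniq | sed 's/f__//g'
sub families_by_field {
    my %seen;
    for my $line (@_) {
        my @fields = split /;/, $line;          # like cut -d ";"
        next unless defined $fields[4];         # fewer than five fields: skip
        (my $name = $fields[4]) =~ s/f__//g;    # like sed 's/f__//g'
        $seen{$name} = 1;                       # duplicates collapse here (uniq)
    }
    return sort keys %seen;                     # like sort
}

# Made-up sample lines with placeholder higher ranks.
my @sample = (
    "k__Bacteria;p__X;c__X;o__X;f__Nocardiaceae;g__\t1.3e-06",
    "k__Bacteria;p__X;c__X;o__X;f__Nocardiaceae;g__Rhodococcus\t4.4e-07",
    "k__Bacteria;p__X;c__X;o__X;f__Propionibacteriaceae;g__\t2.2e-06",
);
print "$_\n" for families_by_field(@sample);
```

If a line has a different number of ranks before the family, the fifth field is no longer the family, which is where the regex-based answers are more robust.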


Ram · Answer 5 · 2016-03-06


8.2 years ago

mastal511 ★ 2.1k

# declare hash
my %fam;

# read through input file one line at a time
# extract names and store as hash keys
while (my $line = <>) {
    # note the double underscore in "f__", as in the input file
    if ($line =~ m/f__([A-Z][a-z]+);/) {
        $fam{$1}++;
    }
}

# after reading through the whole input file
# sort hash keys and print results, which should be unique
foreach my $family (sort keys %fam) {
    print $family, "\n";
}

ADD COMMENT • link updated 8.2 years ago by Ram 43k • written 8.2 years ago by mastal511 ★ 2.1k