Problem, getting information only from the last row and not from all the rows
0
0
Entering edit mode
6.1 years ago

Hi everyone. I trying to measure the frequency of, how often in my data I have synonyms and not synonymous mutations of a certain frequency.

So here is my data(input):

EOG090X0003;96;2;70;G:0.985714;A:0.0142857;G;SYN
EOG090X0003;186;2;60;G:0.983333;A:0.0166667;G;SYN
EOG090X0003;342;2;86;C:0.976744;T:0.0232558;C;NON
  

I.m trying to get this output:

Frequency Syn Non

0.01       1       
0.015      1
0.02           1
0.025
  

Problem with my code is that I only get information about the last gene and not for all tre genes. I will be really greatful for all you help.

#!/usr/bin/perl -w

use strict;
use warnings;
use 5.010;

my $file = $ARGV[0];                                    

if ( ! open ( FILE_HANDLE , "<" , $file ) ) {      
    die "Error can't find the file: $file because $!";
}

if ( ! open ( OUT_HANDLE , ">" , $file . ".csv" ) ) { 
    die "Error can't make new file file: ${file}.csv because $!";
}

print OUT_HANDLE "FREQ\tSYN\tNON_SYN\n";

my $syn_cnt = 0;
my $non_cnt = 0;
my $stop_cnt = 0;
my $freq;
my $annotation;
my %hashSYN = ();
my %hashNON = ();
my @freqLst = (0.01,0.015,0.02,0.025);                  
my $key;

while ( my $line = <FILE_HANDLE> ) {
  chomp $line;

  my @words = split(/\t/, $line);

 for my $row (@words){ 
    if ($row =~ m /\?/){
        next;
        #EOG09371BTU;212;2;94;G:0.989362;A:0.0106383;G;NON
    }elsif($row =~ m /^\w+\;\d+\;\d+\;\d+\;\w+\:\d.+\;\w+\:(\d.+)\;\w+\;(\w+)$/){

           $freq= $1;
           $annotation = $2;

           for my $pos (@freqLst){
               if($freq > $pos){
                  $key = $pos;
               }else{
                    if($annotation eq 'SYN'){
                       $syn_cnt = $hashSYN{$key};
                       $syn_cnt++;       
                       $hashSYN{$key} = $syn_cnt;
                    }elsif($annotation eq 'NON'){
                           $non_cnt = $hashNON{$key};
                           $non_cnt++;       
                           $hashNON{$key} = $non_cnt;
                    }
               }
            }
     }  
 }

}
perl • 1.6k views
ADD COMMENT
1
Entering edit mode

Just a little re-write of your for loop

for my $pos (reverse @freqLst){
  if($freq > $pos){
    $key = $pos;
    $hash{$key}{$annotation}++;
    last;
  }
}

You only need a single hash to store your values, if you use a 2-level one you can based on the $annotation count it in the correct category. Also note that I go over the @FreqList in reverse order, so assign the counts to the correct bin (in your case also 0.025 > 0.01 so it will not be counted in the correct bin (== will actually be counted in each bin lower than $freq )

ADD REPLY
0
Entering edit mode

Thank you so much for all your help.

ADD REPLY
0
Entering edit mode

what are the column (field) names? which one is frequency column and there are no genes in your example data. Assuming that 6th column has frequency data, OP output frequency is truncated unequally (0.01 - 2 digits after point vs 0.015 - 3 digits after point). It is also not clear if you want frequency per gene or overall. Instead of frequency, it seems you are simply counting (1) frequency for each category. It is clear that you would like solution in perl, but there are easier CLI programs to count.

ADD REPLY
0
Entering edit mode

Hi, thank you for your time, so here is one of my genes:

Column 1 Col2 Col3 Col4 Col5 . Col6 Col7 Col8 EOG090X0003 ; 96 ; 2 ; 70 ; G:0.985714 ; A:0.0142857 ; G ; SYN

I want to get the frequency in column 6 (0.0142857 ) see if it is between frequency values 0.01-0.015 or 0.015-0.02, so in this case, it is between 0.01-0.015 so I want to add with one (+1) in that specific row with frequency 0.01 under the synonymous column.

ADD REPLY
1
Entering edit mode

Your ranges over lap (0.01-0.015, 0.015-0.02). In your code, there are four frequencies: 0.01,0.015,0.02,0.025. Synonymous classification has two ranges (0.01 and 0.015) and same is the case with non-syn and same is in your expected output in OP. This is confusing.

Example code below in awk (syn range=0.01-<0.015, non-syn range= >=0.015 and <=0.02):

$ sed 's/[";",":"]/\t/g' test.txt | awk '{if($8>=0.01 && $8 < 0.015) {print "0.01-<0.015","\t",1} else if($8>=0.015 && $8 <= 0.02) {print "0.015-0.02","\t","\t",1}'} | awk  'BEGIN {OFS="\t";print "Frequency", "Syn","Non-Syn"} {print $0}'

Frequency        Syn    Non-Syn
0.01-<0.015      1
0.015-0.02              1

input:

 $ cat test.txt 
EOG090X0003;96;2;70;G:0.985714;A:0.0142857;G;SYN
EOG090X0003;186;2;60;G:0.983333;A:0.0166667;G;SYN
EOG090X0003;342;2;86;C:0.976744;T:0.0232558;C;NON
ADD REPLY
0
Entering edit mode

Nice approach but this is OK for a single input gene but I assume she wants to process a whole list of genes (and count the sum over all of them)? So one will need to extend the awk to account for this.

ADD REPLY
1
Entering edit mode

that is true. If OP furnishes sample data that would be helpful.

ADD REPLY
0
Entering edit mode

Thank you for your time and your help.

ADD REPLY
0
Entering edit mode

Is this your complete code/script? I only see a single print statement (at the top of your script).

ADD REPLY
0
Entering edit mode

Yes it is because I'm not done yet since I don't know how to save my counts of nr synonymous and nonsynonymous for a specific frequency row. I'm thinking of saving it in a list and print the list maby?

ADD REPLY
0
Entering edit mode

Are you restricted in using perl or would be other language ok as well?

ADD REPLY
0
Entering edit mode

I would prefer Perl since I'm trying to learn it and I'm new in this field but I can test other languages.

ADD REPLY
0
Entering edit mode

I'm all for PERL since I learned that few decades ago. however as you're 'new' to the field I might suggest (consider) starting with python rather than perl . This as a side not though ;)

ADD REPLY
0
Entering edit mode

Then how do you conclude you only have the data for the last row? Perl-debugger ?

ADD REPLY
0
Entering edit mode

I'm running my script on the terminal and I get the output on the screen, I print the counts and frequency but I get only for the last row instead of all 3 rows.

ADD REPLY
0
Entering edit mode

That I'm puzzled about ... to screen or to file you should do some print statement at the end of your script (after processing the lines) to see anything, no?

ADD REPLY

Login before adding your answer.

Traffic: 1394 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6