Question

scripting problem to use ProtComp output to identify secretome proteins

0

Entering edit mode

9.6 years ago

rob234king ▴ 610

To identify secretome proteins, one of the end stages of the pipeline requires a scripting solution.

I have a file with protein sequence names (one per line) of the same order as the protein sequence analysis tables in another file, shown below. What I need to do is if the sequence name from the file2 has a number above 0 in either the LocDB or PotLocDB column excluding the "Extracellular" row to delete that sequence ID from file1 and if possible produce a file like file3. I then have a secretome protein id file list.

file3

14024    M            3.0          En           2.0

File1:

14024
13025

File2:

Seq name: 14024, Length=261
Nuclear 0.0     0.0     0.00    0.00    0.26
Plasma membrane 0.0     0.0     0.76    0.39    3.70
Extracellular   0.0     0.0     0.32    0.13    1.06
Cytoplasmic     0.0     0.0     0.03    0.10    0.61
Mitochondrial   3.0     0.0     1.08    0.43    2.90
Endoplasm. retic.       2.0     0.0     0.22    0.37    0.43
Peroxisomal     0.0     0.0     0.06    0.10    0.22
Lysosomal       0.0     0.0     0.08    0.11    0.00
Golgi   0.0     0.0     0.18    1.12    0.57
Vacuolar        0.0     0.0     0.28    0.39    0.25

Seq name: 13025, Length=247
Nuclear 0.0     0.0     1.63    0.00    0.77
Plasma membrane 0.0     0.0     0.06    0.48    0.93
Extracellular   0.0     0.0     0.19    0.35    0.87
Cytoplasmic     0.0     0.0     0.26    0.04    0.18
Mitochondrial   0.0     0.0     1.16    1.34    6.67
Endoplasm. retic.       0.0     0.0     0.09    0.27    0.00
Peroxisomal     0.0     0.0     0.01    0.04    0.07
Lysosomal       0.0     0.0     0.00    0.13    0.03
Golgi   0.0     0.0     0.04    0.21    0.48
Vacuolar        0.0     0.0     0.00    0.21    0.00

This is the beginning of some bad code but not finished yet which why trying on here for some people who can script better than me

#!/usr/bin/perl -w

use strict;

my $idsfile = "RRESID.txt";
my $seqfile = "ProtComp_RRES_ALL_correct2.txt";
my %ids  = ();

open FILE2, $idsfile;
while(<FILE2>) {
  open FILE, $seqfile;
  while(<FILE>) {
    for (my $i=0; $i <= 10; $i++) {
      chomp;
      my @col = split /\t/;
      if($col[1]>0){
        if(i=0){
        }elsif(i=1){
        }elsif(i=2){
        }elsif(i=3){
        }elsif(i=4){
        }elsif(i=5){
        }elsif(i=6){
        }elsif(i=7){
        }elsif(i=8){
        }elsif(i=9){
        }

      if($col[2]>0){
        if(i=0){
        }elsif(i=1){
        }elsif(i=2){
        }elsif(i=3){
        }elsif(i=4){
        }elsif(i=5){
        }elsif(i=6){
        }elsif(i=7){
        }elsif(i=8){
        }elsif(i=9){
        }
      }
      $counter1=$counter1+1;
    }
  }
  close FILE;
  $counter1=$counter1+3;
}
close FILE2;

unix python perl • 3.0k views

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by rob234king ▴ 610

1

Entering edit mode

Could you indicate what you have tried, please

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by russhh 5.7k

0

Entering edit mode

Updated what I have so far but I'm just beginning and would take me a week as I'm slow at this :( I'll update what I have Friday and hopefully made more progress with it. I should say I'm posting to see if there is a simple script, command line code that I am not aware. I could only think of two loops using perl.

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by rob234king ▴ 610

Ram · Accepted Answer · 2014-09-10

I'm not giving you the code, however, in perl (though I probably wouldn't do it this way anymore):

You don't really need the file1 mentioned in your original post (I noticed it has the handle FILE2 in your code, which is a bit confusing..). I'd probably reconsider using all those ifelses, you could split each dataline into the cellular space id and an array of numbers. Sum the numbers you're interested in and compare to 0.

Set up two variables $current_seq and $skip_flag. Determine what the current sequences name is from the first line of a given block in file2 and store it; if you hit a line that says 'I ought to disregard the current_seq id because it's present in the wrong part of the cell' set skip_flag to 1 - you can use this indicator to tell you that you don't need to consider anymore of the lines for the current_seq (so you don't print the same id multiple times) and that you don't need to ... Print the current_seq id when you get to a blank line, (and also reset current_seq and skip_flag).

Make sure you don't lose the final entry of your file